MLOps Engineer - aiXplain
- Full-Time
- Remote
Job Description:
Job Summary
We are looking for a skilled MLOps Engineer to design, build, and maintain scalable machine learning infrastructure and cloud-based platforms. The ideal candidate will be responsible for architecting DevOps/MLOps environments, automating infrastructure and deployment pipelines, and enabling reliable production-grade machine learning services.
This role involves working closely with engineering, data, and cloud teams to ensure scalable, secure, and high-performance ML systems.
Key Responsibilities
- Architect and maintain DevOps/MLOps infrastructure
- Implement Infrastructure as Code (IaC) solutions
- Maintain codebase integrity, including validation rules and branch/merge processes
- Develop automation solutions for infrastructure, testing, and deployments
- Promote DevOps culture, best practices, and tooling across teams
- Participate in effort estimation and Agile project planning
- Identify and manage technical risks and issues
- Stay up to date with cloud technologies and vendor solutions (AWS, Azure, GCP)
- Design and build scalable machine learning services and data platforms
- Monitor system performance using metrics, benchmarks, and monitoring tools
- Support production deployment, scaling, and monitoring of ML models
Required Skills & Qualifications
- Bachelor's degree in Computer Science, Computer Engineering, or a related technical field (or equivalent experience)
- 3–5 years of experience in DevOps, MLOps, or Cloud Engineering
- Strong experience with cloud infrastructure and automation
- Experience deploying and managing production-grade ML systems
- Strong Linux knowledge
- Strong Python experience for services, automation, and pipelines
- Experience building CI/CD pipelines
Preferred / Nice to Have
Model Deployment & ML Infrastructure
- Experience hosting and serving LLMs (vLLM knowledge is required)
- Experience working with ASR, transcription, or streaming models
- Experience scaling ML models in production environments
- Understanding of GPU optimization and resource utilization
- Experience serving multiple models simultaneously
Containerization & Orchestration
- Kubernetes (EKS or on-prem environments)
- Docker and Docker Compose
- Helm
- Flux CD
- KEDA, KServe, and Knative for autoscaling and serverless ML serving
- Knowledge of network security and load balancing in Kubernetes
Cloud & Infrastructure
- AWS services (S3, EKS, Load Balancers, general AWS infrastructure)
- Terraform for infrastructure and cluster management
- GPU node configuration (NVIDIA drivers, Fabric Manager, GPU container exposure)
- Certificate management
Data & Messaging
- Kafka for messaging and streaming
- Redis
- SQL databases
Monitoring & Observability
- Datadog or similar monitoring tools
- Benchmarking and performance profiling for ML models
DevOps & Workflow Tools
- GitHub Actions or similar CI/CD tools
- Workflow orchestration tools such as Flyte
Soft Skills
- Strong communication skills
- Strong analytical and problem-solving skills
- Ability to work independently and in cross-functional teams
- Strong time management and organizational skills
Languages
- English (Required)
- Arabic (Preferred)