MLOps Engineer - aiXplain

  • Full-Time
  • Remote

Job Description:

Job Summary

We are looking for a skilled MLOps Engineer to design, build, and maintain scalable machine learning infrastructure and cloud-based platforms. The ideal candidate will be responsible for architecting DevOps/MLOps environments, automating infrastructure and deployment pipelines, and enabling reliable production-grade machine learning services.

This role involves working closely with engineering, data, and cloud teams to ensure scalable, secure, and high-performance ML systems.

Key Responsibilities

  • Architect and maintain DevOps / MLOps infrastructure

  • Implement Infrastructure as Code (IaC) solutions

  • Maintain codebase integrity, including validation rules and branch/merge processes

  • Develop automation solutions for infrastructure, testing, and deployments

  • Promote DevOps culture, best practices, and tooling across teams

  • Participate in effort estimation and Agile project planning

  • Identify and manage technical risks and issues

  • Stay up to date with cloud technologies and vendor solutions (AWS, Azure, GCP)

  • Design and build scalable machine learning services and data platforms

  • Monitor system performance using metrics, benchmarks, and monitoring tools

  • Support production deployment, scaling, and monitoring of ML models

Required Skills & Qualifications

  • Bachelor's degree in Computer Science, Computer Engineering, or a related technical field (or equivalent experience)

  • 3–5 years of experience in DevOps, MLOps, or Cloud Engineering

  • Strong experience with cloud infrastructure and automation

  • Experience deploying and managing production-grade ML systems

  • Strong Linux knowledge

  • Strong Python experience for services, automation, and pipelines

  • Experience building CI/CD pipelines

Preferred / Nice to Have

Model Deployment & ML Infrastructure

  • Experience hosting and serving LLMs (vLLM knowledge is required)

  • Experience working with ASR, transcription, or streaming models

  • Experience scaling ML models in production environments

  • Understanding of GPU optimization and resource utilization

  • Experience serving multiple models simultaneously

Containerization & Orchestration

  • Kubernetes (EKS or on-prem environments)

  • Docker and Docker Compose

  • Helm

  • Flux CD

  • KEDA, KServe, Knative for autoscaling and serverless ML serving

  • Network security and load balancing knowledge in Kubernetes

Cloud & Infrastructure

  • AWS services (S3, EKS, Load Balancers, general AWS infrastructure)

  • Terraform for infrastructure and cluster management

  • GPU node configuration (NVIDIA drivers, Fabric Manager, GPU container exposure)

  • Certificate management

Data & Messaging

  • Kafka for messaging and streaming

  • Redis

  • SQL databases

Monitoring & Observability

  • Datadog or similar monitoring tools

  • Benchmarking and performance profiling for ML models

DevOps & Workflow Tools

  • GitHub Actions or similar CI/CD tools

  • Workflow orchestration tools such as Flyte

Soft Skills

  • Strong communication skills

  • Strong analytical and problem-solving skills

  • Ability to work independently and in cross-functional teams

  • Strong time management and organizational skills

Languages

  • English (Required)

  • Arabic (Preferred)