MLOps Engineer - aiXplain
- Full-Time
- Remote
Job Description:
Job Summary
We are looking for a skilled MLOps Engineer to design, build, and maintain scalable machine learning infrastructure and cloud-based platforms. The ideal candidate will be responsible for architecting DevOps/MLOps environments, automating infrastructure and deployment pipelines, and enabling reliable production-grade machine learning services.
This role involves working closely with engineering, data, and cloud teams to ensure scalable, secure, and high-performance ML systems.
Key Responsibilities
- Architect and maintain DevOps/MLOps infrastructure
- Implement Infrastructure as Code (IaC) solutions
- Maintain codebase integrity, including validation rules and branch/merge processes
- Develop automation solutions for infrastructure, testing, and deployments
- Promote DevOps culture, best practices, and tooling across teams
- Participate in effort estimation and Agile project planning
- Identify and manage technical risks and issues
- Stay up to date with cloud technologies and vendor solutions (AWS, Azure, GCP)
- Design and build scalable machine learning services and data platforms
- Monitor system performance using metrics, benchmarks, and monitoring tools
- Support production deployment, scaling, and monitoring of ML models
Required Skills & Qualifications
- Bachelor's degree in Computer Science, Computer Engineering, or a related technical field (or equivalent experience)
- 3–5 years of experience in DevOps, MLOps, or Cloud Engineering
- Strong experience with cloud infrastructure and automation
- Experience deploying and managing production-grade ML systems
- Strong Linux knowledge
- Strong Python experience for services, automation, and pipelines
- Experience building CI/CD pipelines
Preferred / Nice to Have
Model Deployment & ML Infrastructure
- Experience hosting and serving LLMs (vLLM knowledge is required)
- Experience working with ASR, transcription, or streaming models
- Experience scaling ML models in production environments
- Understanding of GPU optimization and resource utilization
- Experience serving multiple models simultaneously
Containerization & Orchestration
- Kubernetes (EKS or on-prem environments)
- Docker and Docker Compose
- Helm
- Flux CD
- KEDA, KServe, and Knative for autoscaling and serverless ML serving
- Knowledge of network security and load balancing in Kubernetes
Cloud & Infrastructure
- AWS services (S3, EKS, Load Balancers, general AWS infrastructure)
- Terraform for infrastructure and cluster management
- GPU node configuration (NVIDIA drivers, Fabric Manager, GPU container exposure)
- Certificate management
Data & Messaging
- Kafka for messaging and streaming
- Redis
- SQL databases
Monitoring & Observability
- Datadog or similar monitoring tools
- Benchmarking and performance profiling for ML models
DevOps & Workflow Tools
- GitHub Actions or similar CI/CD tools
- Workflow orchestration tools such as Flyte
Soft Skills
- Strong communication skills
- Strong analytical and problem-solving skills
- Ability to work independently and in cross-functional teams
- Strong time management and organizational skills
Languages
- English (Required)
- Arabic (Preferred)