Full Job Description
📋 Description
• Lead the design, build, and continuous improvement of Nexos’ cloud-native infrastructure, engineered specifically for AI-heavy workloads that power our distributed, production-grade machine-learning platform.
• Own end-to-end Kubernetes lifecycle management: provision, configure, upgrade, and harden clusters across multiple environments (dev, staging, prod) while guaranteeing >99.9% availability and sub-second scaling responsiveness.
• Architect and maintain bullet-proof CI/CD pipelines using GitLab CI and ArgoCD, enabling multiple daily releases of micro-services, models, and data pipelines with zero-downtime blue/green and canary strategies.
• Automate every layer of the stack with Infrastructure-as-Code: author modular, version-controlled Terraform modules and Ansible playbooks that provision VPCs, IAM roles, storage, GPU node pools, and security groups in minutes, not hours.
• Instrument and operate a best-in-class observability stack—Prometheus, Zabbix, Grafana, and the Elastic Stack—to surface real-time metrics, traces, and logs; define SLOs/SLIs and own the incident-response playbook to keep MTTR <15 min.
• Collaborate daily with AI researchers, MLOps engineers, and product squads to translate experimental prototypes into production-ready services, ensuring GPU scheduling, autoscaling, and cost optimisation are baked in from day one.
• Drive cloud-cost governance: implement resource tagging, right-sizing policies, and spot-instance strategies that cut infrastructure spend without compromising performance or reliability.
• Champion security by design—enforce network segmentation, secrets management, image scanning, and policy-as-code (OPA/Gatekeeper) so that compliance audits become a non-event.
• Document tribal knowledge into living runbooks, architectural decision records (ADRs), and onboarding guides that empower every engineer to ship safely and autonomously.
• Contribute to strategic PoCs for next-generation AI infrastructure (think serverless GPUs, confidential computing, or edge federated learning) and turn the most promising ideas into production reality.
• Mentor junior DevOps engineers through pair-programming, design reviews, and brown-bag sessions, cultivating a culture of continuous learning and relentless automation.
• Participate in an agile, remote-first environment with daily stand-ups, fortnightly retros, and quarterly OKR planning—your voice directly shapes product and technical roadmaps.
Skills & Technologies
AWS
GCP
Docker
Kubernetes
Terraform
DevOps
Senior
Remote