This job has expired

This position was posted on December 3, 2025 and is likely no longer accepting applications.

ML Infrastructure Engineer (ML Platform)

Job Overview

Location: Cambridge
Job Type: Full-time
Category: Software Engineering
Date Posted: December 3, 2025

Full Job Description

📋 Description

  • Architect the beating heart of CuspAI’s research engine. You will design, build and continuously evolve a cloud-native ML platform on Google Cloud Platform and Kubernetes that lets our world-leading AI chemists, physicists and materials scientists spin up massive distributed training jobs, track thousands of experiments and ship models to production, all without ever touching a YAML file.
  • Own the full MLOps lifecycle end-to-end. From source-controlled infrastructure-as-code (Terraform, Helm, Kapitan) through CI/CD, model registries, experiment tracking (MLflow, Weights & Biases or similar) and automated deployment, you will be accountable for how code becomes a running, monitored, cost-optimised service.
  • Enable planet-scale distributed training. You will provision and tune multi-node accelerator clusters (A100 and H100 GPUs, TPU pods) with smart checkpointing, elastic resource scaling and fault-tolerant data pipelines so that a 10-billion-parameter model can train overnight and resume gracefully if a node fails.
  • Guarantee 99.9% uptime for the platform that powers breakthrough discoveries. Build real-time observability (Prometheus, Grafana, Alertmanager), self-healing automation and on-call playbooks so researchers sleep well while GPUs churn through exaflops of computation.
  • Optimise every dollar of cloud spend. Implement quota management, spot-instance orchestration and workload-aware bin-packing so that we can run 30% more experiments without increasing budget, freeing cash for even bigger clusters.
  • Craft a delightful developer experience. Create opinionated SDKs, CLI tools and JupyterHub templates that abstract Kubernetes complexity, letting a chemist type `cuspai train --dataset water-filtration --gpus 64` and watch the magic happen.
  • Champion GitOps and reproducibility. Every environment, from a researcher’s laptop to production, is declared in Git, reviewed like code and rolled out automatically, ensuring that yesterday’s breakthrough can be reproduced next year.
  • Collaborate across disciplines daily. Sit shoulder-to-shoulder with ML researchers debugging convergence issues, pair with chemists optimising molecular featurisation pipelines, and sync with software engineers integrating models into customer-facing APIs.
  • Shape the strategic roadmap. As the first ML Infrastructure hire, you will define standards, pick the next tools and mentor future teammates, leaving a lasting architectural imprint on a platform that could accelerate the discovery of carbon-capture membranes, room-temperature superconductors or next-gen batteries.
  • Travel and connect. Expect quarterly trips to our London, Amsterdam or Berlin hubs to run workshops, share best practices and keep the global team aligned.

Skills & Technologies

Python
Go
AWS
GCP
Kubernetes
DevOps
Onsite
Remote

About Cusp AI Ltd

Cusp AI is a Cambridge-based startup applying generative artificial intelligence and deep learning to the discovery and design of next-generation materials for carbon capture, hydrogen storage and other clean-energy applications. The company combines physics-informed models, molecular simulation and high-throughput cloud computing to predict and optimize porous frameworks such as metal-organic frameworks and covalent organic frameworks, dramatically reducing the time and cost needed to identify candidates for scalable carbon dioxide removal. Founded in 2023 by ex-Google researchers, Cusp AI collaborates with national laboratories and industrial partners to translate AI-generated molecules into pilot-scale demonstrations.
