Senior Software Engineer, ML Infrastructure

Job Overview

Location

San Francisco

Job Type

Full-time

Category

Machine Learning Engineer

Date Posted

March 28, 2026

Full Job Description

📋 Description

  • As a Senior Software Engineer, ML Infrastructure at Decagon, you will design and scale the systems behind the company’s model training and its reliable, efficient inference across diverse customer environments. Your work will directly affect how quickly Decagon’s research becomes production-grade AI agents that power personalized customer experiences for industry leaders like Avis Budget Group, Block’s Cash App and Square, Chime, Oura Health, and Hunter Douglas.
  • Day to day, you will design and build distributed training platforms for large language model (LLM) and multimodal fine-tuning and post-training at scale, ensuring fault tolerance and optimal resource utilization across multi-node GPU clusters. You will integrate state-of-the-art training algorithms into production pipelines, enabling faster iteration and improved model quality. You will own the inference architecture and multi-provider routing layer, implementing failover mechanisms, latency optimization strategies, and cost-efficient serving architectures that balance performance with operational excellence. You will lead initiatives to improve latency and cost efficiency across the entire training and serving stack, conducting performance analysis and driving architectural improvements. You will build evaluation and experimentation infrastructure that allows research and product teams to rapidly test, validate, and deploy model improvements with confidence. Additionally, you will drive technical direction for the ML infrastructure team, mentor junior engineers, establish best practices, and collaborate closely with Research, Infrastructure, and Product teams to align systems with business goals.
  • The ML Infrastructure team at Decagon operates at the intersection of research and production and is responsible for the full model lifecycle, from training platforms and experimentation frameworks to the routing layer that manages inference across multiple cloud and on-premises providers. The team is known for its technical rigor, pragmatic problem-solving, and commitment to building systems that are not only scalable and reliable but also intuitive and enjoyable for other engineers to use. Decagon fosters an in-office culture grounded in values like 'Just Get It Done,' 'Invent What Customers Want,' 'Winner’s Mindset,' and 'The Polymath Principle,' promoting high velocity, ownership, and cross-disciplinary learning.
  • In this role, you will have the opportunity to deepen your expertise in large-scale ML systems, influence the long-term architecture of Decagon’s AI stack, and gain experience leading complex, multi-quarter technical initiatives from conception to deployment. You will work alongside world-class researchers and engineers backed by top-tier investors including a16z, Accel, Bain Capital Ventures, Coatue, and Index Ventures, accelerating your growth in a high-impact environment where your contributions directly shape customer-facing AI innovation.

🎯 Requirements

  • 6+ years of experience building ML infrastructure or production systems at scale, with a focus on reliability, scalability, and performance
  • Deep expertise in distributed training systems, including multi-node GPU cluster management, fault tolerance mechanisms, and training optimization techniques
  • Strong understanding of LLM inference architectures, including latency optimization strategies, provider-specific tradeoffs (e.g., AWS, GCP, Azure, specialized AI chips), and serving patterns like dynamic batching and model quantization
  • Proven ability to lead complex, multi-quarter technical projects, drive technical direction, and mentor engineers in fast-paced, high-growth environments

🏖️ Benefits

  • Comprehensive medical, dental, and vision insurance coverage
  • Flexible 'take what you need' vacation policy to support work-life balance and personal well-being
  • Daily catered lunches, dinners, and snacks provided in the office to foster collaboration and sustain energy throughout the workday

Skills & Technologies

Node.js
DevOps
Senior
Onsite
$250k-330k

About Decagon

Decagon builds AI agents that power personalized customer experiences for companies such as Avis Budget Group, Block’s Cash App and Square, Chime, Oura Health, and Hunter Douglas. The company is backed by investors including a16z, Accel, Bain Capital Ventures, Coatue, and Index Ventures.
