
Job Overview
Location: San Francisco
Job Type: Full-time
Category: Machine Learning Engineer
Date Posted: March 28, 2026
Full Job Description
📋 Description
- As a Senior Software Engineer, ML Infrastructure at Decagon, you will play a pivotal role in shaping the backbone of the company’s AI capabilities by designing and scaling the systems that enable cutting-edge model training and reliable, efficient inference across diverse customer environments. Your work will directly impact how quickly Decagon’s research translates into production-grade AI agents that power personalized customer experiences for industry leaders like Avis Budget Group, Block’s Cash App and Square, Chime, Oura Health, and Hunter Douglas.
- Day to day, you will design and build distributed training platforms for large language model (LLM) and multimodal fine-tuning and post-training at scale, ensuring fault tolerance and optimal resource utilization across multi-node GPU clusters. You will integrate state-of-the-art training algorithms into production pipelines, enabling faster iteration and improved model quality. You will own the inference architecture and multi-provider routing layer, implementing failover mechanisms, latency optimization strategies, and cost-efficient serving architectures that balance performance with operational excellence. You will lead initiatives to improve latency and cost efficiency across the entire training and serving stack, conducting performance analysis and driving architectural improvements. You will build evaluation and experimentation infrastructure that allows research and product teams to rapidly test, validate, and deploy model improvements with confidence. Additionally, you will drive technical direction for the ML infrastructure team, mentor junior engineers, establish best practices, and collaborate closely with Research, Infrastructure, and Product teams to align systems with business goals.
- The ML Infrastructure team at Decagon operates at the critical intersection of research and production, responsible for the full model lifecycle, from training platforms and experimentation frameworks to the routing layer that manages inference across multiple cloud and on-premise providers. The team is known for its technical rigor, pragmatic problem-solving, and commitment to building systems that are not only scalable and reliable but also intuitive and enjoyable for other engineers to use. Decagon fosters an in-office culture grounded in values like 'Just Get It Done,' 'Invent What Customers Want,' 'Winner’s Mindset,' and 'The Polymath Principle,' promoting high velocity, ownership, and cross-disciplinary learning.
- In this role, you will have the opportunity to deepen your expertise in large-scale ML systems, influence the long-term architecture of Decagon’s AI stack, and gain experience leading complex, multi-quarter technical initiatives from conception to deployment. You will work alongside world-class researchers and engineers backed by top-tier investors including a16z, Accel, Bain Capital Ventures, Coatue, and Index Ventures, accelerating your growth in a high-impact environment where your contributions directly shape customer-facing AI innovation.
🎯 Requirements
- 6+ years of experience building ML infrastructure or production systems at scale, with a focus on reliability, scalability, and performance
- Deep expertise in distributed training systems, including multi-node GPU cluster management, fault tolerance mechanisms, and training optimization techniques
- Strong understanding of LLM inference architectures, including latency optimization strategies, provider-specific tradeoffs (e.g., AWS, GCP, Azure, specialized AI chips), and serving patterns like dynamic batching and model quantization
- Proven ability to lead complex, multi-quarter technical projects, drive technical direction, and mentor engineers in fast-paced, high-growth environments
🏖️ Benefits
- Comprehensive medical, dental, and vision insurance coverage
- Flexible 'take what you need' vacation policy to support work-life balance and personal well-being
- Daily catered lunches, dinners, and snacks provided in the office to foster collaboration and sustain energy throughout the workday
About Decagon
Decagon builds AI agents that power personalized customer experiences for companies including Avis Budget Group, Block’s Cash App and Square, Chime, Oura Health, and Hunter Douglas. Its investors include a16z, Accel, Bain Capital Ventures, Coatue, and Index Ventures.
