Staff Software Engineer, ML Infrastructure

Job Overview

Location: San Francisco

Job Type: Full-time

Category: Machine Learning Engineer

Date Posted: February 27, 2026

Full Job Description

📋 Description

  • As a Staff Software Engineer on the ML Infrastructure team at Decagon, you will play a pivotal role in shaping the future of our cutting-edge conversational AI platform. You will be instrumental in building and owning the core systems that power every stage of Decagon's model lifecycle, from initial training to seamless inference in production environments. This is a unique opportunity to work at the forefront of machine learning, translating complex, state-of-the-art ML techniques into robust, scalable, and reliable systems that directly impact the customer experience for industry-defining enterprises.
  • Your primary focus will be on designing and constructing distributed training platforms capable of handling large-scale LLM and multimodal fine-tuning and post-training. This involves architecting systems that can efficiently leverage multi-node GPU clusters, ensuring fault tolerance, and implementing sophisticated optimization strategies to maximize training throughput and minimize resource consumption. You will be responsible for integrating advanced training algorithms into our production pipelines, ensuring that our models are trained effectively and efficiently.
  • A significant aspect of this role involves owning the inference architecture and multi-provider routing. You will design and implement systems that manage inference requests across various providers, ensuring high availability through robust failover mechanisms and optimizing for performance and cost-efficiency. This includes researching and implementing cutting-edge inference optimizations such as quantization, speculative decoding, and advanced batching strategies to minimize latency and maximize throughput.
  • You will lead critical, multi-quarter initiatives aimed at significantly improving both latency and cost efficiency across the entire training and serving stack. This requires a strategic mindset, the ability to break down complex problems into manageable steps, and the skill to drive these initiatives to successful completion. Your work will directly contribute to Decagon's ability to deliver faster, more responsive AI experiences to our customers.
  • Furthermore, you will be responsible for building and enhancing our evaluation and experimentation infrastructure. This infrastructure is crucial for enabling our Research and Product teams to iterate rapidly and reliably on new models and features. You will create frameworks and tools that streamline the process of evaluating model performance, conducting A/B tests, and gathering insights, thereby accelerating the pace of innovation.
  • Beyond the technical contributions, this role offers the opportunity to drive the technical direction of the ML infrastructure team. You will mentor junior engineers, share your expertise, and establish best practices for ML infrastructure development and operations. Your leadership will be key in fostering a culture of technical rigor, pragmatic decision-making, and a commitment to building systems that are not only powerful but also a pleasure for other teams to use.
  • Decagon is an in-office company, fostering a highly collaborative environment driven by a shared commitment to excellence and velocity. Our core values – Just Get It Done, Invent What Customers Want, Winner’s Mindset, and The Polymath Principle – are deeply embedded in our culture and guide our approach to problem-solving and innovation. You will be joining a team that thrives on tackling challenging problems, pushing the boundaries of what's possible in AI, and delivering exceptional value to our clients.
  • This role is ideal for an engineer who possesses deep technical expertise, a proven ability to lead complex projects from conception to deployment, and a passion for building foundational ML systems that enable rapid product development and deliver outstanding customer experiences. You will be a key player in scaling our ML capabilities to meet the growing demands of our enterprise clients and solidifying Decagon's position as the leading conversational AI platform.

🎯 Requirements

  • 8+ years of experience building and scaling ML infrastructure or production systems.
  • Deep expertise in distributed training methodologies, including multi-node GPU clusters, fault tolerance, and performance optimization.
  • Strong understanding of LLM inference architectures, latency optimization techniques, and multi-provider serving strategies.
  • Proficiency in Python and modern machine learning frameworks such as PyTorch, JAX, or TensorFlow.
  • Demonstrated track record of leading complex, multi-quarter technical initiatives from inception to completion.

🏖️ Benefits

  • Comprehensive medical, dental, and vision insurance plans.
  • A flexible 'take what you need' vacation policy to support work-life balance.
  • Daily catered lunches, dinners, and snacks provided in the office to fuel your productivity and well-being.

Skills & Technologies

Python
Node.js
TensorFlow
PyTorch
DevOps
Senior
Onsite
$300k-430k

About Decagon

Decagon builds a conversational AI platform that enterprises use to deliver customer experiences. It is an in-office company, and its stated core values are Just Get It Done, Invent What Customers Want, Winner's Mindset, and The Polymath Principle.
