This job has expired

This position was posted on February 26, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

Lead Infrastructure and Reliability Engineer (Systems Scale)

Luma Labs, Inc.

Job Overview

Location

Palo Alto, CA

Job Type

Full-time

Full Job Description

• Luma AI is at the forefront of a new era of intelligence, developing systems capable of understanding and generating content across video, images, audio, and language. This groundbreaking work in multimodal Artificial General Intelligence (AGI) presents not only a significant modeling challenge but also an immense infrastructure challenge, pushing the boundaries of what current hardware, software, and organizational structures can support.
• At Luma, we operate and manage rapidly scaling fleets of 10,000+ GPUs. We push utilization, throughput, and reliability hard enough that conventional solutions frequently break. The researchers at Luma depend on this infrastructure to advance the frontiers of AI, while our customers rely on it to power their creative endeavors.
• While many companies use accelerators, very few have their infrastructure teams working directly alongside the teams inventing the models that redefine what those accelerators can do. At Luma, this close integration means that improvements in scheduling, efficiency, and reliability have an immediate and tangible impact: faster research iteration cycles and entirely new product capabilities.
• We are still in the early stages of our journey, and the established playbook for this domain is actively being written. In this dynamic environment, a single exceptional engineer has the potential to fundamentally reshape how the entire company operates.
• **Your Role and Responsibilities:**
• As a Lead Infrastructure and Reliability Engineer (Systems at Scale), you will be a pivotal technical authority and an organizational force multiplier. You will be instrumental in defining the strategic direction for our infrastructure and reliability efforts, and in attracting and nurturing top engineering talent.
• **Reliability of the Frontier:** You will architect and operate large, heterogeneous GPU environments that are subjected to extreme and demanding workloads. A key focus will be on improving utilization and performance, where even marginal gains can significantly alter company outcomes.
• You will be responsible for resolving complex failures that span hardware, operating systems, runtimes, and orchestration layers. Your goal will be to eliminate entire classes of instability before they impact users.
• A critical aspect of this role is building robust mechanisms and automation that make 'heroic' interventions unnecessary, ensuring consistent and predictable system performance.
• **Scaling Training and Inference:** You will define the evolution of our infrastructure and workloads as cluster sizes and concurrency grow exponentially. This includes designing sophisticated scheduling, placement, and resource management approaches for increasingly complex and demanding AI jobs.
• You will collaborate directly with our research teams to build the foundational systems required for developing and deploying next-generation model capabilities.
• Ensuring our inference platforms can scale rapidly and efficiently without compromising reliability or introducing unacceptable latency is paramount.
• A crucial part of your role will be anticipating where current abstractions will inevitably fail under future demands and proactively redesigning systems to stay ahead of these challenges.
• **Building the Organization:** You will play a key role in hiring and developing exceptional systems and reliability engineers, setting a high bar for technical depth, sound judgment, and a strong sense of production ownership.
• You will shape the early architecture of our platform by building strong, collaborative partnerships with our research and product teams.
• You will translate critical reliability constraints into a clear, long-term platform strategy that guides our development efforts.
• **Who You Are:**
• You possess deep expertise in Linux and distributed systems, with a proven track record of operating GPU and accelerator clusters in demanding, real-world production environments.
• You have strong fluency in Kubernetes and modern open-source infrastructure tooling.
• You are adept at debugging complex issues that traverse hardware, kernel, runtime, and orchestration layers.
• You have a profound understanding of how systems behave under contention and at scale, and you can identify and mitigate performance bottlenecks.
• You are a builder who writes code and creates automation to solve problems efficiently.
• You naturally think in terms of bottlenecks, potential failure modes, and critical tradeoffs.
• Engineers trust your judgment and technical guidance, particularly during high-pressure situations when systems are failing.
• **Important Considerations:** This role requires comfort working close to upstream components and 'close to the metal.' If your experience has primarily been within highly abstracted internal platforms where others managed the underlying machinery, this position may not be the ideal fit.
• **Leadership Expectations:** You are expected to raise reliability standards across the entire company, influencing product and research architecture at the earliest stages.
• You excel at building strong, collaborative partnerships rather than simply managing ticket queues.
• You have a proven ability to attract, recruit, and develop exceptional engineers.
• You are deeply curious about how AI models utilize infrastructure, understanding that improving systems directly expands the possibilities of what can be achieved.
• **Why This Role Is Special:** Unlike most infrastructure roles that focus on optimizing mature, established systems, this position offers the unique opportunity to help define how reliability works for a new generation of AI infrastructure. The decisions you make here will have a profound influence on how research progresses, how products scale to meet market demand, how customers perceive and trust our capabilities, and how our engineering organization grows and evolves. If you are driven by the prospect of building the reliability foundations for a company operating at the technological frontier, we encourage you to connect with us.

Skills & Technologies

Kubernetes

Linux

DevOps

Senior

Onsite

About Luma Labs, Inc.

Luma Labs is a technology company focused on developing advanced AI-powered tools for 3D content creation. Their flagship product enables users to capture real-world objects and environments and transform them into high-fidelity 3D models using just a smartphone. This technology serves various industries, including gaming, augmented reality (AR), virtual reality (VR), and e-commerce, by democratizing the creation of immersive digital assets. Luma Labs aims to make 3D scanning and modeling accessible to a broader audience, accelerating the development of the metaverse and other spatial computing applications.