This job has expired

This position was posted on May 21, 2026 and is likely no longer accepting applications.

Staff Software Engineer - Managed Kubernetes

Lambda Inc.

Job Overview

Location

Bellevue, WA

Job Type

Full-time

Full Job Description

📋 Description

• Lead the technical vision and development of Lambda’s Managed Kubernetes platform, purpose-built for AI workloads on bare metal infrastructure, ensuring scalability, multi-tenancy, and high availability across tens of thousands of customer deployments.
• Design and implement GPU-aware orchestration systems by integrating and extending NVIDIA’s open-source ecosystem, including GPU Operator, Network Operator, DCGM, NCCL, AICR, and Topograph for topology-aware scheduling and placement.
• Drive the architecture and development of Managed Slurm on Kubernetes, enabling seamless coexistence of traditional HPC batch workloads and containerized AI workloads on the same platform.
• Design and build higher-level platform services for AI inference, including model serving infrastructure, dynamic autoscaling based on inference load, and multi-model deployment patterns.
• Define and implement networking solutions for AI workloads by collaborating with the Network team on CNI integration (Cilium, Multus), high-performance fabrics (InfiniBand, RoCE), RDMA, and GPUDirect.
• Partner with Storage teams to define storage architecture requirements for Managed Kubernetes, Slurm, and future services, ensuring optimal I/O performance for large-scale AI training and inference.
• Establish operational excellence for a managed service through automation of upgrades, security patching, zero-downtime maintenance, and infrastructure-as-code/GitOps workflows.
• Lead chaos engineering initiatives to validate system resilience under failure conditions at scale, ensuring platform reliability for mission-critical AI workloads.
• Build self-healing systems and automation for incident response, root cause analysis, and proactive platform resilience across distributed GPU clusters.
• Serve as the technical bridge between Orchestration and cross-functional teams (Network, Storage, Security), translating platform needs into actionable infrastructure specifications across the full stack.
• Drive infrastructure-wide decisions that align bare-metal provisioning, network topology, and storage systems with the requirements of managed orchestration services.
• Champion consistency and standardization across Lambda’s infrastructure stack, ensuring unified design patterns, monitoring, and operational practices.
• Mentor and grow engineers on the Orchestration team, establishing best practices in Kubernetes development, distributed systems, and Cloud Native engineering.
• Represent Lambda externally through technical blog posts, conference talks, and strategic customer engagements, and contribute to the broader open-source community.
• Shape the AIOps vision by designing intelligent systems for automated capacity planning, anomaly detection, and predictive maintenance of cloud infrastructure.
• Collaborate directly with customers and internal teams to understand legacy deployments and chart migration paths to Lambda’s managed platform.
• Engage with NVIDIA and the open-source ecosystem to stay current on GPU orchestration technologies and contribute back to upstream projects where applicable.
• Design and implement secure, compliant multi-tenant environments with RBAC, Pod Security Standards, network policies, and workload isolation.
• Maintain deep expertise in Linux systems and networking (L2–L7), including high-performance networking concepts like RDMA and InfiniBand.
• Write production-quality code in Go and Python, with a focus on building scalable, maintainable, and elegant infrastructure systems.
• Drive design reviews and technical decision-making across teams to ensure systems are scalable, observable, and aligned with customer needs.
• Use modern tools and AI-assisted development (e.g., Claude Code) to accelerate productivity and increase engineering impact.

🎯 Requirements

• 10+ years of experience in software engineering, platform engineering, or SRE, with at least 5 years focused on Kubernetes at scale
• Expert-level understanding of Kubernetes internals: API machinery, controllers, schedulers, operators, CRDs, CSI, CNI, and extension patterns
• Strong software engineering skills in Go (required) and Python; production-grade coding experience
• Deep experience with GPU orchestration in Kubernetes: NVIDIA GPU Operator, device plugins, DCGM, MIG, time-slicing, and GPU-aware scheduling
• Proven track record of technical leadership: driving cross-team design decisions, mentoring engineers, and influencing infrastructure direction beyond immediate scope
• Hands-on experience designing and operating managed services or multi-tenant platforms

🏖️ Benefits

• Generous cash & equity compensation
• Health, dental, and vision coverage for you and your dependents
• Wellness and commuter stipends for select roles
• 401(k) plan with 2% company match (USA employees)
• Flexible paid time off plan that employees actively use
• Opportunity to work with NVIDIA’s cutting-edge open-source GPU and networking stack

Skills & Technologies

Python, Kubernetes, Linux, Prometheus, Grafana

Senior, Onsite

About Lambda Inc.

Lambda Inc. provides cloud-based GPU clusters and workstations for artificial-intelligence research and development. The company designs and operates high-performance hardware infrastructure optimized for machine-learning workloads, offering on-demand access to NVIDIA GPUs, pre-configured deep-learning software stacks, and scalable storage. Customers include AI labs, universities, and enterprises training large language and computer-vision models. Founded in 2012, Lambda is headquartered in San Francisco and maintains data centers across North America and Europe.
