
Software Engineer — GPU Networking & Distributed Systems

Job Overview

Location: San Francisco
Job Type: Full-time
Category: Software Engineer
Date Posted: February 25, 2026

Full Job Description

📋 Description

  • At BaseTen, we are at the forefront of enabling mission-critical AI inference for leading companies like Cursor, Notion, and Writer. We are building the global operating system for distributed, heterogeneous AI hardware, recognizing that as AI workloads scale, the network becomes the computer. This role is pivotal in leading our GPU Networking efforts, making RDMA a foundational element of our infrastructure and unlocking next-generation distributed inference optimizations.
  • The convergence of networking and compute is no longer a distant future; it's our present reality. With the immense throughput of hardware like the H100, B200, and GB200 NVL72, we are pioneering a new paradigm where communication is intrinsically co-optimized with computation. This era demands a network that acts as an active accelerator, leveraging smart hardware offloads and direct interconnects to ensure data movement operates at wire speed, eliminating bottlenecks and maximizing efficiency.
  • As a Software Engineer on our GPU Networking & Distributed Systems team, you will transcend traditional network configuration. Your mission will be to architect the sophisticated software fabric that unifies thousands of GPUs into a cohesive, high-performance operating system. While we embrace and integrate the best of the open-source ecosystem, we are not constrained by its limitations. Where off-the-shelf solutions fall short, you will be empowered to build from the ground up, engineering the fundamental primitives necessary to co-optimize communication and compute for critical applications such as Disaggregated Serving, Wide Expert Parallelism (WideEP), and dramatically reducing cold start times for large models.
  • A core responsibility will be to make RDMA (Remote Direct Memory Access) a first-class citizen within our inference stack. This involves deep integration of RDMA, RoCE (RDMA over Converged Ethernet), and InfiniBand capabilities. By moving beyond the limitations of traditional TCP/IP, you will unlock order-of-magnitude improvements in both bandwidth and latency, crucial for the demanding nature of modern AI workloads.
  • You will be instrumental in optimizing distributed inference. This includes implementing and meticulously tuning the networking layers essential for efficient Disaggregated KV Cache Offload and Wide Expert Parallelism (WideEP). Your work will ensure seamless, high-speed communication across NVLink and InfiniBand interconnects, which is paramount for the performance of our Mixture-of-Experts (MoE) models.
  • Enabling serverless-grade startup speeds for Large Language Models (LLMs) is another key objective. You will engage deeply with checkpointing and storage mechanisms, architecting solutions that facilitate sub-10-second startup times for models with trillions of parameters, a feat that will redefine user experience and operational efficiency.
  • A significant part of your role will involve deep dives into hardware. You will characterize and validate networking performance on the latest bleeding-edge GPU clusters, including H100/H200, B200/B300, and GB200/GB300 NVL72 configurations. This includes writing rigorous acceptance tests to guarantee that our hardware consistently delivers peak achievable throughput and minimal latency.
  • Building robust observability tools is crucial for managing complex distributed systems. You will design and implement systems that allow us to visualize packet flow, identify congestion points, and measure effective bandwidth across GPU interconnects, providing critical insights for diagnosing and resolving intricate distributed system behaviors.
  • You will also have the opportunity to optimize communication kernels. This involves working closely with established libraries like NCCL (the NVIDIA Collective Communication Library) and NVSHMEM (NVIDIA's OpenSHMEM-based GPU communication library), and potentially developing custom communication kernels to achieve optimal overlap between computation and data transfer, maximizing GPU utilization.
  • This role offers unparalleled exposure to the absolute cutting edge of AI hardware. You will be among the first engineers in the industry to optimize networking for next-generation architectures like Blackwell (B200/B300) and GB200/GB300 NVL72 racks. Our commitment to operating at every depth means you will tackle challenges from tuning hardware interconnects and writing custom communication kernels to designing distributed inference strategies, working across the entire stack to deliver exceptional performance.
  • The networking optimizations you develop will directly enable groundbreaking features, such as seamless multi-node WideEP and instant model hydration, pushing the boundaries of what's possible in AI deployment. You will be a foundational engineer, shaping the future of distributed AI infrastructure.
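To give a feel for the scale involved, here is a back-of-envelope sketch of the bandwidth implied by the sub-10-second cold-start goal above. The parameter count, precision, and time budget are illustrative assumptions, not figures from the posting:

```python
# Back-of-envelope: aggregate read bandwidth needed to hydrate a large
# model within a cold-start budget. All inputs are illustrative.

def required_bandwidth_gbs(params: float, bytes_per_param: int, budget_s: float) -> float:
    """Aggregate bandwidth (GB/s) to load all weights within budget_s seconds."""
    total_gb = params * bytes_per_param / 1e9  # model size in GB
    return total_gb / budget_s

# A 1-trillion-parameter model in fp16 (2 bytes/param) with a 10 s budget:
bw = required_bandwidth_gbs(params=1e12, bytes_per_param=2, budget_s=10.0)
print(f"{bw:.0f} GB/s aggregate")  # 200 GB/s
```

Since a single 400 Gb/s NIC tops out around 50 GB/s, a budget like this has to be met by striping weight loads across many NICs, GPUs, and storage paths in parallel — which is why the posting pairs cold-start work with GPUDirect Storage and high-performance filesystems.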

🎯 Requirements

  • Deep experience with high-performance networking protocols such as InfiniBand and RoCE v2, and a strong understanding of the physics of data movement.
  • Fluency in C++ or Python, with the ability to bridge high-level logic and hardware, coupled with a deep understanding of the memory hierarchy in modern NVIDIA architectures (e.g., H100/Blackwell) and optimization techniques.
  • Proven ability to dive deep into complex systems, including debugging low-level issues in areas like TensorRT-LLM, writing custom C++/Python bindings, or diagnosing NVLink topology problems.
  • Strong judgment in selecting between off-the-shelf solutions and building custom infrastructure when existing tools (e.g., standard Kubernetes networking) are insufficient for high-performance needs.
  • Highly preferred: Deep knowledge of NCCL, NVSHMEM, and UCX.
  • Highly preferred: Experience with GPUDirect Storage (GDS) or high-performance filesystems (e.g., Weka, 3FS).

🏖️ Benefits

  • Competitive compensation package, including significant equity.
  • Comprehensive 100% coverage for medical, dental, and vision insurance for employees and their dependents.
  • Generous Paid Time Off (PTO) policy, including a company-wide Winter Break from Christmas Eve to New Year's Day.
  • Paid parental leave to support new parents.
  • Company-facilitated 401(k) plan.
  • Unique exposure to a diverse range of ML startups, providing exceptional learning and networking opportunities.

Skills & Technologies

Python, Go, Node.js, Kubernetes, Apache Spark

Work Arrangement: Onsite
Degree Required


About BaseTen Inc.

BaseTen provides a serverless, GPU-accelerated platform that lets machine-learning teams deploy, scale and monitor custom models behind autoscaling inference endpoints. The service abstracts infrastructure management, supports PyTorch, TensorFlow and Hugging Face artifacts, and offers built-in observability, A/B testing and fine-tuning. Customers integrate via REST or GraphQL APIs and pay only for compute used. Founded in 2019 and headquartered in San Francisco, BaseTen targets data scientists and product teams seeking production-grade ML serving without Kubernetes complexity.

