
Job Overview
Location
San Francisco
Job Type
Full-time
Category
Software Engineering
Date Posted
May 22, 2026
Full Job Description
📋 Description
- Design and own end-to-end network architecture for data center clusters powering GPU-based AI inference and training systems.
- Define cluster fabric architectures using InfiniBand or high-performance Ethernet protocols to optimize throughput and reduce latency for distributed workloads.
- Design and implement spine-leaf topologies and rack-level connectivity for scalable, high-availability data center networks.
- Select and specify switches, optics, and cabling systems based on performance, reliability, and scalability requirements for GPU clusters.
- Lead network bring-up, validation, and performance testing across new and existing data center deployments.
- Partner closely with hardware and platform engineering teams to align network design with system-level performance goals.
- Define and document standardized network deployment practices to ensure consistency across multiple data center sites.
- Perform ongoing network performance tuning to support demanding AI workloads, including RDMA and low-latency communication protocols.
- Own technical decision-making for network infrastructure at a staff level, with direct impact on model training and inference efficiency.
- Mentor and support junior engineers and future team members as the network team scales.
- Collaborate with cross-functional teams to troubleshoot complex network issues affecting distributed AI systems.
- Contribute to the evolution of network standards and best practices for high-performance computing environments.
- Ensure network infrastructure meets the reliability and performance demands of mission-critical AI applications used by leading ML companies.
- Maintain detailed documentation of network configurations, topology diagrams, and operational procedures.
- Participate in on-call rotations to respond to critical network incidents affecting production AI infrastructure.
🎯 Requirements
- Experience designing and operating data center or HPC networks.
- Strong familiarity with InfiniBand, RDMA, or high-performance Ethernet.
- Strong hands-on skills in network configuration, debugging, and performance tuning.
- Experience owning complex systems end-to-end at a senior level.
- Experience leading technical projects or cross-functional efforts.
- Prior leadership or mentoring experience is a plus.
🏖️ Benefits
- Competitive compensation, including meaningful equity.
- 100% coverage of medical, dental, and vision insurance for employee and dependents.
- Flexible PTO policy including company-wide Winter Break (offices closed from Christmas Eve to New Year's Day).
- Paid parental leave.
- Fertility and family-building stipend through Carrot.
- Company-facilitated 401(k).
- Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
About BaseTen Inc.
BaseTen provides a serverless, GPU-accelerated platform that lets machine-learning teams deploy, scale and monitor custom models behind autoscaling inference endpoints. The service abstracts infrastructure management, supports PyTorch, TensorFlow and Hugging Face artifacts, and offers built-in observability, A/B testing and fine-tuning. Customers integrate via REST or GraphQL APIs and pay only for compute used. Founded in 2019 and headquartered in San Francisco, BaseTen targets data scientists and product teams seeking production-grade ML serving without Kubernetes complexity.