
Job Overview
Location
North America Remote / San Francisco, CA
Job Type
Full-time
Category
Software Engineering
Date Posted
May 22, 2026
Full Job Description
📋 Description
- Own end-to-end reliability of Andromeda’s AI infrastructure, from hardware rack-and-stack to customer-facing training runs, ensuring uptime and performance for multi-thousand-GPU workloads.
- Carry primary pager responsibility for P0 incidents; lead incident response across the full stack (PyTorch → NCCL → driver → fabric → hardware), diagnose root causes, write postmortems, and ship systemic fixes.
- Operate and maintain day-to-day health of heterogeneous GPU fleets across providers and generations, including node lifecycle management, burn-in, validation, draining, repair workflows, firmware rollouts, and driver upgrades.
- Design, build, and own observability systems for GPU health, fabric monitoring, and automated remediation (including telemetry, health checks, and alerting) to preempt customer-impacting failures.
- Define and evolve Andromeda’s on-call practices: rotations, escalation paths, runbooks, incident command structure, and blameless postmortem culture as the team scales.
- Serve as the senior reliability liaison to AI customers and infrastructure providers, leading incident reviews with customer principal engineers, scoping demanding workloads, and participating in architecture deep-dives and deal cycles.
- Partner directly with product engineering teams to embed SLOs, error budgets, and failure mode analysis into feature development; translate customer pain points into actionable engineering priorities.
- Influence physical infrastructure design with data center and hardware providers on rack layout, power/cooling envelopes, network topology, and validation protocols to prevent failure modes before deployment.
- Mentor engineers daily through incident reviews, pairing on diagnostics, written guidance, and hiring decisions, raising the technical bar across the team.
- Build production-grade tooling (not throwaway scripts) in Go, Python, or Rust for automation, controllers, and infrastructure management, and deploy it reliably in production.
- Operate and optimize Kubernetes-with-GPUs (device plugins, topology-aware scheduling, multi-cluster) and/or Slurm/HPC schedulers; use Terraform, Helm, and Ansible as standard tooling.
- Maintain expert-level command of Linux systems internals: kernel tuning, NVIDIA driver and CUDA toolkit lifecycle, cgroups/namespaces, perf/BPF tracing, and firmware management.
- Diagnose and resolve performance bottlenecks in distributed training workloads, including NCCL communication, CUDA kernel behavior, FSDP, DeepSpeed, Megatron, and checkpointing/recovery patterns.
- Analyze and optimize high-performance networking fabrics: InfiniBand, RoCE, and NVLink topologies; identify degraded links, congestion, and all-reduce inefficiencies in fat-tree or other network designs.
- Be the senior technical voice in customer and provider meetings, translating complex infrastructure issues into clear business and technical context for CTOs, engineers, and procurement teams.
🎯 Requirements
- Multiple years of hands-on experience building and operating large-scale GPU infrastructure as a primary responsibility
- Proven staff-level SRE track record owning reliability of load-bearing infrastructure under high-stakes production conditions
- Deep, production-grade expertise with NVIDIA H100/H200/B200/GB200 (or equivalent) GPUs, including memory hierarchies, ECC, thermal envelopes, NVLink/NVSwitch topology, and hardware failure modes
- Real production experience with InfiniBand, RoCE, and NVLink fabrics for distributed training, including diagnosing slow all-reduce and congestion control
- Working knowledge of distributed training internals: NCCL, CUDA, PyTorch distributed, FSDP, DeepSpeed, Megatron, and modern checkpointing/recovery patterns
- Strong proficiency in Go, Python, or Rust for building production tooling; experience with Kubernetes-with-GPUs and/or Slurm/HPC schedulers; fluency in Terraform, Helm, Ansible
- Expert-level Linux and systems internals knowledge: kernel tuning, NVIDIA driver/CUDA lifecycle, cgroups/namespaces, perf/BPF, firmware management
- Comfort and composure as the senior engineer on P0 incident bridges with customers and providers on the line
🏖️ Benefits
- Significant autonomy to shape infrastructure reliability practices at a company powering the world’s most ambitious AI labs
- Direct impact on infrastructure used by leading AI labs and data centers, with visibility into cutting-edge AI compute demands
- Opportunity to work hands-on in code, customer meetings, and incident response, with no bureaucratic layers
- Collaborative environment with a small, senior team where individual judgment directly shapes customer experience
About Andromeda Technologies Inc.
Andromeda is a technology company focused on developing advanced AI solutions for the space industry. Their core business revolves around creating sophisticated software and hardware that enhances space exploration, satellite operations, and data analysis. Andromeda's platform leverages machine learning and computer vision to automate complex tasks, improve mission efficiency, and provide actionable insights from vast amounts of space-derived data. They aim to be a leader in the burgeoning space tech sector, offering innovative tools that empower researchers, commercial entities, and government agencies to better understand and utilize the space environment. Their work supports a range of applications from Earth observation to deep space missions.