This job has expired

This position was posted on May 22, 2026 and is likely no longer accepting applications.

Staff SRE, AI Infrastructure

Andromeda Technologies Inc.

Job Overview

Location

North America Remote / San Francisco, CA

Job Type

Full-time

Full Job Description

📋 Description

• Own end-to-end reliability of Andromeda’s AI infrastructure, from hardware rack-and-stack to customer-facing training runs, ensuring uptime and performance for multi-thousand-GPU workloads.
• Carry primary pager responsibility for P0 incidents; lead incident response across the full stack (PyTorch → NCCL → driver → fabric → hardware), diagnose root causes, write postmortems, and ship systemic fixes.
• Operate and maintain day-to-day health of heterogeneous GPU fleets across providers and generations, including node lifecycle management, burn-in, validation, draining, repair workflows, firmware rollouts, and driver upgrades.
• Design, build, and own observability systems for GPU health, fabric monitoring, and automated remediation — including telemetry, health checks, and alerting — to preempt customer-impacting failures.
• Define and evolve Andromeda’s on-call practices: rotations, escalation paths, runbooks, incident command structure, and blameless postmortem culture as the team scales.
• Serve as the senior reliability liaison to AI customers and infrastructure providers, leading incident reviews with customer principal engineers, scoping demanding workloads, and participating in architecture deep-dives and deal cycles.
• Partner directly with product engineering teams to embed SLOs, error budgets, and failure mode analysis into feature development; translate customer pain points into actionable engineering priorities.
• Influence physical infrastructure design with data center and hardware providers on rack layout, power/cooling envelopes, network topology, and validation protocols to prevent failure modes before deployment.
• Mentor engineers daily through incident reviews, pairing on diagnostics, written guidance, and hiring decisions — raising the technical bar across the team.
• Build production-grade tooling (not throwaway scripts) in Go, Python, or Rust for automation, controllers, and infrastructure management, and deploy it reliably in production.
• Operate and optimize Kubernetes-with-GPUs (device plugins, topology-aware scheduling, multi-cluster) and/or Slurm/HPC schedulers; use Terraform, Helm, and Ansible as standard tooling.
• Maintain expert-level command of Linux systems internals: kernel tuning, NVIDIA driver and CUDA toolkit lifecycle, cgroups/namespaces, perf/BPF tracing, and firmware management.
• Diagnose and resolve performance bottlenecks in distributed training workloads, including NCCL communication, CUDA kernel behavior, FSDP, DeepSpeed, Megatron, and checkpointing/recovery patterns.
• Analyze and optimize high-performance networking fabrics: InfiniBand, RoCE, and NVLink topologies; identify degraded links, congestion, and all-reduce inefficiencies in fat-tree or other network designs.
• Be the senior technical voice in customer and provider meetings, translating complex infrastructure issues into clear business and technical context for CTOs, engineers, and procurement teams.

🎯 Requirements

• Multiple years of hands-on experience building and operating large-scale GPU infrastructure as a primary responsibility
• Proven staff-level SRE track record owning reliability of load-bearing infrastructure under high-stakes production conditions
• Deep, production-grade expertise with NVIDIA H100/H200/B200/GB200 (or equivalent) GPUs, including memory hierarchies, ECC, thermal envelopes, NVLink/NVSwitch topology, and hardware failure modes
• Real production experience with InfiniBand, RoCE, and NVLink fabrics for distributed training, including diagnosing slow all-reduce and congestion control
• Working knowledge of distributed training internals: NCCL, CUDA, PyTorch distributed, FSDP, DeepSpeed, Megatron, and modern checkpointing/recovery patterns
• Strong proficiency in Go, Python, or Rust for building production tooling; experience with Kubernetes-with-GPUs and/or Slurm/HPC schedulers; fluency in Terraform, Helm, and Ansible
• Expert-level knowledge of Linux and systems internals: kernel tuning, NVIDIA driver/CUDA lifecycle, cgroups/namespaces, perf/BPF, and firmware management
• Comfort and composure as the senior engineer on P0 incident bridges with customers and providers on the line

🏖️ Benefits

• Significant autonomy to shape infrastructure reliability practices at a company powering the world’s most ambitious AI labs
• Direct impact on infrastructure used by leading AI labs and data centers, with visibility into cutting-edge AI compute demands
• Opportunity to work hands-on in code, customer meetings, and incident response — no bureaucratic layers
• Collaborative environment with a small, senior team where individual judgment directly shapes customer experience

Skills & Technologies

Python, Rust, Node.js, Kubernetes, Terraform, DevOps, Senior, Remote

About Andromeda Technologies Inc.

Andromeda is a technology company focused on developing advanced AI solutions for the space industry. Their core business revolves around creating sophisticated software and hardware that enhances space exploration, satellite operations, and data analysis. Andromeda's platform leverages machine learning and computer vision to automate complex tasks, improve mission efficiency, and provide actionable insights from vast amounts of space-derived data. They aim to be a leader in the burgeoning space tech sector, offering innovative tools that empower researchers, commercial entities, and government agencies to better understand and utilize the space environment. Their work supports a range of applications from Earth observation to deep space missions.
