
Job Overview
Location: San Francisco
Job Type: Full-time
Category: DevOps & SysAdmin
Date Posted: May 12, 2026
Full Job Description
📋 Description
- As a Site Reliability Engineer at Baseten, you'll define and codify the gold standards of day-2 operations for our ML infrastructure platform, ensuring reliability at scale for mission-critical AI inference systems used by leading companies like Notion, Cursor, and OpenEvidence.
- You'll own the reliability of Baseten's multi-cloud Kubernetes infrastructure, build and maintain observability as code, author and improve runbooks, diagnose runtime issues related to latency and GPU utilization, and convert failure patterns into automated mitigations.
- You'll work closely with engineering, forward-deployed, and product teams to turn tribal knowledge into automated systems, raise the operational floor, and empower the organization to operate confidently at the frontier of AI infrastructure.
- In this role, you'll deepen your expertise in SRE practices, observability-as-code, infrastructure automation, and ML infrastructure challenges, while gaining exposure to cutting-edge AI startups and shaping the reliability foundation of a rapidly growing platform.
🎯 Requirements
- Extensive hands-on experience with Kubernetes (multi-cloud experience across EKS, GKE, or similar is a strong plus).
- Experience building and maintaining scalable infrastructure.
- Strong foundation in observability tooling: metrics (VictoriaMetrics, Prometheus), logging (Loki, ELK), dashboards (Grafana), and alerting pipelines. Observability-as-code experience is a plus.
- Experience with infrastructure-as-code (Terraform, Helm) and GitOps workflows (Flux CD, ArgoCD).
- Experience writing and improving runbooks, leading incident response, and conducting post-mortem analysis.
- Comfort working at the intersection of engineering and operations: you write code, but you also think deeply about process, escalation paths, and operational leverage.
🏖️ Benefits
- Competitive compensation, including meaningful equity.
- 100% coverage of medical, dental, and vision insurance for employees and dependents.
- Flexible PTO policy, including a company-wide Winter Break (offices closed from Christmas Eve to New Year's Day!).
- Paid parental leave.
- Fertility and family-building stipend through Carrot.
- Company-facilitated 401(k).
- Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
About Baseten Inc.
Baseten provides a serverless, GPU-accelerated platform that lets machine-learning teams deploy, scale, and monitor custom models behind autoscaling inference endpoints. The service abstracts infrastructure management, supports PyTorch, TensorFlow, and Hugging Face artifacts, and offers built-in observability, A/B testing, and fine-tuning. Customers integrate via REST or GraphQL APIs and pay only for compute used. Founded in 2019 and headquartered in San Francisco, Baseten targets data scientists and product teams seeking production-grade ML serving without Kubernetes complexity.