SRE

Job Overview

Location: San Francisco
Job Type: Full-time
Category: DevOps & SysAdmin
Date Posted: May 12, 2026

Full Job Description

📋 Description

  • As a Site Reliability Engineer at Baseten, you'll define and codify the gold standards of day-2 operations for our ML infrastructure platform, ensuring reliability at scale for mission-critical AI inference systems used by leading companies like Notion, Cursor, and OpenEvidence.
  • You'll own the reliability of Baseten's multi-cloud Kubernetes infrastructure, build and maintain observability as code, author and improve runbooks, diagnose runtime issues related to latency and GPU utilization, and convert failure patterns into automated mitigations.
  • You'll work closely with engineering, forward-deployed, and product teams to turn tribal knowledge into automated systems, raise the operational floor, and empower the organization to operate confidently at the frontier of AI infrastructure.
  • In this role, you'll deepen your expertise in SRE practices, observability-as-code, infrastructure automation, and ML infrastructure challenges, gaining exposure to cutting-edge AI startups while shaping the reliability foundation of a rapidly growing platform.

🎯 Requirements

  • Extensive hands-on experience with Kubernetes (multi-cloud experience across EKS, GKE, or similar is a strong plus).
  • Experience in building and maintaining scalable infrastructure.
  • Strong foundation in observability tooling: metrics (VictoriaMetrics, Prometheus), logging (Loki, ELK), dashboards (Grafana), and alerting pipelines. Observability-as-code experience is a plus.
  • Experience with infrastructure-as-code (Terraform, Helm) and GitOps workflows (Flux CD, ArgoCD).
  • Experience writing and improving runbooks, leading incident response, and doing post-mortem analysis.
  • Comfort working at the intersection of engineering and operations: you write code, but you also think deeply about process, escalation paths, and operational leverage.

🏖️ Benefits

  • Competitive compensation, including meaningful equity.
  • 100% coverage of medical, dental, and vision insurance for employees and dependents.
  • Flexible PTO policy, including a company-wide Winter Break (offices closed from Christmas Eve to New Year's Day!).
  • Paid parental leave.
  • Fertility and family-building stipend through Carrot.
  • Company-facilitated 401(k).
  • Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.

Skills & Technologies

Kubernetes
Terraform
Apache Spark
Prometheus
Grafana
DevOps
Senior
Onsite

About BaseTen Inc.

BaseTen provides a serverless, GPU-accelerated platform that lets machine-learning teams deploy, scale and monitor custom models behind autoscaling inference endpoints. The service abstracts infrastructure management, supports PyTorch, TensorFlow and Hugging Face artifacts, and offers built-in observability, A/B testing and fine-tuning. Customers integrate via REST or GraphQL APIs and pay only for compute used. Founded in 2019 and headquartered in San Francisco, BaseTen targets data scientists and product teams seeking production-grade ML serving without Kubernetes complexity.
