This job has expired

This position was posted on May 19, 2026 and is likely no longer accepting applications.

Senior Manager, Cloud Platform & Site Reliability

BaseTen Inc.

Job Overview

Location

San Francisco

Job Type

Full-time

Full Job Description

📋 Description

• Lead and develop multiple team leads across Cloud Platform and Site Reliability Engineering (SRE) functions, fostering a culture of ownership, technical excellence, and continuous improvement.
• Set the technical roadmap for infrastructure, reliability, and platform engineering at the organizational level, balancing short-term operational demands with long-term strategic investments in multi-cloud capacity, GPU inference, and observability.
• Own the end-to-end reliability posture of Baseten’s ML platform, defining and enforcing org-wide standards for SLOs/SLIs, incident response protocols, observability-as-code, runbooks, and post-incident reviews.
• Drive cross-functional alignment between engineering, product, and customer-facing teams to ensure infrastructure capabilities meet product goals and enterprise customer SLA requirements.
• Oversee incident management and escalation processes for high-severity production issues, ensuring rapid resolution, clear communication, and systemic follow-through to prevent recurrence.
• Translate recurring operational pain points and enterprise customer feedback into actionable roadmap priorities, infrastructure improvements, and runbook enhancements across both Cloud Platform and SRE teams.
• Ensure consistent adoption and maintenance of best practices in CI/CD, infrastructure-as-code (Terraform, Pulumi), GitOps workflows (Flux CD, ArgoCD, Helm), Kubernetes, and cloud resource management.
• Partner with forward-deployed and customer success teams to support enterprise accounts with strict SLAs and complex infrastructure needs, providing technical guidance and escalation support.
• Make principled architectural and organizational tradeoffs to avoid unnecessary complexity while enabling teams to move fast and scale reliably in a high-growth environment.
• Maintain technical credibility by engaging meaningfully in architectural decisions involving Kubernetes, multi-cloud infrastructure (EKS, GKE), distributed systems, and GPU inference platforms.
• Demonstrate accountability and high standards in all aspects of infrastructure ownership, expecting the same rigor from team leads and their engineering teams.
• Represent technical work clearly to both technical and non-technical audiences, including executives, with strong communication and executive presence.
• Stay open to learning about ML infrastructure and model serving, even without prior ML experience, to effectively support the company’s mission of enabling AI product deployment.

🎯 Requirements

• Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field
• Proven experience managing managers and leading multiple high-performing infrastructure, platform, or SRE teams in a fast-paced, high-growth environment
• Deep technical expertise in Kubernetes (multi-cloud across EKS, GKE, or similar), cloud infrastructure, and distributed systems, with the ability to engage credibly in architectural and operational decisions
• Hands-on background with infrastructure-as-code (e.g., Terraform, Pulumi) and CI/CD tooling (e.g., GitHub Actions, GitLab CI, Jenkins); familiarity with GitOps workflows (e.g., Flux CD, ArgoCD, Helm)
• Strong foundation in observability tooling — metrics (Prometheus, VictoriaMetrics), logging (Loki, ELK), dashboards (Grafana), tracing (OpenTelemetry) — and a track record of raising reliability standards through SLOs, SLIs, and observability-as-code
• Experience owning incident management and enterprise SLAs at scale, including executive-level communication during high-severity incidents and rigorous post-incident follow-through

🏖️ Benefits

• Competitive compensation, including meaningful equity
• 100% coverage of medical, dental, and vision insurance for employees and their dependents
• Flexible PTO policy, including a company-wide Winter Break (offices closed from Christmas Eve to New Year's Day)
• Paid parental leave
• Fertility and family-building stipend through Carrot
• Company-facilitated 401(k)

Skills & Technologies

Node.js

Kubernetes

Terraform

Jenkins

GitLab

Senior

Onsite

Degree Required

About BaseTen Inc.

BaseTen provides a serverless, GPU-accelerated platform that lets machine-learning teams deploy, scale and monitor custom models behind autoscaling inference endpoints. The service abstracts infrastructure management, supports PyTorch, TensorFlow and Hugging Face artifacts, and offers built-in observability, A/B testing and fine-tuning. Customers integrate via REST or GraphQL APIs and pay only for compute used. Founded in 2019 and headquartered in San Francisco, BaseTen targets data scientists and product teams seeking production-grade ML serving without Kubernetes complexity.
