
Job Overview
Location
San Francisco
Job Type
Full-time
Category
Data Science
Date Posted
June 4, 2026
Full Job Description
📋 Description
- Drive 99.99% call completion reliability for a real-time voice AI platform handling over 1 billion calls, where any p99 latency spike translates directly into dropped calls.
- Own and operate SLOs and error budgets for critical call-completion pathways using Chronosphere, Prometheus, Grafana, and Datadog, translating incident patterns into prioritized reliability backlogs.
- Lead incident command during outages and own the end-to-end postmortem process, ensuring actionable learnings are institutionalized to prevent recurrence.
- Design, build, and ship production-grade platform services in Go or TypeScript, including auto-remediation systems, capacity forecasters, and oncall tooling, to automate reliability at scale.
- Perform capacity planning and load testing against provider rate limits and per-org concurrency constraints, identifying bottlenecks before they impact live traffic.
- Tune and optimize Kubernetes autoscaling configurations using KEDA and custom metrics for Vapi’s wscaler and workerpool-cron-scaler systems to maintain stability under variable load.
- Diagnose and resolve production Kubernetes issues including pod crashes, HPA/VPA misconfigurations, PodDisruptionBudget violations, and graceful shutdown failures.
- Implement backpressure patterns and autoscaling logic to prevent cascading failures during traffic surges in a latency-sensitive environment where degraded performance means lost calls.
- Audit and improve observability pipelines using OpenTelemetry, ensuring metrics, logs, and traces accurately reflect system health and caller experience.
- Collaborate with engineering teams to embed reliability practices into the SDLC, including defining service-level objectives, error budget burn rates, and alerting thresholds.
- Maintain and enhance core Vapi services including cluster-manager, database-health, wscaler, and incidentManager through direct code contributions.
- Conduct regular load tests to validate system resilience under peak conditions, particularly around third-party API rate limits and per-customer concurrency caps.
- Build a culture of ownership and reliability across engineering by mentoring teams on incident response, SLO adherence, and proactive capacity planning.
- Work in a high-stakes environment where system degradation directly impacts customer experience, not just dashboard metrics, and requires deep empathy for real-time user outcomes.
- Operate on EKS with production-grade Kubernetes practices, ensuring high availability, fault tolerance, and efficient resource utilization across distributed voice processing systems.
🎯 Requirements
- You’ve run incident command and postmortem discipline at scale on a real oncall rotation.
- You’ve operated SLOs and error budgets in Chronosphere, Prometheus, Grafana, or Datadog.
- You’ve done capacity planning and load testing for production systems with real users.
- You’re fluent in Kubernetes production ops: pod crash diagnosis, HPA/VPA tuning, PodDisruptionBudgets, graceful shutdown.
- You know backpressure and autoscaling patterns, including KEDA and custom-metrics scaling.
🏖️ Benefits
- Real stake: We offer a competitive salary and excellent equity ownership.
- Comprehensive health coverage: medical, dental, and vision plans.
- Team love: We love hanging out, and we do quarterly off-sites.
- Flexible time off: take what you need.
- Catered meals, transportation, gym, and a $10k annual L&D budget.
About Vapi Technologies Inc.
Vapi lets developers build and deploy advanced voice AI agents through a highly configurable, API-first platform. Its clients range from startups to Fortune 500 companies, which use Vapi to build voice AI products and scale phone operations efficiently. The platform serves a global user base, with multilingual support for over 100 languages. To date, Vapi has powered over 300 million calls and launched more than 2.5 million assistants.