Member of Technical Staff, Site Reliability Engineer

Job Overview

Location: San Francisco
Job Type: Full-time
Category: Data Science
Date Posted: June 4, 2026

Full Job Description

📋 Description

  • Drive 99.99% call completion reliability for a real-time voice AI platform handling over 1 billion calls, where any p99 spike directly results in dropped calls.
  • Own and operate SLOs and error budgets for critical call-completion pathways using Chronosphere, Prometheus, Grafana, and Datadog, translating incident patterns into prioritized reliability backlogs.
  • Lead incident command during outages and own the end-to-end postmortem process, ensuring actionable learnings are institutionalized to prevent recurrence.
  • Design, build, and ship production-grade platform services in Go or TypeScript (including auto-remediation systems, capacity forecasters, and oncall tooling) to automate reliability at scale.
  • Perform capacity planning and load testing against provider rate limits and per-org concurrency constraints, identifying bottlenecks before they impact live traffic.
  • Tune and optimize Kubernetes autoscaling configurations using KEDA and custom metrics for Vapi’s wscaler and workerpool-cron-scaler systems to maintain stability under variable load.
  • Diagnose and resolve production Kubernetes issues including pod crashes, HPA/VPA misconfigurations, PodDisruptionBudget violations, and graceful shutdown failures.
  • Implement backpressure patterns and autoscaling logic to prevent cascading failures during traffic surges in a latency-sensitive environment where degraded performance means lost calls.
  • Audit and improve observability pipelines using OpenTelemetry, ensuring metrics, logs, and traces accurately reflect system health and caller experience.
  • Collaborate with engineering teams to embed reliability practices into the SDLC, including defining service-level objectives, error budget burn rates, and alerting thresholds.
  • Maintain and enhance core Vapi services including cluster-manager, database-health, wscaler, and incidentManager through direct code contributions.
  • Conduct regular load tests to validate system resilience under peak conditions, particularly around third-party API rate limits and per-customer concurrency caps.
  • Build a culture of ownership and reliability across engineering by mentoring teams on incident response, SLO adherence, and proactive capacity planning.
  • Work in a high-stakes environment where system degradation directly impacts customer experience, not just dashboard metrics, requiring deep empathy for real-time user outcomes.
  • Operate on EKS with production-grade Kubernetes practices, ensuring high availability, fault tolerance, and efficient resource utilization across distributed voice processing systems.

🎯 Requirements

  • You’ve run incident command and postmortem discipline at scale on a real oncall rotation.
  • You’ve operated SLOs and error budgets in Chronosphere, Prometheus, Grafana, or Datadog.
  • You’ve done capacity planning and load testing for production systems with real users.
  • You’re fluent in Kubernetes production ops: pod crash diagnosis, HPA/VPA tuning, PodDisruptionBudgets, graceful shutdown.
  • You know backpressure and autoscaling patterns, including KEDA and custom-metrics scaling.

🏖️ Benefits

  • Real stake: We offer a competitive salary and excellent equity ownership.
  • Comprehensive health coverage: medical, dental, and vision plans.
  • Team love: We love hanging out, and we do quarterly off-sites.
  • Flexible time off: take what you need.
  • Catered meals, transportation, gym, and a $10k annual L&D budget.

Skills & Technologies

TypeScript
Go
Kubernetes
Prometheus
Grafana
Senior
Onsite


About Vapi Technologies Inc.

Vapi lets developers build and deploy advanced voice AI agents through a highly configurable, API-first platform. Serving clients that range from startups to Fortune 500 companies, Vapi simplifies the creation of voice AI products and helps scale phone operations efficiently. The platform serves a global user base, with multilingual support for over 100 languages. Vapi has powered over 300 million calls and launched more than 2.5 million assistants.
