
Job Overview
Location
San Francisco
Job Type
Full-time
Category
DevOps
Date Posted
January 8, 2026
Full Job Description
📋 Description
- Own the entire lifecycle of our AI-enabled developer platform’s infrastructure, from design and provisioning to day-two operations and disaster recovery, ensuring 99.9%+ uptime for thousands of daily code reviews.
- Architect and maintain resilient, multi-region Kubernetes clusters on AWS and GCP using Infrastructure-as-Code (Terraform/Pulumi) with automated drift detection and policy guardrails; every pull request triggers a preview environment so the team can test changes in minutes, not hours.
- Build and continuously improve CI/CD pipelines (GitHub Actions → Argo CD) that deploy microservices, ML models, and GPU workloads in under five minutes while enforcing security scans, dependency checks, and performance regression tests.
- Instrument end-to-end observability: configure Prometheus, Grafana, Loki, and Datadog dashboards that surface the golden signals (latency, traffic, errors, saturation) for both traditional services and GPU-accelerated inference pods, cutting mean time to detect (MTTD) to under 2 minutes.
- Design cost-aware autoscaling policies (Karpenter, HPA, VPA) that balance GPU availability for large-language-model inference against cloud spend; deliver weekly cost reports and right-sizing recommendations to leadership.
- Harden security at every layer: enforce OIDC-based auth, mTLS between services, secrets rotation via Vault, container image signing, and CIS-benchmarked node hardening; run quarterly chaos-engineering drills to validate blast-radius containment.
- Partner with applied-AI engineers to optimize model-serving infrastructure (Triton, vLLM, TensorRT) for low-latency code-review feedback, including canary releases and A/B traffic splitting to measure model accuracy vs. performance.
- Create self-service tooling that lets backend and ML engineers spin up ephemeral dev environments, run integration tests, and ship features without ever opening a ticket; document everything in runbooks and code so nothing depends on tribal knowledge.
- Establish SLOs/SLIs with error budgets and blameless post-mortems; lead incident response, root-cause analysis, and long-term corrective actions that prevent recurrence.
- Contribute to open-source DevOps projects and internal platform libraries, turning one-off scripts into reusable modules the broader community can adopt.
- Mentor junior engineers through pair-programming and design reviews, fostering a culture where infrastructure is treated as a product and reliability is everyone’s job.
- Stay ahead of the curve: evaluate new CNCF projects, GPU orchestrators, and security frameworks, then run proof-of-concepts that keep CodeRabbit on the bleeding edge of developer productivity.
🎯 Requirements
- 3–5 years of hands-on DevOps, SRE, or platform engineering experience in a high-growth startup or scale-up environment.
- Expert-level proficiency with Kubernetes, Docker, and cloud-native CI/CD stacks (GitHub Actions, Argo CD, or similar).
- Deep expertise in at least one major cloud provider (AWS or GCP), including networking, IAM, and cost optimization; Terraform or Pulumi fluency is mandatory.
- Proven track record designing observability solutions using Prometheus, Grafana, ELK/OpenSearch, or Datadog in large-scale distributed systems.
- Solid grasp of cloud security best practices: secrets management, container hardening, network policies, and compliance frameworks (SOC 2, ISO 27001).
🏖️ Benefits
- Competitive base salary + meaningful equity in a fast-growing AI startup redefining software development.
- Hybrid work culture: collaborate in person in San Francisco 2–3 days per week, with flexibility for remote deep-work days and a top-tier home-office stipend.
- Annual learning & development budget ($3,000+) plus paid attendance at leading DevOps/KubeCon conferences and certification programs.
- Premium health, dental, and vision coverage for you and your dependents, plus a monthly wellness stipend and mental-health support.
Skills & Technologies
AWS
GCP
Docker
Kubernetes
Terraform
DevOps
Remote
About CodeRabbit, Inc.
CodeRabbit provides an AI-powered code review platform that integrates with GitHub and GitLab. It automatically analyzes pull requests, identifies bugs, enforces style rules, and suggests improvements in real time. The service supports multiple languages and frameworks, offers customizable policies, and maintains a privacy-focused architecture to keep proprietary code secure.


