This job has expired

This position was posted on March 10, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform)

Deepgram Inc.

Job Overview

Location

USA | Remote

Job Type

Full-time

Full Job Description

📋 Description

• Deepgram is at the forefront of the rapidly expanding Voice AI economy, a market projected to reach trillions of dollars. We provide best-in-class real-time APIs for speech-to-text (STT) and text-to-speech (TTS), enabling developers and organizations to build sophisticated, production-grade voice agents at an unprecedented scale. Our platform is the engine behind numerous innovative voice offerings, trusted by over 1,300 organizations and more than 200,000 developers worldwide. Companies like Twilio, Cloudflare, Sierra, Decagon, Vapi, Daily, Cresta, Granola, and Jack in the Box rely on Deepgram to power their voice solutions. We offer unmatched accuracy, minimal latency, and exceptional cost efficiency through cloud APIs, as well as self-hosted and on-premises software deployments. With a recent Series C funding round led by top-tier global investors and strategic partners, Deepgram has processed an astounding 50,000+ years of audio and transcribed over 1 trillion words, solidifying our position as the world's foremost authority on voice technology.
• At Deepgram, we foster an AI-first culture where the use and exploration of AI are not just encouraged but are fundamental to our operations, innovation, and performance metrics. Every team member is expected to actively engage and experiment with advanced AI tools, integrating them into their daily workflows and even developing custom AI solutions. Success is measured by the effective application of AI to achieve tangible results, and a consistent, creative approach to leveraging the latest AI capabilities is paramount. Candidates must be comfortable rapidly adopting new models and methodologies, seamlessly integrating AI into their work, and continuously pushing the boundaries of what these technologies can accomplish.
• We operate at the accelerated pace of AI development. The landscape is constantly evolving, and your day-to-day responsibilities will adapt just as quickly. This role is ideal for individuals who are energized by experimentation, adaptation, critical thinking, and continuous learning. If you are seeking a highly prescriptive role with a traditional structure, this might not be the best fit.
• We are seeking a highly experienced Site Reliability Engineer (SRE) to architect, build, and operate the foundational hybrid infrastructure that supports our cutting-edge AI/ML research and product development initiatives. You will be instrumental in designing and managing a robust platform that spans both AWS cloud environments and our on-premises bare-metal data centers. Your primary objective will be to create a highly scalable, reliable, and self-service environment that empowers our AI researchers and ML engineers to train and deploy complex models efficiently. This will involve extensive use of Kubernetes for orchestration, AWS for cloud services, and Infrastructure-as-Code (IaC) principles with Terraform for automation and reproducibility. A key aspect of this role will be orchestrating high-demand GPU workloads using advanced schedulers like Slurm.

Key Responsibilities:
• Architect, deploy, and maintain our core computing platform, leveraging Kubernetes on both AWS and on-premises bare-metal infrastructure to ensure a stable, scalable, and highly available environment for all critical applications and services.
• Develop, implement, and manage our entire infrastructure using Infrastructure-as-Code (IaC) best practices with Terraform, guaranteeing that all environments are fully reproducible, meticulously versioned, and extensively automated.
• Design, build, and continuously optimize our AI/ML job scheduling and orchestration systems, focusing on seamless integration of Slurm with our Kubernetes clusters to ensure the most efficient utilization of valuable GPU resources.
• Take ownership of the provisioning, management, and ongoing maintenance of our on-premises bare-metal server infrastructure, specifically tailored for high-performance GPU computing tasks.
• Implement and manage the platform's critical networking components (e.g., CNI, service mesh) and storage solutions (e.g., CSI, S3) to support demanding high-throughput, low-latency workloads across our hybrid cloud and on-premises environments.
• Develop and maintain a comprehensive observability stack, encompassing monitoring, logging, and tracing, to proactively ensure platform health, identify potential issues, and create robust automation for operational tasks, incident response, and performance tuning.
• Foster close collaboration with our AI researchers and ML engineers, deeply understanding their unique infrastructure requirements to build and deliver the essential tools and streamlined workflows that significantly accelerate their development cycles.
• Automate the complete lifecycle management of single-tenant, managed deployments, ensuring efficiency and reliability.

You will thrive in this role if you:
• Possess a genuine passion for building sophisticated platforms that significantly empower developers and researchers.
• Excel at creating elegant, highly automated solutions for complex infrastructure challenges that span both cloud and traditional data center environments.
• Are driven by optimizing hybrid infrastructure for peak performance, cost-effectiveness, and unwavering reliability.
• Are excited by the prospect of working at the dynamic intersection of modern platform engineering and groundbreaking AI technologies.
• Embrace the philosophy of treating infrastructure as a product, constantly seeking opportunities for improvement and enhancing the developer experience.

Skills & Technologies

Python

AWS

Kubernetes

Terraform

Jenkins

DevOps

Remote

About Deepgram Inc.

Deepgram builds end-to-end speech AI infrastructure that converts live or recorded audio into text and insights. The company trains large-scale neural networks on GPU clusters to deliver low-latency transcription, keyword detection, and speaker diarization through a single API. Developers use the platform for call centers, meetings, podcasts, and voice bots, paying per minute or hosting the engine on-premises. Founded in 2015 and headquartered in San Francisco, Deepgram serves enterprises seeking accurate, private, and customizable speech recognition without vendor lock-in.
