This job has expired

This position was posted on March 12, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

Research Staff, Voice AI Foundations

Deepgram Inc.

Job Overview

Location

USA | Remote

Job Type

Full-time

Full Job Description

📋 Description

• Deepgram is at the forefront of the burgeoning trillion-dollar Voice AI economy, providing essential real-time APIs for speech-to-text (STT), text-to-speech (TTS), and the development of production-grade voice agents at an unprecedented scale.
• More than 200,000 developers and 1,300+ organizations rely on Deepgram to power their voice offerings, including industry leaders like Twilio, Cloudflare, Sierra, Decagon, Vapi, Daily, Cresta, Granola, and Jack in the Box.
• Our voice-native foundation models are accessible via cloud APIs or as self-hosted and on-premises software, delivering unparalleled accuracy, minimal latency, and exceptional cost efficiency.
• Backed by a recent Series C funding round led by top global investors and strategic partners, Deepgram has processed over 50,000 years of audio and transcribed more than 1 trillion words, solidifying our position as the world's foremost authority on voice technology.
• At Deepgram, we embrace an AI-first operating rhythm, where active use of and experimentation with advanced AI tools are not just encouraged but fundamental to how we innovate, operate, and measure performance.
• Every team member is expected to leverage and experiment with cutting-edge AI, integrating it into daily workflows and pushing the boundaries of what's possible.
• Success is measured by the effective application of AI to achieve results, making consistent and creative use of the latest AI capabilities paramount.
• Candidates must be comfortable rapidly adopting new models and modes, seamlessly integrating AI into their work, and continuously exploring new technological frontiers.
• We operate at the rapid pace of AI evolution, meaning day-to-day responsibilities can change quickly. This role is ideal for individuals who thrive on experimentation, adaptation, quick thinking, and continuous learning, rather than those seeking a prescriptive, traditional 9-to-5 structure.
• The opportunity lies in addressing the fundamental challenges of voice AI: current sequence modeling paradigms struggle to deliver voice AI capable of universal human interaction due to data scarcity and extreme diversity in audio.
• Real-world audio data is vast and varied, encompassing a wide spectrum of voices, speaking styles, and acoustic conditions, which makes models computationally and storage-intensive to train and deploy at a global scale.
• We believe that novel paradigms for audio AI are essential to overcome these data, scale, and cost hurdles, making voice interaction universally accessible.
• As a Member of the Research Staff, you will be instrumental in pioneering Latent Space Models (LSMs), a groundbreaking approach designed to tackle the core data, scale, and cost challenges inherent in building robust, contextualized voice AI.
• Your research will be pivotal in solving critical problems such as building next-generation neural audio codecs for extreme, low bit-rate compression and high-fidelity reconstruction across a diverse, world-scale audio corpus.
• You will pioneer steerable generative models capable of synthesizing the full spectrum of human speech from codec latent representations, encompassing casual conversation, intense emotional expression, and complex multi-speaker scenarios with background noise and overlapping speech.
• Develop advanced embedding systems that effectively factorize the codec latent space into interpretable dimensions, including speaker, content, style, environment, and channel effects, enabling precise control and massive amplification of existing datasets through “latent recombination”.
• Leverage latent recombination to generate synthetic audio data at unprecedented scales, thereby unlocking new paradigms for joint model and data scaling in audio.
• Endeavor to train multimodal speech-to-speech systems that can understand any human, regardless of demographics, state, or environment, and produce empathic, human-like responses to achieve conversational or task-oriented objectives.
• Design innovative model architectures, training schemes, and inference algorithms optimized for bare-metal hardware, facilitating cost-efficient training on billion-hour datasets and enabling real-time inference for millions of concurrent conversations.
• The challenge involves identifying and tackling 'unsolved' problems by pioneering entirely new approaches, demonstrating the vision to scale successful proofs-of-concept exponentially.
• Researchers are expected to identify the single critical experiment that validates or invalidates an idea rapidly, within days rather than months.
• A core aspect of this role is an obsession with using AI to automate and amplify personal impact, driving efficiency and innovation.
• This position demands an intense focus on the problems, creative problem-solving, and a relentless pursuit of elegant, scalable solutions.
• The technical challenges are substantial, but the potential for transformative impact on the field of voice AI is immense.
• Successful candidates will possess a strong mathematical foundation in statistical learning theory, with a particular emphasis on self-supervised and multimodal learning.
• Deep expertise in foundation model architectures and the ability to scale training across multiple modalities are crucial.
• A proven track record of bridging theoretical concepts with practical implementation, including deriving novel mathematical formulations and implementing them efficiently, is essential.
• Demonstrated ability to build robust data pipelines capable of processing and curating massive datasets while maintaining quality and diversity is required.
• Candidates must have a history of designing controlled experiments to isolate the impact of architectural innovations and validate theoretical insights.
• Experience in optimizing models for real-world deployment, including an understanding of hardware constraints and efficiency techniques, is highly valued.
• A history of open-source contributions or impactful research publications that have advanced the state of the art in speech/language AI is a significant advantage.

Skills & Technologies

Senior

Remote

Degree Required

About Deepgram Inc.

Deepgram builds end-to-end speech AI infrastructure that converts live or recorded audio into text and insights. The company trains large-scale neural networks on GPU clusters to deliver low-latency transcription, keyword detection, and speaker diarization through a single API. Developers use the platform for call centers, meetings, podcasts, and voice bots, paying per minute or hosting the engine on-premises. Founded in 2015 and headquartered in San Francisco, Deepgram serves enterprises seeking accurate, private, and customizable speech recognition without vendor lock-in.
