
Job Overview
Location
San Francisco, CA
Job Type
Full-time
Category
Machine Learning Engineer
Date Posted
May 8, 2026
Full Job Description
📋 Description
- Machine Learning Engineer, Inference & Serving (Speech LLM): build and deploy high-throughput, ultra-low-latency inference engines for large language and speech models powering Plaud's AI work companion, used by over 1.5M people globally.
- Day-to-day responsibilities include optimizing latency, throughput, and Time-To-First-Token in real-time streaming environments; implementing continuous batching and KV cache management (e.g., PagedAttention); and working with GPU architectures (NVIDIA Ampere/Hopper) to eliminate hardware bottlenecks.
- Plaud is a bootstrapped, profitable, San Francisco-based AI company with a $250M revenue run rate and SOC 2, HIPAA, GDPR, and ISO 27001 compliance, building trusted AI work companions through hardware-software integration that captures and uses human intelligence from speech, audio, and thought.
- The role offers the chance to join the founding SpeechLLM lab, work at the intersection of ML training and backend infrastructure, gain exposure to cutting-edge AI serving techniques, and grow in a culture of continuous learning, innovation, and fast career development with global impact.
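For candidates unfamiliar with the latency metrics named above, the sketch below illustrates what Time-To-First-Token means in a streaming setting: TTFT is dominated by the prompt prefill, while throughput is set by the per-token decode loop. This is a toy simulation, not Plaud's stack; the delays and function names are hypothetical.

```python
import time

def stream_tokens(n_tokens, prefill_s, per_token_s):
    """Simulate a streaming LLM response: one prefill delay, then a steady decode loop."""
    time.sleep(prefill_s)            # prompt prefill happens before the first token
    for i in range(n_tokens):
        time.sleep(per_token_s)      # each decode step emits one token
        yield f"tok{i}"

def measure(n_tokens, prefill_s, per_token_s):
    """Return (TTFT in seconds, end-to-end throughput in tokens/sec)."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream_tokens(n_tokens, prefill_s, per_token_s):
        if ttft is None:
            ttft = time.perf_counter() - start   # latency until the first token arrives
        count += 1
    total = time.perf_counter() - start
    return ttft, count / total

ttft, tps = measure(n_tokens=20, prefill_s=0.05, per_token_s=0.005)
print(f"TTFT: {ttft * 1000:.0f} ms, throughput: {tps:.0f} tok/s")
```

The tradeoff the posting mentions falls out of this model: batching more requests raises aggregate throughput but lengthens prefill queuing, which inflates each user's TTFT.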
🎯 Requirements
- Hands-on experience building and deploying high-throughput, ultra-low-latency inference engines for large language models or foundational speech models
- Understanding of tradeoffs between latency, throughput, and Time-To-First-Token in real-time streaming environments
- Practical experience with continuous batching, KV cache management (e.g., PagedAttention), and stateful connections for real-time conversational AI
- Deep understanding of GPU architectures (NVIDIA Ampere/Hopper) and memory hierarchy to identify and eliminate hardware bottlenecks
- Ability to communicate clearly and collaborate effectively between ML training and backend infrastructure teams
- Experience with frontier serving frameworks like vLLM, TensorRT-LLM, SGLang, or NVIDIA Triton Inference Server (nice-to-have)
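To make the PagedAttention-style KV cache management in the requirements concrete, here is a toy block allocator showing the core idea: each sequence's KV cache grows in fixed-size blocks drawn from a shared pool, committing GPU memory on demand instead of reserving a contiguous maximum-length region per request. This is an illustrative sketch only; the class and method names are invented and real serving engines (e.g., vLLM) implement this on the GPU with far more machinery.

```python
class PagedKVCache:
    """Toy block allocator in the spirit of PagedAttention: each sequence's
    KV cache grows in fixed-size blocks drawn from a shared free pool, so
    memory is committed on demand rather than reserved contiguously up front."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # shared pool of physical blocks
        self.block_tables = {}                      # seq_id -> list of block ids
        self.seq_len = {}                           # seq_id -> tokens cached so far

    def append_token(self, seq_id):
        """Reserve KV-cache space for one more token of this sequence."""
        n = self.seq_len.get(seq_id, 0)
        if n % self.block_size == 0:                # last block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted: preempt or swap a sequence")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_len[seq_id] = n + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_len.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):                                 # 20 tokens -> ceil(20/16) = 2 blocks
    cache.append_token("seq-a")
print(len(cache.block_tables["seq-a"]), "blocks used,",
      len(cache.free_blocks), "blocks free")        # -> 2 blocks used, 2 blocks free
```

Because blocks are freed back to the pool the moment a sequence finishes, continuous batching can admit new requests immediately, which is how high throughput and low fragmentation coexist.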
🏖️ Benefits
- Competitive compensation: $180K–$270K base salary + performance bonus + equity
- Comprehensive benefits: top-tier healthcare (medical, dental, vision) with employer subsidy
- Retirement planning: 401(k) plan with company matching
- Paid time off: unlimited PTO plus 13 paid holidays
- New parent leave: 12 weeks of paid time off regardless of gender
- Hybrid office: minimum 3x in-office per week; gear perks include choice of top-of-the-line laptops/workstations
About PLAUD AI INC.
PLAUD AI INC. builds AI-powered voice and note-taking hardware. Its flagship Plaud Note records phone calls and meetings, transcribes them in real time, and generates summaries using GPT-4o. The credit-card-sized device attaches to iPhone or Android, stores encrypted audio locally or in the cloud, and integrates with Notion, Slack, and Google Docs. Founded in 2023 and based in San Francisco, the company sells direct to consumers and enterprises through plaud.ai, offering subscription plans for advanced AI features and multi-language support.