
Job Overview
Location: San Francisco, California, USA
Job Type: Full-time
Category: Software Engineer
Date Posted: March 7, 2026
Full Job Description
📋 Description
• Fluidstack builds infrastructure for abundant intelligence, partnering with leading AI labs, governments, and enterprises to accelerate progress toward Artificial General Intelligence (AGI). We deliver world-class infrastructure with urgency, treat our customers' outcomes as our own, and earn trust through the systems we build. If you are purpose-driven, obsessed with excellence, and ready to help advance the future of intelligence, join us in shaping what comes next.
• The Inference Platform team addresses the cost and latency bottlenecks that now define frontier AI: inference. The team owns the serving layer that connects Fluidstack's global accelerator supply to customers' production workloads, spanning LLM serving frameworks, KV cache infrastructure, disaggregated prefill/decode pipelines, and the Kubernetes orchestration that runs these systems across multi-datacenter footprints (a sketch of the prefill/decode split follows this list).
• This is a hands-on individual contributor role at the intersection of distributed systems, model optimization, and serving infrastructure. You will own end-to-end inference deployments for frontier AI labs and for Fluidstack's inference product, drive improvements in throughput, cost-per-token, and time-to-first-token (TTFT), and help shape how Fluidstack deploys and scales services across tens of thousands of accelerators.
• You will own inference deployments end to end, from initial setup and performance tuning through production Service Level Agreement (SLA) maintenance and incident response, delivering measurable gains in throughput, TTFT, and cost-per-token across model families (dense transformers, mixture-of-experts (MoE), and multi-modal) and diverse customer workload patterns (worked metric definitions follow this list).
• You will build and operate KV cache and scheduling infrastructure that maximizes utilization across concurrent inference requests, and implement and validate the disaggregated prefill/decode pipelines and Kubernetes orchestration that support these systems at scale (see the block-pool sketch after this list).
• You will profile and resolve performance bottlenecks at the compute, memory, and communication layers, and instrument deployments for end-to-end observability so issues can be identified and resolved proactively (an instrumentation sketch follows this list).
• You will work closely with customers to translate their model architectures, access patterns, and latency requirements into effective deployment configurations, and feed those insights back into upstream platform improvements.
• You will contribute to the inference platform's architecture and roadmap, with a focus on simpler deployment, better hardware utilization, and support for new model classes and accelerator types.
• You will join an on-call rotation, typically up to one week per month, to keep all production deployments reliable and within SLA.
• You will work with state-of-the-art AI technologies on the foundational infrastructure powering the next generation of intelligent systems, as part of a motivated, committed team pushing the boundaries of what's possible in AI infrastructure.
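The sketches below are illustrative Python only, not Fluidstack's actual code; every class, function, and metric name in them is a hypothetical stand-in. First, the prefill/decode disaggregation mentioned above: prefill is a compute-bound pass over the full prompt that produces a KV cache, and decode is a memory-bound loop that extends that cache one token at a time. Splitting the phases onto separate workers lets each run on hardware sized for its own bottleneck.

```python
# Minimal sketch of disaggregated prefill/decode; PrefillWorker,
# DecodeWorker, and KVCache are hypothetical names, not a real API.
from dataclasses import dataclass, field

@dataclass
class KVCache:
    """Per-request key/value state: produced by prefill, extended by decode."""
    request_id: str
    blocks: list = field(default_factory=list)  # opaque per-token KV blocks

class PrefillWorker:
    """Compute-bound pass over the whole prompt, run once per request."""
    def prefill(self, request_id: str, prompt_tokens: list) -> KVCache:
        cache = KVCache(request_id)
        # Stand-in for a forward pass; a real server would store attention
        # keys/values per layer per token here.
        cache.blocks = [("kv", tok) for tok in prompt_tokens]
        return cache

class DecodeWorker:
    """Memory-bound token-by-token loop over a transferred KV cache."""
    def decode(self, cache: KVCache, max_new_tokens: int) -> list:
        generated = []
        for step in range(max_new_tokens):
            token = f"tok{step}"                # stand-in for one decode step
            cache.blocks.append(("kv", token))  # decode grows the cache
            generated.append(token)
        return generated

prefill, decode = PrefillWorker(), DecodeWorker()
cache = prefill.prefill("req-1", ["Hello", ",", "world"])
print(decode.decode(cache, max_new_tokens=4))
```

In a real deployment, transferring the KV cache between the two worker pools is the hard part; the sketch elides it entirely.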
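The headline metrics in the responsibilities above have simple operational definitions. A back-of-the-envelope sketch, assuming per-request timestamps are logged; the record fields and pricing inputs are illustrative assumptions:

```python
# Hypothetical metric definitions over request logs; field names are
# assumptions, not a real schema.
from dataclasses import dataclass

@dataclass
class RequestRecord:
    arrival_s: float      # request received (seconds)
    first_token_s: float  # first output token emitted
    done_s: float         # last output token emitted
    output_tokens: int

def ttft_s(r: RequestRecord) -> float:
    """Time-to-first-token: arrival to first emitted token."""
    return r.first_token_s - r.arrival_s

def throughput_tok_per_s(records: list) -> float:
    """Aggregate output-token throughput over the observed window."""
    tokens = sum(r.output_tokens for r in records)
    window = max(r.done_s for r in records) - min(r.arrival_s for r in records)
    return tokens / window

def cost_per_token_usd(records: list, gpu_hourly_usd: float, gpus: int) -> float:
    """Fleet cost divided by output tokens over the same window."""
    window_h = (max(r.done_s for r in records)
                - min(r.arrival_s for r in records)) / 3600
    tokens = sum(r.output_tokens for r in records)
    return gpu_hourly_usd * gpus * window_h / tokens

recs = [RequestRecord(0.0, 0.15, 2.0, 64), RequestRecord(0.5, 0.75, 3.0, 96)]
print(f"TTFT={ttft_s(recs[0]):.2f}s  "
      f"throughput={throughput_tok_per_s(recs):.1f} tok/s  "
      f"cost/tok=${cost_per_token_usd(recs, 2.50, 8):.6f}")
```

Note the coupling: higher batch concurrency usually raises throughput and lowers cost-per-token but can worsen TTFT, which is why the three are tuned together.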
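On the KV cache and scheduling point: a common pattern for maximizing utilization across concurrent requests is a shared pool of fixed-size cache blocks, in the spirit of paged KV caching. A minimal sketch; KVBlockPool and its methods are hypothetical:

```python
# Hypothetical shared block pool; admission waits when the pool is
# exhausted, and completed requests return their blocks immediately.
class KVBlockPool:
    """Fixed pool of KV-cache blocks shared by all in-flight requests."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.owned = {}  # request_id -> list of block ids

    def can_admit(self, blocks_needed: int) -> bool:
        return len(self.free) >= blocks_needed

    def allocate(self, request_id: str, blocks_needed: int) -> list:
        if not self.can_admit(blocks_needed):
            raise MemoryError("KV pool exhausted; request must queue")
        blocks = [self.free.pop() for _ in range(blocks_needed)]
        self.owned.setdefault(request_id, []).extend(blocks)
        return blocks

    def release(self, request_id: str) -> None:
        # Freed blocks are instantly reusable by queued requests, which is
        # what keeps accelerator memory utilization high under load.
        self.free.extend(self.owned.pop(request_id, []))

pool = KVBlockPool(num_blocks=8)
pool.allocate("req-1", 3)
pool.allocate("req-2", 4)
print(pool.can_admit(2))   # False: only 1 block free
pool.release("req-1")
print(pool.can_admit(2))   # True once req-1's blocks return
```

A scheduler built on can_admit() can then decide between admitting, queueing, or preempting requests based on pool pressure.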
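On instrumentation: one standard approach to end-to-end observability is exporting latency histograms for scraping. The sketch below uses the open-source prometheus_client library; the metric name, bucket boundaries, and serve_request function are illustrative choices, not an existing service:

```python
# Illustrative TTFT instrumentation with prometheus_client; requires
# `pip install prometheus-client`.
import time
from prometheus_client import Histogram, start_http_server

TTFT_SECONDS = Histogram(
    "inference_ttft_seconds",
    "Time from request arrival to first output token",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def serve_request(prompt: str) -> str:
    start = time.monotonic()
    time.sleep(0.01)  # stand-in for the prefill pass
    TTFT_SECONDS.observe(time.monotonic() - start)  # record at first token
    # ... a real handler would stream the remaining decode tokens ...
    return "first-token"

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for a Prometheus scraper
    serve_request("hello")
```

Histogram buckets like these are what let SLA dashboards alert on, say, p99 TTFT rather than on averages.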
About FluidStack Inc.
FluidStack Inc. operates a distributed cloud platform that aggregates under-utilized GPUs in data centers and individual machines worldwide, renting them on-demand to AI researchers, startups, and enterprises for training and inference workloads. The company automates deployment, security, and billing, offering prices up to 80% below traditional hyperscalers while providing instant access to high-end NVIDIA A100, H100, and consumer GPUs through a single API and web console. Headquartered in London, FluidStack targets machine-learning engineers who need scalable, low-cost compute without long-term commitments, claiming thousands of active nodes and customers including Fortune 500 enterprises and leading research labs.