Staff Site Reliability Engineer – Automation and Platform

Cerebras Systems Inc.

Job Overview

Location

Remote Office; Sunnyvale, CA; Toronto, Ontario, Canada

Job Type

Full-time

Full Job Description

• The Staff Site Reliability Engineer – Automation and Platform at Cerebras Systems Inc. leads the engineering effort to eliminate toil at scale by driving the implementation of self-service delivery pipelines, shared observability, and common tooling for the world’s fastest AI inference service, powered by the Wafer-Scale Engine (WSE).
• Day-to-day responsibilities include:
  • Defining and implementing a robust strategy for delivering and running software reliably across multiple datacenters and cloud-based solutions.
  • Architecting self-service platforms and internal tooling for product teams, external customers, and cluster operators.
  • Defining and evolving reliability practices for inference workloads, including SLOs, SLIs, error budgets, blameless postmortems, chaos testing, and capacity forecasting.
  • Mentoring mid-level SREs, supporting critical incident escalations, and using production pain points to prioritize high-leverage automation work.
  • Measuring and driving impact through metrics such as toil reduction, deployment velocity, SLO compliance, MTTR, and adoption of self-service workflows.
• Cerebras Systems builds the world’s largest AI chip, 56 times larger than GPUs, with a wafer-scale architecture that delivers industry-leading training and inference speeds. The company’s customers include top model labs, global enterprises, and cutting-edge AI-native startups, and it has a multi-year partnership with OpenAI to deploy 750 megawatts of compute capacity for ultra-high-speed inference. Cerebras Inference offers the fastest generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services.
• In this role, you will learn to architect and deliver declarative GitOps-driven continuous delivery for model releases, capacity provisioning, and cluster upgrades; gain deep familiarity with a proprietary cloud control plane operating large-scale heterogeneous clusters; influence cross-functional stakeholders and lead complex projects end to end; mentor early-career SREs as platform engineers; and contribute to shifting reliability from an ops-only burden to a shared engineering discipline that underpins frontier AI inference at scale.

Skills & Technologies

Prometheus

Senior

Remote

About Cerebras Systems Inc.

Cerebras Systems builds wafer-scale AI processors and systems for accelerating deep-learning workloads. Its flagship WSE-3 chip, the largest silicon device ever produced, delivers petabyte-scale memory bandwidth and hundreds of thousands of cores on a single wafer. The company supplies CS-series appliances and cloud services to national labs, pharmaceutical firms, and hyperscalers, enabling training of trillion-parameter models with reduced latency and energy use compared to GPU clusters. Founded in 2015 and headquartered in Sunnyvale, California, Cerebras also provides software stacks, libraries, and consulting for AI deployment at scale.
