
Job Overview
Location: San Francisco
Job Type: Full-time
Category: DevOps & SysAdmin
Date Posted: June 19, 2026
Full Job Description
📋 Description
- Senior Site Reliability Engineer responsible for designing, implementing, and optimizing the scalable, secure, and robust infrastructure that powers Anyscale's cloud platform for distributed AI applications.
- Day-to-day tasks include building and scaling services that orchestrate Ray clusters across cloud and on-prem environments, optimizing control plane components for large-scale AI/ML workloads, developing intelligent scheduling and resource management systems, enhancing reliability and observability of Ray workloads, supporting accelerator integration (GPUs/TPUs), managing container images and dependencies, participating in code reviews and architecture discussions, providing on-call support, and collaborating with distributed systems and ML experts.
- Anyscale is a well-funded startup commercializing Ray, an open-source framework for scalable machine learning, backed by Andreessen Horowitz, NEA, and Addition with over $250M raised, enabling developers to scale ML applications from laptop to cluster without distributed systems expertise.
- The role offers the opportunity to work on open-source Ray, contribute to Anyscale's proprietary platform, integrate control and data planes, and deliver high-impact infrastructure features used by industry leaders like OpenAI, Uber, Spotify, and Cruise, while advancing expertise in cloud-native systems and AI infrastructure.
🎯 Requirements
- Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
- 3+ years of experience writing high-quality production code
- Hands-on experience building and maintaining highly available, scalable, and performant distributed systems
- Expertise in cloud-native technologies (AWS, Azure, GCP) and Kubernetes-based deployments
- Proficiency in Go and Python
- Familiarity with observability stacks (Prometheus, Grafana, etc.)
🏖️ Benefits
- Opportunity to work on open-source Ray and contribute to a widely adopted AI infrastructure platform
- Work with leading experts in distributed systems and machine learning from companies like OpenAI, Uber, and Spotify
- Impactful role in shaping the next generation of cloud infrastructure for scalable AI applications
- Collaborative, innovative environment at a well-funded startup with strong investor backing
- Exposure to cutting-edge technologies including GPU/TPU acceleration, container orchestration, and observability stacks
About Anyscale Inc.
Anyscale Inc. builds the Ray open-source distributed computing framework and offers a managed platform that lets data scientists and engineers scale machine-learning workloads from laptop to cloud without rewriting code. The company provides serverless infrastructure, observability, and cluster automation so teams can train, tune, and serve models faster. Founded in 2019 by the creators of Ray at UC Berkeley, Anyscale serves Fortune 500 enterprises and AI startups, enabling them to reduce cost and complexity while accelerating production deployment of large-scale AI applications.