
Job Overview
Location
Remote
Job Type
Full-time
Category
DevOps
Date Posted
January 30, 2026
Full Job Description
📋 Description
- Join Scale AI as a Site Reliability Engineer (SRE) and become a cornerstone of our expanding technological infrastructure. As our product portfolio and customer base grow exponentially, we are seeking highly skilled and motivated SREs to manage, maintain, and optimize our critical physical infrastructure. This role is pivotal in ensuring the seamless operation of our on-site technical facilities, including our advanced robot stations, robust server environments, essential computer systems, and intricate network installations.
- In this dynamic position, you will be at the forefront of operational excellence, responsible for the day-to-day health, performance, and reliability of our physical hardware and network. Your expertise will directly impact our ability to deliver cutting-edge AI solutions to a global market, ensuring that our physical assets are as sophisticated and dependable as our software.
- You will play a crucial role in ensuring our on-site technical facilities operate smoothly, acting as the primary guardian of our physical technology stack. This involves proactive monitoring, rapid incident response, and meticulous preventative maintenance to minimize downtime and maximize efficiency.
- A key aspect of your responsibility will be coordinating resource usage across various projects and teams. This includes managing hardware allocation, optimizing power consumption, and ensuring that all physical resources are utilized effectively to support our ambitious development and operational goals.
- You will provide essential hands-on support to our remote engineering teams, acting as their eyes and ears on the ground. This involves troubleshooting hardware issues, performing installations and upgrades, and facilitating remote access and diagnostics for complex problems.
- Your responsibilities will extend to designing, implementing, and maintaining scalable and resilient infrastructure solutions. This includes working with server hardware, networking equipment, and specialized robotics hardware, ensuring they meet the demands of our rapidly evolving AI platform.
- You will be instrumental in developing and enforcing best practices for infrastructure management, security, and disaster recovery. This includes creating and maintaining comprehensive documentation, runbooks, and standard operating procedures to ensure consistency and knowledge sharing across the team.
- Collaborate closely with software engineering, hardware engineering, and operations teams to understand their infrastructure needs and provide tailored solutions. You will be a bridge between the physical and digital realms, ensuring that our hardware infrastructure perfectly supports our software development and deployment pipelines.
- Participate in on-call rotations to provide 24/7 support for critical infrastructure issues, demonstrating a commitment to maintaining high availability and rapid problem resolution.
- Drive continuous improvement initiatives by identifying bottlenecks, inefficiencies, and potential risks within the infrastructure. Propose and implement solutions that enhance reliability, performance, and cost-effectiveness.
- Gain deep insights into the operational challenges of deploying and scaling AI technologies in real-world environments, contributing to the strategic direction of our physical infrastructure.
- You will be empowered to make significant contributions to the architecture and design of our physical infrastructure, influencing decisions that will shape the future of Scale AI's operational capabilities.
- This role offers a unique opportunity to work with state-of-the-art technology in a fast-paced, innovative environment. You will be part of a team that values collaboration, technical excellence, and a passion for solving complex problems.
- We are looking for individuals who are not only technically proficient but also possess a strong sense of ownership, a proactive mindset, and excellent communication skills. Your ability to work independently and as part of a team will be crucial for success in this role.
- Embrace the challenge of maintaining and evolving a complex, distributed physical infrastructure that powers the future of artificial intelligence. Your work will have a direct and tangible impact on our ability to innovate and lead in the AI industry.
- Contribute to the development and implementation of automation strategies for infrastructure provisioning, configuration management, and routine maintenance tasks, reducing manual effort and increasing operational efficiency.
- Monitor system performance, identify trends, and implement optimizations to ensure optimal resource utilization and application performance.
- Manage and maintain physical security of data centers and robot stations, ensuring compliance with company policies and industry standards.
- Troubleshoot and resolve complex hardware and network issues, often requiring deep dives into system logs, performance metrics, and physical component diagnostics.
- Work with vendors to procure, install, and maintain hardware and network components, ensuring timely delivery and quality service.
- Apply infrastructure as code (IaC) principles to the management of physical resources where applicable, enhancing reproducibility and scalability.
- Participate in capacity planning to ensure our infrastructure can meet future demands, working closely with product and engineering teams.
- Ensure all infrastructure components are properly documented, including network diagrams, server configurations, and maintenance procedures.
- Be a key player in ensuring the physical infrastructure is robust, reliable, and ready to support the next generation of AI advancements at Scale AI.
🎯 Requirements
- Proven experience in Site Reliability Engineering, DevOps, or a related infrastructure management role.
- Strong understanding of Linux/Unix operating systems, including performance tuning and troubleshooting.
- Hands-on experience with managing and maintaining physical server hardware, networking equipment (routers, switches, firewalls), and data center operations.
- Familiarity with scripting languages (e.g., Python, Bash) for automation tasks.
- Experience with monitoring tools and techniques (e.g., Prometheus, Grafana, Nagios).
- Ability to provide hands-on support for hardware installation, troubleshooting, and repair.
- Excellent problem-solving and analytical skills, with a methodical approach to diagnosing and resolving complex issues.
- Strong communication and interpersonal skills, with the ability to collaborate effectively with remote engineering teams.
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
🏖️ Benefits
- Competitive salary and stock options.
- Comprehensive health, dental, and vision insurance.
- Generous paid time off and holidays.
- Opportunities for professional development and continuous learning.
- A dynamic and collaborative work environment with a talented team.
- The chance to work on cutting-edge AI technology and make a significant impact.
Skills & Technologies
DevOps
Remote
About Scale AI
Scale AI provides data infrastructure for artificial intelligence, offering data labeling, model evaluation, and generative AI application development tools for enterprises and government agencies. Founded in 2016 and headquartered in San Francisco, the company supplies high-quality training datasets to automotive, defense, e-commerce, and technology customers, enabling deployment of accurate machine-learning models at scale.

