This job has expired

This position was posted on January 25, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

Site Reliability Engineer

Arista Networks, Inc.

Job Overview

Location

Remote

Job Type

Full-time

Full Job Description

📋 Description

• Arista Networks is seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our dynamic team. In this pivotal role, you will be instrumental in ensuring the robust operation, scalability, and reliability of our global service fleet, specifically focusing on our flagship CloudVision platform. You will collaborate with a talented group of SREs who blend deep software and systems engineering expertise with a genuine passion for operating complex production environments at an unprecedented scale.
• As an SRE at Arista, you will manage and enhance our technology stack. CloudVision is deployed on Kubernetes across multiple global regions, with Spinnaker driving the CI/CD pipeline. Our infrastructure runs on Google Kubernetes Engine (GKE) and uses HBase Hadoop as the primary distributed database and storage layer. We use Elasticsearch to power search and ClickHouse for high-speed, real-time queries of flow data. Our proprietary Kafka-based distributed real-time stream processing layer is crucial for our analytics, and we use TensorFlow for machine learning analysis.
• Your responsibilities will extend to maintaining and evolving our monitoring system, which is built on open-source tools such as Prometheus, Grafana, and Loki. This system gives us deep visibility into the health and performance of our services.
• In this Senior SRE position, you will take ownership of our global CloudVision service fleet. This encompasses a wide range of critical activities designed to ensure maximum uptime and performance. You will be responsible for the safe, incremental build, deployment, and ongoing operation of these vital production systems, with an unwavering focus on scalability, reliability, observability, performance, and security. This means not just deploying code, but ensuring it runs flawlessly and efficiently under all conditions.
• A significant part of your role will involve monitoring, supporting, and continuously enhancing the product deployment experience across all our services. This includes identifying and resolving any friction points in the deployment process, ensuring a smooth and efficient path from development to production.
• You will be a key driver in building automation to eliminate repetitive tasks (toil) and to efficiently operate our production systems. This proactive approach to automation is fundamental to our SRE philosophy, allowing us to scale our operations without a proportional increase in manual effort.
• Your duties will include proactively monitoring system health, responding swiftly to incidents, and enhancing our alerting mechanisms. You will also be responsible for setting up automated alert handling to ensure rapid and effective responses to potential issues, minimizing downtime and impact.
• Creating and maintaining comprehensive incident response runbooks will be a crucial aspect of your work. These documents are vital for ensuring consistent and effective responses during critical events, enabling the team to act decisively and efficiently.
• You will be involved in building and deploying new systems, with scalability, reliability, and observability as non-negotiable primary requirements from the outset. This ensures that new services are designed for success from day one.
• Triaging and resolving platform infrastructure issues will be a core responsibility. You will also support Arista's software engineers in their own triage efforts, fostering a collaborative environment for problem-solving.
• Engaging with third-party vendor support will be necessary to resolve complex issues that span different technology providers.
• You will deploy new systems in a staged, controlled manner to mitigate risks and ensure smooth integration into the production environment.
• Writing detailed postmortem documents and subsequently building solutions to prevent similar incidents from recurring is a key part of our continuous improvement process. This commitment to learning from incidents is paramount.
• Planning and communicating maintenance windows on production systems will be essential to ensure transparency and minimize disruption to our users.
• You will work closely with Arista's product development teams to identify infrastructural issues that may be causing bottlenecks or limitations in their workflows. Your insights will directly influence product development and performance.
• You will design and implement solutions to resolve these identified infrastructural challenges, directly impacting the efficiency and capabilities of our development teams.
• Surveying and adopting best practices around infrastructure platforms is crucial for maintaining secure, scalable, and fault-tolerant systems. You will be a champion for adopting new technologies and methodologies that enhance our operational excellence.
• Implementing solutions to scale our systems effectively is a primary objective, ensuring we can meet growing demand.
• You will focus on improving system fault-tolerance and performance to enhance the overall availability and resilience of our services.
• Studying the design and implementation details of open-source systems will be necessary for more effective triage and resolution of issues within those components.
• This role offers a unique opportunity to work with a world-class team on challenging problems at the intersection of software and infrastructure, contributing directly to the success of Arista's innovative networking solutions.

Skills & Technologies

Elasticsearch

Kubernetes

Kafka

TensorFlow

Prometheus

Remote

About Arista Networks, Inc.

Arista Networks designs and sells cloud networking solutions built on its Extensible Operating System. The company provides high-performance 10/100/400 Gigabit Ethernet switches and software for large data center, campus, and routing environments. Founded in 2004 and headquartered in Santa Clara, California, Arista serves cloud providers, financial services, web companies, and enterprises worldwide. Its programmable platforms emphasize reliability, scalability, and open standards, enabling automation and network visibility across private, hybrid, and multi-cloud infrastructures.
