This job has expired

This position was posted on January 25, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

Site Reliability Engineer (SRE) - CloudVision

Arista Networks, Inc.

Job Overview

Location

Poland - Remote

Job Type

Full-time

Full Job Description

• Arista Networks is seeking a skilled and motivated Site Reliability Engineer (SRE) to join our team in Poland, working remotely. As an SRE at Arista, you will ensure the reliability, scalability, and performance of our CloudVision platform. This role suits engineers who combine software engineering expertise with a passion for operating complex production systems at global scale. You will join the team that manages and improves our global service fleet, with direct impact on our users' experience and the stability of our offerings.
• The CloudVision platform is deployed on Kubernetes across multiple global regions, using Google Kubernetes Engine (GKE) for container orchestration and Spinnaker for our Continuous Integration and Continuous Deployment (CI/CD) pipeline. Our primary distributed database and storage layer is HBase on Hadoop. Elasticsearch powers search, and ClickHouse serves real-time queries of flow data. Our own Kafka-based distributed real-time stream processing layer is central to our analytics, and TensorFlow provides Machine Learning analysis that drives intelligent features within CloudVision.
• Our monitoring system is built from open-source software (OSS) tools: Prometheus for metrics collection, Grafana for visualization, and Loki for log aggregation. As a Senior SRE, you will be responsible for the overall health and operational excellence of our global CloudVision service fleet, which covers a wide range of tasks to maintain and improve our production environment.
• You will build, deploy safely and incrementally, and operate critical production systems, with a primary focus on scalability, reliability, observability, performance, and security across all deployed services. This means not just shipping new features but also ensuring that deployments do not compromise the stability of the live environment.
• A significant part of your role will involve monitoring, supporting, and enhancing the product deployment experience across all our services. This means actively observing system performance, identifying potential issues before they impact users, and working to streamline the deployment process for our development teams.
• Building automation to eliminate repetitive tasks (toil) and efficiently operate production systems is a core tenet of SRE. You will develop and implement automated solutions to streamline operations, reduce manual intervention, and free up valuable engineering time for more strategic initiatives.
• Proactively monitoring, responding to, and enhancing alerts is crucial. You will be responsible for setting up intelligent alert handling mechanisms that can automatically address common issues, thereby minimizing downtime and improving response times.
• Creating and maintaining incident response runbooks will be a key responsibility. These documents are vital for ensuring a consistent and effective response to any production incidents, guiding the team through troubleshooting and resolution steps.
• You will design, build, and deploy new systems with scalability, reliability, and observability as paramount requirements from the outset. This proactive approach ensures that new infrastructure is robust and maintainable from day one.
• You will triage platform infrastructure issues and support Arista software engineers in their troubleshooting, acting as a bridge between infrastructure and development teams so that cross-functional problems are resolved quickly.
• Engaging with third-party vendor support will be necessary to resolve complex issues that involve external dependencies.
• Deploying new systems in a staged, controlled manner is essential for minimizing risk and ensuring successful integration into the production environment.
• Writing detailed postmortem documents after incidents and building solutions to prevent recurrence is a critical learning and improvement process. This involves deep analysis of root causes and implementing preventative measures.
• Planning and communicating maintenance windows on production systems to stakeholders will be a regular part of your duties, ensuring transparency and minimizing disruption.
• Collaborating closely with Arista's product development teams to identify infrastructural issues that act as bottlenecks or limitations in their workflows is vital for continuous improvement. You will then design and implement solutions to resolve these identified problems.
• Surveying and adopting best practices around infrastructure platforms is key to maintaining secure, scalable, and fault-tolerant systems. This includes staying abreast of industry trends and emerging technologies.
• Ongoing objectives include scaling systems to meet growing demand and improving fault tolerance and performance to increase system availability.
• Studying the design and implementation details of OSS systems used within our stack will enable you to perform better triage and resolution of issues, contributing to a deeper understanding of our technology ecosystem.

Skills & Technologies

Elasticsearch

Kubernetes

Kafka

TensorFlow

Prometheus

DevOps

Senior

Remote

About Arista Networks, Inc.

Arista Networks designs and sells cloud networking solutions built on its Extensible Operating System. The company provides high-performance 10/100/400 Gigabit Ethernet switches and software for large data center, campus, and routing environments. Founded in 2004 and headquartered in Santa Clara, California, Arista serves cloud providers, financial services, web companies, and enterprises worldwide. Its programmable platforms emphasize reliability, scalability, and open standards, enabling automation and network visibility across private, hybrid, and multi-cloud infrastructures.
