This job has expired

This position was posted on May 21, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

Staff Site Reliability Engineer

Replit, Inc.

Job Overview

Location: Remote - Europe

Job Type: Full-time

Full Job Description

📋 Description

• Architect, build, and lead the implementation of comprehensive monitoring, logging, and tracing solutions to provide real-time visibility into Replit’s globally distributed infrastructure serving millions of developers.
• Define, implement, and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) in collaboration with product and engineering teams, ensuring reliability standards are met while balancing innovation velocity.
• Lead incident response for high-impact system outages, directing cross-functional teams to rapid resolution, conducting blameless post-mortems, and driving implementation of preventative automation to reduce Mean Time To Recovery (MTTR).
• Design and maintain infrastructure as code using Terraform or Pulumi to automate provisioning, configuration, and lifecycle management of cloud resources, eliminating manual toil and operational overhead.
• Optimize performance of large-scale Kubernetes clusters on Google Cloud Platform (GCP), identifying and resolving latency, resource contention, and scalability bottlenecks across global regions.
• Debug and harden distributed systems by analyzing complex failures across the entire stack, from application code to networking and storage, and implementing long-term fixes to improve system robustness and operability.
• Review feature and system designs across the engineering organization to ensure reliability, scalability, security, and operational integrity are embedded from inception.
• Mentor and educate engineers at all levels to cultivate a culture where reliability is a shared responsibility and a core value, not just an SRE function.
• Write high-quality, well-tested code in Python or Go to build internal tools, automate operational workflows, and integrate with third-party services.
• Build and refine runbooks, alerting policies, and self-healing systems that automatically detect and remediate common failure modes without human intervention.
• Implement capacity planning strategies to anticipate growth, prevent over-provisioning, and ensure consistent performance under varying loads.
• Contribute to the continuous improvement of CI/CD pipelines to ensure rapid, safe, and reliable deployment of infrastructure and application changes.
• Maintain and enhance observability platforms including metrics, logs, and traces using industry-standard tools to enable proactive issue detection and root cause analysis.
• Collaborate closely with core infrastructure and product teams to align system design with operational realities and business objectives.
• Advocate for and model open, transparent communication practices, ensuring incidents, improvements, and lessons learned are documented and shared across the engineering organization.

🎯 Requirements

• 8-10 years of experience in Site Reliability Engineering or similar roles (e.g., DevOps, Systems Engineering, Infrastructure Engineering)
• Strong programming skills in Python or Go, with a track record of writing high-quality, well-tested code
• Deep understanding of distributed systems, including design, scaling, and maintenance of production services
• Deep experience with Kubernetes and cloud-native technologies, specifically on Google Cloud Platform (GCP)
• Proven track record of designing and implementing sophisticated observability solutions (metrics, logging, tracing)
• Strong incident management skills with experience leading response for complex, high-impact outages

🏖️ Benefits

• Competitive Salary & Equity
• 401(k) Program with a 4% match
• Health, Dental, Vision and Life Insurance
• Short-Term and Long-Term Disability
• Paid Parental, Medical, and Caregiver Leave
• Commuter Benefits
• Monthly Wellness Stipend
• Autonomous Work Environment
• In-Office Setup Reimbursement
• Flexible Time Off (FTO) + Holidays
• Quarterly Team Gatherings
• In-Office Amenities

Skills & Technologies

Python

GCP

Docker

Kubernetes

Senior

Remote

About Replit, Inc.

Replit is an online, collaborative, integrated development environment (IDE) that allows users to write, run, and share code in numerous programming languages directly from their web browser. It provides a cloud-based platform, eliminating the need for local setup and dependencies. Replit supports real-time collaboration, enabling multiple users to code together simultaneously on the same project, making it ideal for educational purposes, team projects, and rapid prototyping. The platform offers a vast array of features including version control integration, package management, and deployment tools, democratizing software development for beginners and experienced programmers alike.
