Staff Site Reliability Engineer

Job Overview

Location: Remote - Europe

Job Type: Full-time

Category: Software Engineering

Date Posted: May 21, 2026

Full Job Description

📋 Description

  • Architect, build, and lead the implementation of comprehensive monitoring, logging, and tracing solutions to provide real-time visibility into Replit’s globally distributed infrastructure serving millions of developers.
  • Define, implement, and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) in collaboration with product and engineering teams, ensuring reliability standards are met while balancing innovation velocity.
  • Lead incident response for high-impact system outages, directing cross-functional teams to rapid resolution, conducting blameless post-mortems, and driving implementation of preventative automation to reduce Mean Time To Recovery (MTTR).
  • Design and maintain infrastructure as code using Terraform or Pulumi to automate provisioning, configuration, and lifecycle management of cloud resources, eliminating manual toil and operational overhead.
  • Optimize performance of large-scale Kubernetes clusters on Google Cloud Platform (GCP), identifying and resolving latency, resource contention, and scalability bottlenecks across global regions.
  • Debug and harden distributed systems by analyzing complex failures across the entire stack, from application code to networking and storage, and implementing long-term fixes to improve system robustness and operability.
  • Review feature and system designs across the engineering organization to ensure reliability, scalability, security, and operational integrity are embedded from inception.
  • Mentor and educate engineers at all levels to cultivate a culture where reliability is a shared responsibility and core value, not just an SRE function.
  • Write high-quality, well-tested code in Python or Go to build internal tools, automate operational workflows, and integrate with third-party services.
  • Build and refine runbooks, alerting policies, and self-healing systems that automatically detect and remediate common failure modes without human intervention.
  • Implement capacity planning strategies to anticipate growth, prevent over-provisioning, and ensure consistent performance under varying loads.
  • Contribute to the continuous improvement of CI/CD pipelines to ensure rapid, safe, and reliable deployment of infrastructure and application changes.
  • Maintain and enhance observability platforms, including metrics, logs, and traces, using industry-standard tools to enable proactive issue detection and root cause analysis.
  • Collaborate closely with core infrastructure and product teams to align system design with operational realities and business objectives.
  • Advocate for and model open, transparent communication practices, ensuring incidents, improvements, and lessons learned are documented and shared across the engineering organization.

🎯 Requirements

  • 8-10 years of experience in Site Reliability Engineering or similar roles (e.g., DevOps, Systems Engineering, Infrastructure Engineering)
  • Strong programming skills in Python or Go, with a track record of writing high-quality, well-tested code
  • Deep understanding of distributed systems, including design, scaling, and maintenance of production services
  • Deep experience with Kubernetes and cloud-native technologies, specifically on Google Cloud Platform (GCP)
  • Proven track record of designing and implementing sophisticated observability solutions (metrics, logging, tracing)
  • Strong incident management skills with experience leading response for complex, high-impact outages

🏖️ Benefits

  • Competitive Salary & Equity
  • 401(k) Program with a 4% match
  • Health, Dental, Vision and Life Insurance
  • Short Term and Long Term Disability
  • Paid Parental, Medical, Caregiver Leave
  • Commuter Benefits
  • Monthly Wellness Stipend
  • Autonomous Work Environment
  • In Office Set-Up Reimbursement
  • Flexible Time Off (FTO) + Holidays
  • Quarterly Team Gatherings
  • In Office Amenities

Skills & Technologies

Python
Go
GCP
Docker
Kubernetes
Senior
Remote

About Replit, Inc.

Replit is an online, collaborative, integrated development environment (IDE) that allows users to write, run, and share code in numerous programming languages directly from their web browser. It provides a cloud-based platform, eliminating the need for local setup and dependencies. Replit supports real-time collaboration, enabling multiple users to code together simultaneously on the same project, making it ideal for educational purposes, team projects, and rapid prototyping. The platform offers a vast array of features including version control integration, package management, and deployment tools, democratizing software development for beginners and experienced programmers alike.
