This job has expired

This position was posted on November 19, 2025 and is likely no longer accepting applications. We've kept it here for historical reference.

Senior Site Reliability Engineer

ScienceLogic, Inc.

Job Overview

Location: Remote

Job Type: Full-time

Full Job Description

📋 Description

• Own the reliability, scalability, and performance of ScienceLogic’s cloud-native AIOps platform, serving thousands of enterprises and managed service providers worldwide. You will design and operate systems that ingest billions of metrics per day, detect anomalies in real time, and trigger automated remediation workflows—keeping mission-critical infrastructure running 24/7.
• Architect and maintain multi-region, multi-cloud infrastructure on AWS, Azure, and GCP using Infrastructure-as-Code (Terraform, CloudFormation, Pulumi). You will codify every layer—from VPC topology and IAM policies to Kubernetes clusters and service meshes—ensuring environments are reproducible, version-controlled, and continuously tested.
• Drive Service-Level Objectives (SLOs), Error Budgets, and Error Budget Policies for dozens of microservices. You will instrument golden signals (latency, traffic, errors, saturation) using Prometheus, Thanos, and OpenTelemetry; publish dashboards in Grafana; and negotiate realistic targets with product and customer-success teams.
• Lead incident response as the on-call incident commander, orchestrating war-room bridges, root-cause analyses, and blameless post-mortems. You will refine runbooks, automate repetitive diagnostics, and create chaos-engineering experiments (Litmus, Chaos Mesh) to validate resilience before customers feel pain.
• Optimize data pipelines that process petabytes of telemetry. You will tune Kafka, Pulsar, and Flink clusters; right-size Cassandra, ScyllaDB, and ClickHouse storage; and implement tiered retention strategies that balance cost, compliance, and query performance.
• Champion GitOps and progressive delivery. You will build Argo CD/Flux pipelines, define canary and blue-green deployments with Flagger or Argo Rollouts, and integrate feature-flag systems (LaunchDarkly, Flagsmith) to ship changes safely dozens of times per day.
• Automate toil away. You will write Python, Go, or Rust operators that auto-remediate common failure modes—disk pressure, certificate expiry, memory leaks—and integrate them with ScienceLogic’s own AIOps engine for closed-loop automation.
• Collaborate across product, security, and customer-success teams to translate customer pain into engineering priorities. You will join architecture reviews, threat-model sessions, and customer QBRs to ensure reliability is baked into every feature from day one.
• Mentor junior SREs and DevOps engineers through pair programming, design reviews, and internal guilds. You will cultivate a culture of operational excellence, sharing knowledge via tech talks, brown-bags, and open-source contributions.
• Influence the roadmap for observability, cost optimization, and sustainability. You will pilot new tools (eBPF, OpenCost, Kepler) to reduce MTTR by 30% and cloud spend by 20% within your first year, presenting results at SREcon and KubeCon.

🎯 Requirements

• 6+ years in Site Reliability, DevOps, or Platform Engineering roles supporting 24/7 SaaS products at scale (≥99.9% availability).
• Expert-level proficiency with Kubernetes, container runtimes, and service meshes (Istio, Linkerd, or Consul) in production clusters exceeding 500 nodes.
• Hands-on experience with Infrastructure-as-Code (Terraform, CloudFormation, Pulumi) and configuration management (Ansible, Chef, or Salt).
• Deep understanding of observability stacks: Prometheus, Thanos, Grafana, Alertmanager, OpenTelemetry, and distributed tracing (Jaeger, Zipkin).
• Strong coding skills in Python, Go, or Rust for automation, CLI tooling, and Kubernetes operators.
• Nice-to-have: AWS Solutions Architect or CKA/CKS certification, chaos-engineering expertise, and contributions to open-source CNCF projects.

🏖️ Benefits

• Fully remote-first culture with quarterly in-person summits in exciting global locations.
• Annual $3,000 professional-development stipend for conferences, certifications, and courses.
• Employer-paid medical, dental, and vision coverage: 100% for you and 75% for dependents.
• 20 days PTO plus 12 company holidays and a 4-week paid sabbatical every 4 years.

About ScienceLogic, Inc.

ScienceLogic provides AI-driven IT operations and infrastructure monitoring software for enterprises and service providers. Its platform autonomously discovers, maps, and monitors hybrid cloud, network, storage, and application resources in real time, delivering actionable insights, anomaly detection, and automated remediation workflows. Customers use the technology to reduce outages, optimize performance, and enforce policy across multi-cloud and on-premises environments.
