
Job Overview
Location: United Kingdom (Remote)
Job Type: Full-time
Category: Software Engineer
Date Posted: February 24, 2026
Full Job Description
📋 Description
- Join Grafana Labs, a globally recognized open-source powerhouse, as a Senior Software Engineer specializing in Grafana Databases and Site Reliability Engineering (SRE). With over 20 million users worldwide, Grafana is at the forefront of monitoring diverse systems, from environmental data to critical infrastructure. Our open-source solutions, including Grafana Mimir for metrics, Grafana Loki for logs, and Grafana Tempo for traces, form the backbone of our Grafana LGTM Stack. This stack is offered as a fully managed SaaS product through Grafana Cloud or as a self-managed Grafana Enterprise Stack, serving over 3,000 companies like Bloomberg, JPMorgan Chase, and eBay.
- As a remote-first company, we foster a collaborative, transparent, and trust-driven culture where innovation thrives. We are seeking talented individuals who are passionate about meaningful work and are eager to contribute to a rapidly scaling organization.
- This is a unique opportunity to make a significant impact by enhancing the reliability of our high-value Grafana Cloud database products. You will be instrumental in supporting our most critical customers by ensuring the stability and performance of our Mimir, Loki, Tempo, and Pyroscope databases, which are delivered as a SaaS offering across AWS, GCP, and Azure in all regions.
- The SRE team operates in an embedded model within the Mimir and Loki squads, focusing on delivering exceptional reliability for our highest-SLA customers. You will be a key player at the intersection of customer needs, production systems, and product engineering.
- Your responsibilities will include partnering closely with product engineering squads to understand their challenges and contribute to solutions. You will take ownership of production reliability for complex, high-SLA customer environments, ensuring they consistently meet their Service Level Objectives (SLOs).
- A core aspect of this role involves designing and implementing sophisticated automation to scale our reliability practices. This includes proactively reducing SLO budget burn by identifying and addressing root causes of potential issues, which may involve enhancements to monitoring, automation, self-healing capabilities, and auto-scaling mechanisms.
- You will define and evolve per-tenant SLOs and reliability models, tailoring our approach to meet the specific demands of individual customers. By proactively reducing SLO burn, you will prevent repeat incidents and maintain a high level of service quality.
- This role requires serving as a primary escalation point and participating in an on-call rotation for relevant incidents. We operate a global, remote-first model to ensure healthy on-call coverage, typically aligned with daylight hours, with shared ownership across different regions.
- You will lead customer-impacting incident response efforts, conducting thorough post-incident reviews (PIRs) to extract learnings and implement preventative measures. Contributing to design documents and participating in code reviews will be essential to influence feature design, ensuring production scalability and operability from the outset.
- A significant part of your work will involve building automation to eliminate toil and repetitive tasks, thereby improving team efficiency. You will also focus on improving alert quality and reducing noisy escalations to ensure that alerts are actionable and meaningful.
- We invest heavily in developer productivity, providing access to modern AI coding assistants, including frontier models, with a company-funded usage budget. This allows for rapid iteration, faster prototyping, test generation, refactoring, documentation, and incident follow-ups, always underpinned by strong code review and quality standards.
- You will improve observability for customers within their environments, providing deeper insights into system performance and potential issues. Developing fault-tolerant design patterns will be crucial, ensuring that reliability is considered at every stage of the service lifecycle.
- Collaborating with Engineering Leaders to help define and influence product strategy, roadmaps, and technical designs will be a key aspect of shaping the future of our database products.
- You will teach and mentor others about Site Reliability Engineering principles and communicate best practices to be applied early in the development of new features and functionality.
- You will participate actively in incident response, from investigation and resolution through PIRs and, when necessary, customer communication via bridge calls.
- This role offers a high degree of autonomy and self-direction within a supportive engineering team, encouraging intellectual curiosity, a bias for action, and a default to transparency.
🎯 Requirements
- 6+ years of engineering experience, with at least 3 years specifically in SRE, CRE, or production engineering roles. Formal customer reliability engineering experience is highly preferred.
- Strong experience with Kubernetes in AWS, GCP, or Azure, coupled with familiarity with infrastructure-as-code tooling such as Helm, Terraform, or Jsonnet.
- Proven experience operating multi-tenant systems in a production environment.
- Demonstrated experience in designing, implementing, and managing Service Level Objectives (SLOs).
- Proficiency in at least one programming language, such as Go, Python, or Java.
- Solid understanding of Linux operating system internals, with knowledge of networking, cloud storage, and scaling principles.
- Excellent problem-solving, analytical, and troubleshooting skills.
- Experience in calm, active participation in blame-free incident response, including follow-up actions and writing high-quality Post Incident Reviews (PIRs).
- Ability to reason about performance, scalability, and failure modes in complex distributed systems.
- Comfort working in an autonomous and self-directed engineering team environment.
- Proven ability to partner deeply and effectively with product engineering teams.
- High degree of intellectual curiosity, a default to transparency, a strong bias towards action, and kindness.
🏖️ Benefits
- 100% Remote, Global Culture: Work with a diverse, international team united by a shared purpose and collaborative spirit.
- Scaling Organization: Engage in meaningful work within a high-growth, dynamic environment.
- Transparent Communication: Benefit from open decision-making processes and regular company-wide updates.
- Innovation-Driven Environment: Enjoy autonomy and support to deliver exceptional work and explore new ideas.
- Open Source Roots: Contribute to and be part of a company built on community-driven values.
- Empowered Teams: Thrive in a high-trust, low-ego culture that prioritizes outcomes.
- Career Growth Pathways: Access defined opportunities for professional development and career advancement.
- Approachable Leadership: Interact with transparent, involved, and visible executives.
- Passionate People: Join a team of intelligent, supportive individuals who are deeply committed to their work.
- In-Person Onboarding: Participate in a comprehensive onboarding experience to ensure a successful start.
- Generous Annual Leave: Enjoy 30 days of annual leave per year, including 3 company-wide shutdown days to ensure genuine disconnection. (Compliance with local legislation is assured.)
Skills & Technologies
Python
Java
AWS
Azure
GCP
DevOps
Senior
Remote
About Raintank Inc.
Raintank Inc., operating as Grafana Labs, is the open-source company behind the Grafana observability platform. It develops and maintains Grafana dashboards, Loki for logs, Tempo for traces, Mimir for metrics, and Grafana Cloud services, providing scalable monitoring and analytics for DevOps, SRE, and engineering teams worldwide. Grafana Labs supports on-prem and SaaS deployments with enterprise-grade features and commercial support.

