
Job Overview
Location
Sweden (Remote)
Job Type
Full-time
Category
Software Engineer
Date Posted
February 24, 2026
Full Job Description
📋 Description
- • Join Grafana Labs, a globally recognized open-source leader, as a Senior Software Engineer specializing in Grafana Databases and Site Reliability Engineering (SRE).
- • This is a fully remote role, open to candidates based in Sweden, Germany, Spain, or the UK.
- • You will play a pivotal role in enhancing the reliability of our high-value Grafana Cloud customer databases, which are built upon cutting-edge technologies like Mimir, Loki, Tempo, and Pyroscope.
- • Our SaaS offering is delivered across AWS, GCP, and Azure, serving a diverse and demanding customer base.
- • As an embedded member of the Mimir and Loki squads, you will be at the forefront of ensuring our database products consistently meet exceptional reliability standards for our highest-SLA customers.
- • This role requires a unique blend of customer advocacy, deep production systems understanding, and proactive product engineering.
- • You will forge strong partnerships with product engineering teams, working collaboratively to embed reliability principles from the ground up.
- • Take ownership of production reliability for complex, high-SLA customer environments, ensuring their critical operations are uninterrupted.
- • Design, implement, and champion automation strategies to scale our reliability practices, reducing manual intervention and increasing efficiency.
- • You will be instrumental in ensuring our customers consistently achieve and exceed their Service Level Objective (SLO) targets.
- • Define, evolve, and meticulously manage per-tenant SLOs and sophisticated reliability models tailored to individual customer needs.
- • Proactively monitor and manage SLO budget burn, implementing strategies to prevent overruns and maintain service integrity.
- • Serve as a primary escalation point and participate in the on-call rotation for critical incidents, ensuring swift and effective resolution.
- • Lead customer-impacting incident response efforts, from initial detection through to resolution, and conduct thorough post-incident reviews to extract valuable lessons.
- • Contribute significantly to the design documentation process and actively participate in code reviews, upholding the highest standards of quality and reliability.
- • Influence feature design and product roadmaps, ensuring that scalability and operability are core considerations from the earliest stages of development.
- • Build robust automation solutions to eliminate toil and repetitive tasks, freeing up valuable engineering time for more strategic initiatives.
- • Enhance alert quality across our systems, reducing noise and ensuring that critical alerts are actionable and efficiently addressed.
- • Participate in incident response, including investigation, resolution, post-incident analysis, and clear communication with customers during bridge calls when necessary.
- • Develop fault-tolerant design patterns, embedding reliability considerations throughout the entire service lifecycle.
- • Collaborate with Engineering Leaders to help define and influence product strategy, roadmaps, and technical designs, ensuring a shared vision for reliability.
- • Teach and mentor other engineers on Site Reliability Engineering principles and best practices, fostering a culture of reliability across the organization.
- • Leverage modern AI coding assistants, backed by a company-funded budget, to accelerate development, prototyping, test generation, refactoring, and documentation, while always adhering to strong code review and quality standards.
- • Gain access to frontier AI models to enhance your daily workflow and drive innovation.
- • Work within a remote-first, global team that values autonomy, transparency, and a bias for action.
- • Contribute to a company that is scaling rapidly, offering opportunities to tackle meaningful challenges in a dynamic environment.
- • Be part of an open-source powerhouse with a strong community-driven culture.
- • Enjoy a high-trust, low-ego environment where outcomes are prioritized.
- • Benefit from defined career growth pathways and approachable leadership.
- • Engage with passionate, supportive colleagues who are dedicated to their work.
- • Participate in in-person onboarding to connect with new colleagues and learn about Grafana Labs.
- • Experience a healthy work-life balance with a global annual leave policy and dedicated shutdown days.
🎯 Requirements
- • 6+ years of engineering experience, with a minimum of 3 years specifically in Site Reliability Engineering (SRE), Customer Reliability Engineering (CRE), or production engineering.
- • Proven experience operating multi-tenant systems in a production environment.
- • Strong proficiency with Kubernetes on AWS, GCP, or Azure, coupled with experience in infrastructure-as-code tooling such as Helm, Terraform, or Jsonnet.
- • Demonstrated experience in designing, implementing, and managing Service Level Objectives (SLOs).
- • Proficiency in at least one programming language, such as Go, Python, or Java.
- • Solid understanding of Linux operating system internals, networking, cloud storage, and scaling principles.
- • Excellent problem-solving and troubleshooting capabilities, with a methodical approach to identifying and resolving complex issues.
- • Experience in calmly and actively participating in blame-free incident response, including thorough follow-up on action items and writing high-quality Post Incident Reviews (PIRs).
- • Ability to reason effectively about performance, scalability, and potential failure modes in complex systems.
- • Comfort and effectiveness working within an engineering team that encourages a strong sense of autonomy and self-direction.
- • Demonstrated ability to partner deeply and effectively with product engineering teams.
- • A mindset that values intellectual curiosity, defaults to transparency, possesses a high bias towards action, and embodies kindness.
🏖️ Benefits
- • 100% Remote, Global Culture: Work with a diverse, international team united by a collaborative spirit.
- • Scaling Organization: Contribute to meaningful work in a high-growth, dynamic environment.
- • Transparent Communication: Benefit from open decision-making and regular company-wide updates.
- • Innovation-Driven Environment: Enjoy autonomy and support to pursue new ideas and ship great work.
- • Open Source Roots: Be part of a company built on community-driven values.
- • Empowered Teams: Thrive in a high-trust, low-ego culture focused on outcomes.
- • Career Growth Pathways: Access defined opportunities for professional development and advancement.
- • Approachable Leadership: Work with transparent and visible executives.
- • Passionate People: Join a team of smart, supportive, and dedicated individuals.
- • In-Person Onboarding: Connect with new colleagues and learn about Grafana Labs from day one.
- • Generous Annual Leave: Enjoy 30 days of annual leave, including 3 dedicated Grafana Shutdown Days to ensure you can truly disconnect.
Skills & Technologies
Python
Java
AWS
Azure
GCP
DevOps
Senior
Remote
About Raintank Inc.
Raintank Inc., operating as Grafana Labs, is the open-source company behind the Grafana observability platform. It develops and maintains Grafana dashboards, Loki for logs, Tempo for traces, Mimir for metrics, and Grafana Cloud services, providing scalable monitoring and analytics for DevOps, SRE, and engineering teams worldwide. Grafana Labs supports on-prem and SaaS deployments with enterprise-grade features and commercial support.

