
Job Overview
Location: Germany (Remote)
Job Type: Full-time
Category: Software Engineer
Date Posted: February 24, 2026
Full Job Description
📋 Description
- • Join Grafana Labs, a globally recognized open-source leader, as a Senior Software Engineer specializing in Site Reliability Engineering (SRE) for our cutting-edge Grafana Databases.
- • This is a fully remote opportunity, ideal for candidates based in Germany, Spain, the UK, or Sweden, offering the chance to contribute to a company with over 20 million users worldwide.
- • You will play a pivotal role in enhancing the reliability and performance of our high-value Grafana Cloud customer databases, which are built upon Mimir, Loki, Tempo, and Pyroscope.
- • Our SaaS database offerings are deployed across AWS, GCP, and Azure, serving a diverse and demanding global clientele.
- • As an embedded member of the Mimir and Loki product engineering squads, you will operate at the critical intersection of customer needs, production systems, and product development.
- • Your primary focus will be ensuring exceptional reliability for our most critical customers, those with the highest Service Level Agreements (SLAs).
- • You will be instrumental in designing, implementing, and scaling automation to elevate our reliability practices across the board.
- • A key responsibility will be ensuring that the Service Level Objectives (SLOs) defined for our customers are consistently met.
- • You will define, evolve, and meticulously manage per-tenant SLOs and tailored reliability models to meet specific customer requirements.
- • Proactively reducing SLO budget burn will be a continuous effort, aiming to prevent recurring incidents and ensure system stability.
- • You will serve as a primary escalation point and participate in the on-call rotation for critical incidents, ensuring swift and effective resolution.
- • Leading customer-impacting incident response efforts, including thorough post-incident reviews (PIRs), will be a core part of your role.
- • You will actively contribute to the design documentation process and participate in code reviews, upholding high standards of quality and maintainability.
- • Your insights will influence feature design, ensuring that new functionalities are developed with production scalability and operability as paramount considerations.
- • You will build and deploy automation solutions to eliminate repetitive tasks and reduce operational toil.
- • Improving alert quality and minimizing noisy escalations are crucial for maintaining an efficient and responsive SRE function.
- • This role involves a significant on-call component, managed with a global, remote-first approach that keeps coverage healthy and protects work-life balance; on-call coverage typically spans about 12 daytime hours per day.
- • You will collaborate closely with international counterparts to ensure balanced coverage and shared responsibility for system reliability.
- • Grafana Labs heavily invests in developer productivity, providing access to modern AI coding assistants and a company-funded usage budget to accelerate development cycles.
- • We encourage pragmatic AI-assisted development, leveraging tools for faster prototyping, test generation, refactoring, documentation, and incident follow-ups, always balanced with rigorous code review and quality standards.
- • You will have access to frontier AI models, such as GPT-Codex 5/3, Claude Opus 4.6, and Gemini 3 Pro, to enhance your workflow.
- • Regularly engage in 1:1 meetings with your manager and colleagues to foster collaboration and professional development.
- • Review and create SLOs, proactively identifying and implementing improvements to reduce budget burn, which may include enhancements to monitoring, automation, self-healing capabilities, and auto-scaling.
- • Enhance the observability of customer environments, providing deeper insights into system performance and potential issues.
- • Design and implement robust solutions to ensure our environments can reliably scale to meet rapidly increasing customer demands.
- • Develop fault-tolerant design patterns, embedding reliability considerations into every stage of the service lifecycle.
- • Collaborate with Engineering Leaders to help define and influence product strategy, roadmaps, and technical designs, ensuring a strong focus on reliability and scalability.
- • Participate actively in Pull Request (PR) reviews and collaborate with other engineers on their Design Documents.
- • Educate and mentor other team members on Site Reliability Engineering principles and best practices, promoting their early adoption in the development of new features.
- • Participate in Incident Response, including investigation, resolution, PIRs, and necessary customer communication via bridge calls.
- • Contribute to a culture of continuous improvement and knowledge sharing within the SRE and broader engineering teams.
- • This role offers a unique opportunity to shape the future of observability databases at a leading open-source company, impacting millions of users and thousands of businesses globally.
🎯 Requirements
- • 6+ years of overall engineering experience, with a minimum of 3 years specifically in SRE, CRE, or production engineering roles.
- • Proven experience operating multi-tenant systems in a production environment.
- • Strong experience designing, implementing, and managing Service Level Objectives (SLOs).
- • Proficiency with Kubernetes, particularly within AWS, GCP, or Azure cloud environments.
- • Familiarity with infrastructure-as-code tooling such as Helm, Terraform, or Jsonnet.
- • Experience with at least one modern programming language (e.g., Go, Python, Java).
- • Solid understanding of Linux operating system internals, networking, cloud storage, and scaling principles.
- • Excellent problem-solving, analytical, and troubleshooting skills.
- • Demonstrated ability to participate calmly and effectively in blame-free incident response, including follow-up actions and writing high-quality Post Incident Reviews (PIRs).
- • Ability to reason about system performance, scalability, and potential failure modes.
- • Comfort working autonomously within an engineering team that values self-direction and initiative.
- • Proven ability to partner deeply and effectively with product engineering teams.
- • Intellectual curiosity, a default to transparency, a high bias for action, and kindness are highly valued personal attributes.
🏖️ Benefits
- • Competitive salary range in Germany: EUR 97,034 - EUR 116,441, with actual compensation based on experience and skill level.
- • Restricted Stock Units (RSUs) included in all roles, providing ownership and a stake in Grafana Labs' success.
- • 100% Remote, Global Culture: Work with a diverse, international team united by collaboration and shared purpose.
- • Scaling Organization: Engage in meaningful work within a high-growth, dynamic environment.
- • Transparent Communication: Benefit from open decision-making processes and regular company-wide updates.
- • Innovation-Driven Environment: Enjoy autonomy and support to ship great work and explore new ideas.
- • Open Source Roots: Contribute to a company built on community-driven values.
- • Empowered Teams: Experience a high-trust, low-ego culture focused on outcomes.
- • Career Growth Pathways: Access defined opportunities for professional development and career advancement.
- • Approachable Leadership: Interact with transparent, involved, and visible executives.
- • Passionate Colleagues: Join a team of intelligent, supportive individuals dedicated to their work.
- • In-Person Onboarding: Participate in a structured onboarding experience with new hires to learn about the company and its operations.
- • Generous Annual Leave: 30 days of annual leave per year, with 3 designated Grafana Shutdown Days for disconnecting.
- • Access to modern AI coding assistants and a company-funded usage budget for enhanced developer productivity.
Skills & Technologies
Python
Java
AWS
Azure
GCP
DevOps
Senior
Remote
About Raintank Inc.
Raintank Inc., operating as Grafana Labs, is the open-source company behind the Grafana observability platform. It develops and maintains Grafana dashboards, Loki for logs, Tempo for traces, Mimir for metrics, and Grafana Cloud services, providing scalable monitoring and analytics for DevOps, SRE, and engineering teams worldwide. Grafana Labs supports on-prem and SaaS deployments with enterprise-grade features and commercial support.

