Site Reliability Engineering Manager

Job Overview

Location: Barcelona, Spain
Job Type: Full-time
Category: Software Engineering
Date Posted: March 5, 2026

Full Job Description

📋 Description

  • As a Site Reliability Engineering Manager at Okta, you will champion and advance the principles of reliability for our Auth0 product. The role involves close collaboration with product engineers, quality engineers, platform engineers, and architecture teams to ensure the operational integrity of our production systems. Your primary mission will be not only to maintain current system uptime but also to set and achieve ambitious long-term goals for performance, reliability, and scalability, which are crucial for supporting Auth0's growth.
  • You will be a key contributor to Auth0's commitment to providing customers with uninterrupted access to their business-critical enterprise and consumer applications. This is a hands-on leadership position in which you will actively participate in operating, troubleshooting, and scaling our production environments. This includes responding to critical monitoring alerts and managing incidents as part of a global, 24/7 on-call rotation. Your work is essential to meeting the demands of increased traffic and user growth so that customers consistently experience a reliable product.
  • You will drive the technical direction of the Site Reliability Engineering team, working with SRE leadership to translate the organizational vision into a concrete, actionable technical roadmap. This involves strategic planning and execution to ensure the team's efforts align with Okta's broader objectives.
  • You will participate in a global on-call rotation that uses a follow-the-sun model on weekdays and a shared, lower-frequency rotation on weekends. This ensures prompt remediation of incidents affecting our critical systems, minimizing downtime and customer impact.
  • Lead complex, cross-functional initiatives that demand strong partnerships with internal platform and product teams and, potentially, external stakeholders. Your ability to navigate and influence across different groups will be key to success.
  • Use existing monitoring tools and systems to proactively identify potential issues, troubleshoot them effectively, and escalate to the relevant service teams when necessary. This requires a deep understanding of our observability stack.
  • Implement strategic changes and enhancements to strengthen infrastructure resilience, improve monitoring capabilities, and refine alerting mechanisms. This proactive approach is vital for preventing future incidents.
  • Develop and continuously improve SRE tools and processes. The goal is to enhance software delivery pipelines, deepen observability, raise reliability standards, and boost operational efficiency across the team and the product.
  • Optimize existing systems by identifying and eliminating toil through simplification and automation. This focus on efficiency frees up engineering time for more strategic initiatives.
  • Define, document, and champion best practices and policies related to reliability across the engineering organization. Your advocacy will help embed a culture of reliability throughout the product development lifecycle.
  • Serve as a senior technical expert representing SRE in architectural reviews and strategic planning sessions. You will ensure that reliability is a foundational consideration in all significant engineering work, from initial design to deployment.
  • Mentor and develop other SREs through pair programming, design discussions, and thorough code reviews. Your guidance will be crucial in raising the technical capabilities and expertise of the entire team.
  • Foster a culture of continuous learning and improvement within the team, encouraging experimentation and the adoption of new technologies and methodologies that enhance reliability and performance.
  • Collaborate with product management and engineering leadership to define Service Level Objectives (SLOs) and Service Level Indicators (SLIs) that accurately reflect customer expectations and system health.
  • Contribute to the development and maintenance of disaster recovery and business continuity plans, ensuring Okta's services can withstand and recover from major disruptions.
  • Analyze system performance data to identify bottlenecks and areas for optimization, driving data-informed decisions for system improvements.
  • Participate in post-incident reviews (PIRs) to identify root causes, document lessons learned, and implement preventative measures, ensuring a blameless and constructive approach to incident management.

Skills & Technologies

Python
Go
AWS
Azure
Docker
Remote

About Okta, Inc.

Okta provides cloud-based identity and access management software that enables organizations to securely connect employees, partners, and customers to the right technologies. Its platform offers single sign-on, multi-factor authentication, lifecycle management, API access control, and analytics to manage user identities across applications, devices, and networks. The company serves enterprises, government agencies, and small to medium-sized businesses, helping them improve security, compliance, and user experience while reducing IT complexity and support costs.
