This job has expired

This position was posted on March 4, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

Senior Manager - Incident Response Engineering

Confluent Inc.

Job Overview

Location

Remote, Ontario, Canada

Job Type

Full-time

Full Job Description

📋 Description

• As the Senior Manager of Incident Response Engineering at Confluent Inc., you will be at the forefront of ensuring the reliability and resilience of a world-class data streaming platform that processes millions of events per second across major cloud providers (AWS, GCP, Azure).
• This is a pivotal leadership role within the Cloud Architecture & Reliability (CAR) organization, responsible for establishing and executing a comprehensive incident response program.
• You will lead a specialized team of approximately five senior incident response engineers who provide 24/7 coverage across global time zones and ensure swift, effective resolution of critical customer-impacting incidents.
• Your primary responsibility will be to act as a player-coach, providing direct incident command for high-severity events while simultaneously building the strategic framework for the entire incident response function.
• This includes owning the people, processes, tooling, and overall outcomes of incident management, fostering a culture of rigor, intentionality, and continuous improvement.
• You will be empowered to operate with conviction and autonomy, making critical decisions during major incidents, setting the pace, and directing the response efforts with clarity and composure.
• Beyond active incident management, you will be instrumental in developing advanced practices, implementing cutting-edge tooling, and leveraging AI-driven capabilities to enhance the speed, accuracy, and effectiveness of every response.
• The ideal candidate views incident response not as mere firefighting, but as a critical engineering discipline that can be systematically improved through strategic leadership and meticulous execution.
• You will be responsible for recruiting, hiring, and developing a high-performing team of senior technical operators, ensuring they possess deep systems intuition and the ability to navigate complex, ambiguous situations.
• Designing and implementing sustainable on-call models, including follow-the-sun coverage, will be crucial to maintaining 24/7 operational readiness.
• You will set and rigorously enforce standards for incident management, encompassing communication cadences, stakeholder engagement, coordination of domain experts, and seamless handoffs.
• A paramount focus will be placed on maintaining a customer-first posture throughout every incident, guaranteeing timely, accurate updates and clear ownership from initial detection to final resolution.
• You will own the end-to-end postmortem process, including facilitation, in-depth root cause analysis, definition of corrective actions, and ensuring their diligent follow-through.
• A key deliverable is managing the Customer Root Cause Analysis (CRCA) program, producing technically accurate, clearly written documents that rebuild customer trust and provide valuable insights.
• This involves coordinating critical technical inputs from various engineering teams and synthesizing complex, potentially ambiguous information into clear, actionable narratives.
• You will champion an AI-centric approach to incident operations, utilizing intelligent tooling to accelerate triage, improve documentation quality, and enhance pattern detection without compromising the integrity of the response process.
• You will collaborate with other sub-functions within CAR, such as observability, supportability, and resiliency, providing feedback that informs platform evolution.
• You will own and continuously evolve the incident management tooling stack, prioritizing solutions that offer agentic assistance and streamline operations.
• A core activity will be analyzing incident data to identify recurring patterns and systemic issues, then feeding those learnings back into engineering practices to prevent recurrence.
• When incident volume permits, you will strategically direct your team's capacity towards improving runbooks, developing automation, and enhancing overall operational hygiene.
• You will serve as a key cross-functional liaison, partnering with Legal, Public Relations, and Customer Success teams to manage customer-facing communications during and after significant incidents.
• You will be responsible for briefing engineering leadership and executives with clarity and composure during active incidents, providing concise and accurate situational updates.
• This role positions you as the go-to expert for engineering teams seeking guidance on improving operational standards and incident response practices.
• Ultimately, you will drive the evolution of Confluent's incident response capabilities, ensuring the platform remains robust, reliable, and trusted by its global customer base.

Skills & Technologies

React

AWS

Azure

GCP

Kafka

Senior

Remote

Degree Required

About Confluent Inc.

Confluent Inc. delivers a cloud-native data streaming platform built around Apache Kafka. It provides real-time data pipelines, stream processing, and event-driven architecture tools for enterprises. The company offers managed services, connectors, and analytics to unify data across on-premises and cloud environments. Companies across industries use Confluent to power fraud detection, IoT, logistics, and customer experiences. Founded by the creators of Kafka, it operates globally with offices in the U.S., Europe, and Asia.
