Technical Program Manager, Incident Management

Job Overview

Location: Toronto
Job Type: Full-time
Category: Project Manager
Date Posted: February 28, 2026

Full Job Description

📋 Description

  • As a pivotal member of Cohere's Engineering Program Management team, you will spearhead the critical function of incident management, ensuring the resilience and reliability of our cutting-edge AI platforms. This role demands a proactive, independent, and driven individual with proven leadership capabilities and hands-on experience managing complex projects within enterprise-grade software or machine learning solutions. You will be instrumental in shaping Cohere's operational excellence, collaborating with world-class AI researchers and engineers to maintain the highest standards of service availability for our global customer base.
  • You will own the end-to-end lifecycle of all major incidents within Cohere’s environment. This encompasses proactive identification, swift communication, effective escalation, and efficient resolution, ensuring minimal disruption to our services and customers. Your primary focus will be on leading all P1-P4 incidents, meticulously managing them through their entire lifecycle and ensuring adherence to their respective Service Level Agreements (SLAs).
  • A core responsibility will be to deliver clear, timely, and objective updates to a diverse range of stakeholders, including engineering teams, senior leadership, and non-technical departments. You will translate complex technical issues into understandable information, facilitating informed decision-making and maintaining transparency throughout incident response.
  • You will be tasked with optimizing our incident management processes by breaking down intricate challenges into actionable strategies. This involves aligning engineering efforts with the needs and expectations of all relevant stakeholders, fostering a collaborative and solution-oriented environment.
  • Planning and coordination are paramount. You will orchestrate efforts across all engineering teams to guarantee global coverage and uninterrupted service for Cohere’s customers, regardless of their location or time zone.
  • A significant part of your role will involve the strategic development and maintenance of incident playbooks. You will anticipate common or potential incident scenarios and create robust, easy-to-follow guides to streamline response and resolution efforts.
  • You will collaborate closely with engineering managers to enhance our monitoring capabilities and refine our triage processes. The goal is to proactively identify and mitigate potential incidents before they impact our services, thereby reducing the frequency and severity of future disruptions.
  • Close partnership with our Security, IT, and broader Engineering teams is essential. You will ensure that resolutions are prioritized effectively and that mitigation strategies are implemented promptly and thoroughly.
  • Following the resolution of any major incident, you will be responsible for executing and delivering comprehensive post-mortem analyses. These reports will detail the incident, its root cause, the response actions taken, and clearly defined action items to prevent recurrence and improve future incident handling.
  • You will act as a proactive problem-solver, anticipating potential issues, coordinating dependencies between various teams and systems, and meticulously prioritizing impacts on product quality and project timelines. Your ability to foresee challenges and implement preventative measures will be key to maintaining Cohere's reputation for reliability.
  • This role offers a unique opportunity to influence the trajectory of a rapidly growing AI company. You will contribute directly to the robustness of our product, the efficiency of our operations, and the strength of our company culture. Your work will directly impact the success of our mission to scale intelligence for the benefit of humanity.
  • You will be a key facilitator, ensuring that communication flows seamlessly during high-pressure situations. Your ability to remain calm, organized, and decisive under duress will be critical to successful incident resolution.
  • By developing and refining our incident management framework, you will play a crucial role in building trust and confidence with our enterprise clients, assuring them of Cohere's commitment to service excellence and operational integrity.
  • You will leverage your technical acumen to understand the intricacies of our AI models and infrastructure, enabling you to effectively guide technical discussions and solutions during incidents.
  • This position requires a strategic mindset, capable of looking beyond immediate incident resolution to identify systemic improvements and long-term solutions that enhance overall system stability and performance.
  • You will champion best practices in incident management, continuously seeking opportunities for process improvement and knowledge sharing across the organization.

🎯 Requirements

  • 5+ years of experience in a Technical Program Manager or Engineering Program Manager role focused on incident management, with a strong technical background and experience in SaaS/cloud environments.
  • Demonstrated experience in building and scaling incident management programs from inception (0-1) within enterprise-level organizations.
  • Proficiency with incident management tools and platforms such as Incident.io, PagerDuty, ServiceNow, Rootly, or Atlassian.
  • Exceptional communication skills, both written and verbal, with the ability to articulate complex technical issues clearly and concisely to diverse audiences, including executive leadership and non-technical stakeholders.
  • Strong organizational skills, meticulous attention to detail, and a proven ability to track actions, manage dependencies, and facilitate effective team collaboration, particularly in fast-paced environments.

🏖️ Benefits

  • An open, inclusive, and collaborative work environment where your contributions are valued.
  • Opportunity to work at the forefront of AI research and development alongside leading experts in the field.
  • Comprehensive health and dental benefits package, including a dedicated budget for mental health and well-being.
  • Generous parental leave policy with 100% top-up for up to 6 months.
  • Personal enrichment benefits to support your interests in arts, culture, fitness, well-being, quality time, and workspace improvement.
  • Remote-flexible work arrangements with offices in key global locations (Toronto, New York, San Francisco, London, Paris) and a co-working stipend.
  • A substantial 6 weeks (30 working days) of paid vacation annually.

Skills & Technologies

Remote

About Cohere Inc.

Cohere provides large language models and retrieval-augmented generation APIs for enterprise developers to embed conversational AI, search, summarization, and content generation into applications. Founded in 2021 by former Google Brain researchers, the company offers cloud and on-premise deployment, fine-tuning tools, and multilingual support to help organizations automate workflows, improve customer support, and analyze unstructured data while maintaining data privacy and security controls.
