Senior Incident Manager

Lambda Inc.

Job Overview

Location: Remote, USA

Job Type: Full-time

Full Job Description

📋 Description

• Lead end-to-end incident response for critical (SEV-1/SEV-2) outages impacting AI infrastructure, GPU clusters, networking, storage, and data center operations.
• Serve as Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams to ensure rapid resolution.
• Act as the primary liaison between leadership and cross-functional teams during active incidents and post-incident reviews, providing clear status updates and summaries.
• Own the incident lifecycle including triage, escalation, coordination, resolution, and post-incident analysis with documented timelines and action plans.
• Maintain and improve incident response documentation, operational playbooks, runbooks, and reliability frameworks.
• Conduct post-incident reviews (PIRs) and root cause analyses to identify systemic reliability gaps and drive corrective actions.
• Track and report key incident metrics including MTTR, MTTD, and incident recurrence rates to measure operational improvement.
• Participate in an on-call rotation to respond to, lead, and coordinate incidents in real time.
• Collaborate closely with data center operations, infrastructure engineering, network engineering, platform reliability engineering, and security operations teams during cross-layer outages.
• Work with hardware and facility vendors to resolve incidents involving physical infrastructure and GPU cluster failures.
• Improve incident response tooling, escalation paths, and automation by partnering with technical support and engineering teams.
• Maintain incident dashboards and operational health reports for leadership and engineering stakeholders.
• Deliver executive-level incident summaries and clear, concise communication during high-pressure situations.
• Contribute to the development of operational standards and reliability frameworks aligned with SRE and ITIL practices.
• Support implementation of observability improvements and automation to reduce manual intervention in incident response.
• Drive alignment across teams during complex incidents spanning multiple infrastructure layers including cloud, hybrid, and on-prem environments.

🎯 Requirements

• 8+ years of experience in incident management, site reliability engineering, or infrastructure operations
• Experience managing incidents in large-scale distributed infrastructure environments
• Strong understanding of data center operations, GPU compute clusters, networking, storage infrastructure, and cloud or hybrid platforms
• Proven ability to lead high-pressure incident response situations
• Experience with incident management frameworks (ITIL, SRE, or equivalent)
• Excellent communication and stakeholder management skills
• Experience with incident tracking and monitoring tools such as PagerDuty, ServiceNow, Jira, Datadog, Prometheus, and Grafana

🏖️ Benefits

• Generous cash & equity compensation
• Health, dental, and vision coverage for you and your dependents
• Wellness and commuter stipends for select roles
• 401(k) plan with 2% company match (USA employees)
• Flexible paid time off plan that we all actually use

Skills & Technologies

Prometheus, Grafana, Datadog, Senior, Remote

About Lambda Inc.

Lambda Inc. provides cloud-based GPU clusters and workstations for artificial-intelligence research and development. The company designs and operates high-performance hardware infrastructure optimized for machine-learning workloads, offering on-demand access to NVIDIA GPUs, pre-configured deep-learning software stacks, and scalable storage. Customers include AI labs, universities, and enterprises training large language and computer-vision models. Founded in 2012, Lambda is headquartered in San Francisco and maintains data centers across North America and Europe.
