Lambda Inc.

Senior Incident Manager

Job Overview

Location: Remote, USA

Job Type: Full-time

Category: DevOps

Date Posted: June 4, 2026

Full Job Description

📋 Description

  • Lead end-to-end incident response for critical (SEV-1/SEV-2) outages impacting AI infrastructure, GPU clusters, networking, storage, and data center operations.
  • Serve as Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams to ensure rapid resolution.
  • Act as the primary liaison between leadership and cross-functional teams during active incidents and post-incident reviews, providing clear status updates and summaries.
  • Own the incident lifecycle, including triage, escalation, coordination, resolution, and post-incident analysis with documented timelines and action plans.
  • Maintain and improve incident response documentation, operational playbooks, runbooks, and reliability frameworks.
  • Conduct post-incident reviews (PIRs) and root cause analyses to identify systemic reliability gaps and drive corrective actions.
  • Track and report key incident metrics, including MTTR, MTTD, and incident recurrence rates, to measure operational improvement.
  • Participate in an on-call rotation to respond to, lead, and coordinate incidents in real time.
  • Collaborate closely with data center operations, infrastructure engineering, network engineering, platform reliability engineering, and security operations teams during cross-layer outages.
  • Work with hardware and facility vendors to resolve incidents involving physical infrastructure and GPU cluster failures.
  • Improve incident response tooling, escalation paths, and automation by partnering with technical support and engineering teams.
  • Maintain incident dashboards and operational health reports for leadership and engineering stakeholders.
  • Deliver executive-level incident summaries and clear, concise communication during high-pressure situations.
  • Contribute to the development of operational standards and reliability frameworks aligned with SRE and ITIL practices.
  • Support implementation of observability improvements and automation to reduce manual intervention in incident response.
  • Drive alignment across teams during complex incidents spanning multiple infrastructure layers, including cloud, hybrid, and on-prem environments.

🎯 Requirements

  • 8+ years of experience in incident management, site reliability engineering, or infrastructure operations
  • Experience managing incidents in large-scale distributed infrastructure environments
  • Strong understanding of data center operations, GPU compute clusters, networking, storage infrastructure, and cloud or hybrid platforms
  • Proven ability to lead high-pressure incident response situations
  • Experience with incident management frameworks (ITIL, SRE, or equivalent)
  • Excellent communication and stakeholder management skills
  • Experience with incident tracking and monitoring tools such as PagerDuty, ServiceNow, Jira, Datadog, Prometheus, and Grafana

🏖️ Benefits

  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401(k) plan with 2% company match (USA employees)
  • Flexible paid time off plan that we all actually use

Skills & Technologies

Prometheus
Grafana
Datadog
Senior
Remote

About Lambda Inc.

Lambda Inc. provides cloud-based GPU clusters and workstations for artificial-intelligence research and development. The company designs and operates high-performance hardware infrastructure optimized for machine-learning workloads, offering on-demand access to NVIDIA GPUs, pre-configured deep-learning software stacks, and scalable storage. Customers include AI labs, universities, and enterprises training large language and computer-vision models. Founded in 2012, Lambda is headquartered in San Francisco and maintains data centers across North America and Europe.
