Lead, NOC & Incident Management

Job Overview

Location

Austin, Texas, USA

Job Type

Full-time

Category

Operations Manager

Date Posted

March 3, 2026

Full Job Description

📋 Description

Fluidstack is at the forefront of building the infrastructure for abundant intelligence, partnering with leading AI labs, governments, and enterprises to accelerate the realization of Artificial General Intelligence (AGI). We are seeking a highly motivated and technically adept Lead, NOC & Incident Management to establish and lead our Network Operations Center (NOC) and incident management functions. This pivotal role will be instrumental in shaping how Fluidstack detects, triages, and responds to operational events across our entire AI infrastructure, encompassing data center facilities, network backbone, and internal platform services. The ideal candidate possesses a unique blend of operational leadership acumen and robust technical capabilities, essential for building a 24/7 monitoring and triage function from the ground up.

You will be responsible for operationalizing our incident management framework, ensuring seamless execution and fostering an operational culture that consistently meets stringent customer Service Level Agreements (SLAs). Your primary objective is to alleviate operational toil for our infrastructure teams, freeing them from tasks like alert monitoring, carrier ticket management, incident bridge setup, and shift coverage gaps, allowing them to focus on critical engineering and reliability initiatives. You will be the guardian of our infrastructure, ensuring continuous oversight, consistent incident handling, and the effective implementation of post-incident learning.

NOC Build & Operations

  • Spearhead the establishment of a cross-functional operations center, defining its structure, processes, and operational standards.
  • Play a key role in selecting and onboarding a Managed Service Provider (MSP) to provide essential Tier 1 coverage.
  • Develop comprehensive staffing models, intricate handoff processes, key performance indicators (KPIs), and rigorous quality standards for the NOC.
  • Own the critical responsibility of ensuring qualified personnel are actively monitoring all alerts, 24/7.

Incident Management Execution

  • Design, implement, and operationalize Fluidstack’s incident management framework, ensuring it is effective and consistently followed.
  • Manage the on-call rotation for Incident Managers, ensuring continuous availability and response readiness.
  • Develop and deliver training programs for engineers on their specific roles within incident management.
  • Lead and orchestrate incident bridges during critical SEV0/SEV1 events, ensuring swift and effective resolution.
  • Guarantee that post-incident reviews are conducted promptly and that all identified action items are tracked to completion.
  • Collaborate closely with the Program Manager to continuously refine the incident management framework based on practical execution and lessons learned.

Operational Readiness

  • Take ownership of the "are we ready?" assessment for every new domain integrated into the NOC’s coverage.
  • Drive the quality assurance of runbooks in collaboration with functional teams, ensuring they are accurate, comprehensive, and actionable.
  • Plan and execute tabletop exercises to simulate incident scenarios and test response capabilities.
  • Coordinate with the Platform team to optimize workflows within incident management tooling, such as incident.io.
  • Oversee the phased onboarding of new infrastructure domains, including Facilities, Network, and Systems, aligning with datacenter launch schedules.

Cross-Functional Orchestration

  • Cultivate strong operational partnerships with Network Operations, Data Center Operations, Systems/Platform, and Security teams.
  • Define crystal-clear escalation criteria for Tier 1 to Tier 2 transitions across all operational domains.
  • Position the NOC as a force multiplier for engineering teams by effectively managing monitoring, triage, vendor ticket management, and incident coordination.

Vendor & Carrier Ticket Lifecycle Management

  • Establish robust processes for the NOC to manage the complete lifecycle of carrier and vendor tickets, from creation and tracking to SLA enforcement and escalation.
  • Collaborate with Network Operations and Data Center Operations to define standardized ticket templates, escalation triggers, and vendor communication protocols.
  • Ensure every ticket is meticulously documented so that none falls through the cracks.

Metrics & Continuous Improvement

  • Define and implement key operational metrics, including Mean Time to Acknowledge (MTTA), Mean Time to Resolve (MTTR), escalation rates, and false positive rates.
  • Establish a regular reporting cadence to track performance against these metrics.
  • Leverage data analytics to identify trends, reduce alert noise, enhance runbook quality, and systematically decrease incident response times.
  • Produce comprehensive monthly operational reports for leadership and customer-facing stakeholders, highlighting performance and areas for improvement.

Skills & Technologies

AWS
Prometheus
Grafana

Seniority

Senior

Work Arrangement

Onsite

Salary

$200k-300k

About FluidStack Inc.

FluidStack Inc. operates a distributed cloud platform that aggregates under-utilized GPUs in data centers and individual machines worldwide, renting them on-demand to AI researchers, startups, and enterprises for training and inference workloads. The company automates deployment, security, and billing, offering prices up to 80% below traditional hyperscalers while providing instant access to high-end NVIDIA A100, H100, and consumer GPUs through a single API and web console. Headquartered in London, FluidStack targets machine-learning engineers who need scalable, low-cost compute without long-term commitments, claiming thousands of active nodes and customers including Fortune 500 enterprises and leading research labs.
