OpenAI, Inc. logo

Tech Lead, Deployment & Operations — Custom Infrastructure

Job Overview

Location

San Francisco

Job Type

Full-time

Category

Engineering Manager

Date Posted

May 17, 2026

Full Job Description

📋 Description

  • Lead a team responsible for the deployment and operational management of OpenAI’s custom silicon and systems within production data center environments
  • Own the end-to-end journey of new hardware platforms from lab validation through production deployment, operational readiness, and sustained fleet support
  • Define and implement deployment processes, operational playbooks, technical readiness criteria, escalation paths, and reliability practices for custom AI hardware
  • Partner cross-functionally with silicon design, systems engineering, software, infrastructure, networking, data center operations, and supply chain teams to ensure seamless integration and scale
  • Drive technical execution across hardware bring-up, rack and system integration, data center deployment, fleet monitoring, debugging, and incident resolution
  • Remain deeply hands-on in architecture reviews, deployment planning, failure analysis, system-level debugging, and critical operational decision-making
  • Identify gaps in tooling, observability, automation, validation coverage, and operational workflows, and lead initiatives to close them
  • Establish and track key metrics for deployment readiness, system reliability, performance, maintainability, and overall operational health
  • Build and cultivate an engineering culture rooted in ownership, technical rigor, operational excellence, and high-velocity execution
  • Ensure custom AI hardware platforms are deployed, operated, and maintained reliably, repeatably, and safely at massive scale
  • Contribute directly to the architecture and design of future machine learning systems by applying operational insights from real-world hardware deployment
  • Mentor and develop senior technical engineers while maintaining active involvement in technical execution and problem-solving
  • Operate effectively in fast-moving, ambiguous environments where hardware platforms, tooling, and operational models are still being defined
  • Communicate clearly and decisively across technical and operational teams during high-urgency, cross-functional technical incidents
  • Maintain a bias toward ownership by actively engaging in urgent technical issues and leading resolution efforts
  • Build practical, scalable systems, tools, and processes that ensure consistent and reliable hardware operations in production
  • Ensure compliance with data center operational standards, security protocols, and hardware handling procedures
  • Collaborate with external partners and vendors to align on deployment timelines, logistical requirements, and technical specifications
  • Participate in on-call rotations and incident response for critical production hardware failures
  • Contribute to long-term strategic planning for OpenAI’s supercomputing infrastructure and AI-native hardware roadmap

🎯 Requirements

  • 8+ years of engineering experience in hardware systems, infrastructure, data center deployment, production operations, systems engineering, silicon bring-up, or related technical domains
  • Strong technical depth in one or more of: hardware deployment, data center operations, rack-scale systems, silicon bring-up, systems validation, fleet operations, reliability engineering, infrastructure automation, or hardware/software integration
  • Experience bringing complex hardware systems from development or validation into production environments
  • Experience working closely with silicon, systems, software, infrastructure, networking, or data center teams
  • Demonstrated ability to hire, develop, and lead senior technical talent
  • Strong written and verbal communication skills, especially in high-urgency, cross-functional technical environments

🏖️ Benefits

  • Compensation range of $342K - $445K USD
  • Opportunity to work on cutting-edge AI-native silicon and systems at the forefront of machine learning infrastructure
  • Collaborative environment with world-class teams in AI research, software, and hardware engineering
  • Commitment to operational excellence and building reliable, scalable systems at unprecedented scale

Skills & Technologies

DevOps
Senior
Onsite
$342k-445k

Ready to Apply?

You will be redirected to an external site to apply.

OpenAI, Inc. logo
OpenAI, Inc.
Visit Website

About OpenAI, Inc.

OpenAI is a San Francisco-based artificial intelligence research and deployment company founded in 2015. It develops large-scale AI models such as GPT, DALL-E, and Codex, providing cloud APIs and consumer applications like ChatGPT. Originally established as a non-profit, it later created a capped-profit subsidiary to attract capital while maintaining its mission to ensure artificial general intelligence benefits all of humanity.

Get more remote jobs like this

Subscribe to the weekly newsletter for similar remote roles and curated hiring updates.

Newsletter

Weekly remote jobs and featured talent.

No spam. Only curated remote roles and product updates. You can unsubscribe anytime.

Similar Opportunities

Expired
Dubai
Full-time
Expired Apr 28, 2026
Onsite

3 months ago

Apply
Expired
Argentina (Remote)
Full-time
Expired Nov 22, 2025
JavaScript
TypeScript
Java
+4 more

9 months ago

Apply
Expired
Buenos Aires
Full-time
Expired Jun 6, 2026
Python
JavaScript
Java
+4 more

2 months ago

Apply
Expired
Berlin, Germany
Full-time
Expired May 10, 2026
Go
Senior
Remote

3 months ago

Apply