Software Engineer, ChatGPT Infrastructure

Job Overview

Location

San Francisco

Job Type

Full-time

Category

Software Engineer

Date Posted

February 25, 2026

Full Job Description

📋 Description

  • As a Software Engineer on the ChatGPT Infrastructure team at OpenAI, you will build and operate the foundational platforms that power one of the world's most rapidly evolving AI systems. ChatGPT's continuous innovation, dynamic product surfaces, and week-to-week shifts in usage patterns demand infrastructure that keeps pace while sustaining performance and reliability under real-world production constraints. This role is not about maintaining the status quo; it's about shaping the future of AI development by creating high-leverage infrastructure that enables fast, safe, and scalable iteration.
  • Your primary responsibility will be to design, develop, and evolve the shared systems, data paths, rollout mechanisms, and reliability guardrails that empower teams building ChatGPT at scale. You will translate messy real-world operational challenges into elegant, reusable solutions. This involves defining clear interfaces, developing core abstractions, and creating intuitive tooling that turns hard-won operational lessons into default best practices. The impact of your work will be directly visible in reduced friction for developers, fewer regressions, better system performance, and systems that scale gracefully to meet the growing demands of our user base.
  • You will have the opportunity to contribute to a variety of critical areas within the ChatGPT ecosystem. This might include building platform foundations and frameworks, such as core libraries, robust service frameworks, and shared components that standardize how systems are built, integrated, and evolved across the organization. You could also focus on scalability and performance primitives, developing patterns and infrastructure designed to minimize tail latency, maximize throughput, and keep costs predictable as demand surges.
  • Furthermore, you will play a key role in implementing reliability guardrails: mechanisms engineered to prevent outages by design. This includes implementing and refining techniques like rate limiting, load shedding, dependency isolation, backpressure management, and safe fallback strategies, all aimed at making regressions difficult to introduce. A significant part of the role involves enhancing developer productivity through the creation of “golden paths” or “paved roads.” These are streamlined, self-serve workflows for common tasks like data access, service integration, and request lifecycle management, ensuring these operations are fast, secure, and user-friendly.
  • You will also contribute to advanced observability and debugging systems, developing instrumentation, metrics models, and investigative tooling that turn vague reports like “it’s slow” into precise, actionable diagnoses. Safe change management is another crucial aspect: you'll work on deployment and rollout systems that support rapid iteration with high confidence, incorporating progressive delivery, automated verification, and swift rollback strategies. Finally, you will work on interface and contract design across various boundaries, crafting clean APIs and stable contracts that minimize coupling and allow independent evolution within our complex, interconnected ecosystem.
  • The role demands owning outcomes end-to-end, from initial design and implementation through rollout and long-term operational maturity. You will partner closely with engineering and product teams to identify systemic pain points and transform them into robust, reusable solutions that benefit the entire organization. By building and operating these critical infrastructure platforms, you will directly multiply the effectiveness of the teams developing ChatGPT's user experiences, enabling them to innovate faster and more reliably.

🎯 Requirements

  • Experience building and operating large-scale distributed systems in production, demonstrating proficiency in handling high throughput, concurrency, and complex failure scenarios.
  • Strong foundational knowledge in systems design, encompassing areas such as caching strategies, consistency models, queueing/backpressure mechanisms, and resilient dependency management.
  • Proven ability to analyze and reason about system performance, including latency distributions, tail behavior, and bottlenecks, and to translate these insights into actionable engineering tasks.
  • A track record of building platforms or shared infrastructure that demonstrably improves the velocity, correctness, and scalability of other engineering teams.
  • Excellent communication and collaboration skills, with a demonstrated ability to align stakeholders on interfaces, navigate complex technical tradeoffs, and drive cross-team execution effectively.

🏖️ Benefits

  • Competitive salary and equity compensation.
  • Comprehensive health, dental, and vision insurance.
  • Generous paid time off and holidays.
  • Opportunities for professional development and continuous learning.
  • A collaborative and innovative work environment at the forefront of AI research and development.

Skills & Technologies

DevOps
Onsite

About OpenAI, Inc.

OpenAI is a San Francisco-based artificial intelligence research and deployment company founded in 2015. It develops large-scale AI models such as GPT, DALL-E, and Codex, providing cloud APIs and consumer applications like ChatGPT. Originally established as a non-profit, it later created a capped-profit subsidiary to attract capital while maintaining its mission to ensure artificial general intelligence benefits all of humanity.
