This job has expired

This position was posted on March 28, 2026 and is likely no longer accepting applications. We've kept it here for historical reference. Check out the similar jobs below!

Software Engineer, Workload Enablement

OpenAI, Inc.

Job Overview

Location

San Francisco

Job Type

Full-time

Full Job Description

📋 Description

• As a Software Engineer on the Workload Enablement team within OpenAI’s Scaling organization, you will play a critical role in ensuring that cutting-edge AI workloads run reliably and efficiently on emerging hardware platforms, directly enabling the deployment of next-generation models.
• You will be responsible for porting, validating, and optimizing key inference and training workloads on new and early-access systems, building comprehensive test harnesses and stress benchmarks that reflect real-world end-to-end behavior, and diagnosing performance bottlenecks across compute, memory, networking, and storage subsystems.
• The Scaling team designs and delivers the architectural and engineering foundation for OpenAI’s AI infrastructure, spanning system software, high-speed networking, platform architecture, fleet monitoring, and performance optimization to support the training and deployment of frontier models.
• You will collaborate closely with systems bring-up engineers, vendor partners, and internal ML teams to ensure platforms are not only performant but also operationally robust, scalable, and observable—integrating with Kubernetes, telemetry systems, and failure triage workflows.
• In this role, you will deepen your expertise in high-performance computing, distributed systems, and AI infrastructure, gaining hands-on experience with the latest accelerators, interconnect technologies (like NVLink and RDMA), and software stacks while contributing to the reliability and scalability of OpenAI’s AI platform.

🎯 Requirements

• BS in Computer Science, Electrical Engineering, or equivalent practical experience
• 5+ years of experience in ML systems, performance engineering, distributed systems, or HPC
• Strong hands-on experience with PyTorch and modern LLM training/inference stacks, large-scale distributed training concepts (data/model/pipeline parallelism), and RDMA-based communication libraries (NCCL or RCCL)

🏖️ Benefits

• Opportunity to work on cutting-edge AI infrastructure at the forefront of technological innovation
• Collaborative, mission-driven culture focused on safe and beneficial AI development
• Access to advanced hardware and software tools for performance analysis and system optimization

Skills & Technologies

Python

Kubernetes

PyTorch

Onsite

Ready to Apply?

Apply Externally

You will be redirected to an external site to apply.

OpenAI, Inc.

Visit Website

About OpenAI, Inc.

OpenAI is a San Francisco-based artificial intelligence research and deployment company founded in 2015. It develops large-scale AI models such as GPT, DALL-E, and Codex, providing cloud APIs and consumer applications like ChatGPT. Originally established as a non-profit, it later created a capped-profit subsidiary to attract capital while maintaining its mission to ensure artificial general intelligence benefits all of humanity.

View Company Profile

Get more remote jobs like this

Subscribe to the weekly newsletter for similar remote roles and curated hiring updates.

Weekly remote jobs and featured talent.

No spam. Only curated remote roles and product updates. You can unsubscribe anytime.