This job has expired

This position was posted on March 10, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

Training, Process Management Engineer

OpenAI, Inc.

Job Overview

Location

London, UK

Job Type

Full-time

Full Job Description

📋 Description

• Join OpenAI's Training Runtime team as a Process Management Engineer and play a pivotal role in shaping the core distributed runtime that powers cutting-edge AI research and development. This is an exceptional opportunity to work on the foundational software that enables the training of massive AI models, from initial research experiments to frontier-scale model runs.
• Our team is dedicated to building robust, scalable, and high-performance components that maximize the productivity of our researchers and the efficiency of our hardware. Our ultimate goal is to accelerate progress towards Artificial General Intelligence (AGI), and your contributions will be instrumental in achieving this ambitious vision.
• Within the Training Runtime organization, you will be part of the Process Management team. This team is responsible for developing the distributed operating system that handles the launching, coordination, and supervision of the vast number of processes that constitute modern AI training workloads. This critical runtime layer sits between high-level training frameworks and the underlying research infrastructure.
• Your work will ensure that training jobs run reliably across massive compute clusters while maintaining performance, stability, and observability. Success in this role is measured on two fronts: system reliability and researcher velocity, so that ideas can scale from initial experiments to production-level training runs.
• As a Process Management Engineer on the Training Runtime team, you will develop the software that connects thousands of computers and presents them as a single, unified system. That system must serve both individual researchers running multiple parallel experiments and the largest training runs, which span hundreds of thousands, and even millions, of machines and accelerators.
• The systems you build must be easy to use and introspectable, fostering a fast debugging and development cycle for researchers. Simultaneously, you will be tasked with relentless optimization for scale, ensuring that performance and stability are maintained across the entire infrastructure, even under extreme load.
• This role offers a chance to work primarily in Rust, a language known for its performance and safety. You will build high-performance asynchronous systems with an emphasis on correctness and scalability, and deepen your systems programming expertise at the forefront of AI.
• The challenges at this scale and at the frontier of AI development are novel and often require innovative solutions. Standard, out-of-the-box approaches may not suffice. The problems you will tackle are highly ambiguous, demanding strong design judgment, creative problem-solving, and proficient execution to advance the state of our infrastructure.
• We are seeking individuals who are passionate about optimizing end-to-end platforms, possess a deep understanding of high-performance architectures, and are driven to maximize both local and distributed performance across our supercomputing infrastructure.
• You will thrive in this role if you are excited by the rapid pace of responding to the dynamic and evolving needs of our training runtime and compute stack. This includes adapting to new hardware, evolving algorithms, and the ever-increasing scale of our AI models.
• You will work across our Python and Rust software stack, contributing to the design, development, and maintenance of the software that orchestrates and monitors machine learning workloads on our largest supercomputers.
• You will be actively involved in profiling and optimizing our software stack to ensure it can support computation orchestration at frontier scale, pushing the boundaries of what's currently possible.
• A key focus will be on improving the reliability, observability, and fault tolerance of long-running jobs, ensuring that our research and training processes are as robust as possible.
• You will be expected to debug complex distributed systems issues that span large clusters, which calls for a systematic, analytical approach to problem-solving.
• Furthermore, you will respond proactively to the changing shapes and needs of our ML systems, adapting our infrastructure to enable our researchers to pursue their most ambitious ideas.
• This role is based in London, UK, and follows a hybrid work model with 3 days in the office per week. We offer relocation assistance to new employees joining our team.

Skills & Technologies

Python

Rust

Linux

Hybrid

About OpenAI, Inc.

OpenAI is a San Francisco-based artificial intelligence research and deployment company founded in 2015. It develops large-scale AI models such as GPT, DALL-E, and Codex, providing cloud APIs and consumer applications like ChatGPT. Originally established as a non-profit, it later created a capped-profit subsidiary to attract capital while maintaining its mission to ensure artificial general intelligence benefits all of humanity.
