
Job Overview
Location
Remote
Job Type
Full-time
Category
Data Engineer
Date Posted
February 17, 2026
Full Job Description
📋 Description
- Join ReflectionAI Inc. as a Member of Technical Staff specializing in GPU Infrastructure, a pivotal role in our mission to build and democratize open superintelligence. You will be at the forefront of designing, building, and operating the large-scale GPU infrastructure that powers our cutting-edge pre-training, post-training, and inference workloads. This is a unique opportunity to shape the foundational technology enabling the development of open-weight models for a diverse range of users, from individuals and agents to enterprises and even nation-states.
- As a key member of our talent-dense team, you will architect and implement reliable, high-performance systems for scheduling, orchestration, and observability across a massive cluster of thousands of GPUs. Your work will directly impact our ability to push the boundaries of AI research and development by ensuring seamless and efficient operation of our computational resources.
- A core responsibility will be to optimize cluster utilization, throughput, and cost efficiency. This involves a deep understanding of resource management, workload balancing, and performance tuning to maximize the value derived from our GPU investments. You will be instrumental in ensuring that our infrastructure is not only powerful but also economically sustainable as we scale.
- You will develop and deploy sophisticated tools and automation to streamline distributed training, inference, monitoring, and experiment management. This includes building robust systems that enable our research and engineering teams to iterate rapidly and efficiently, accelerating the journey from initial concept to production-ready AI models.
- Collaboration is central to this role. You will work closely with our world-class research, training, and platform teams, who hail from leading institutions like DeepMind, OpenAI, Google Brain, Meta, Character.AI, and Anthropic. Your expertise will be crucial in translating their needs into robust infrastructure solutions, ensuring they have the resources and tools necessary to achieve breakthrough results.
- This position offers the chance to push the limits of what's possible in AI infrastructure. You will explore and implement innovative solutions involving hardware, networking, and software to accelerate the entire AI development lifecycle. This is an opportunity to define the state of the art in large-scale AI training and inference infrastructure.
- We are looking for individuals with deep systems or infrastructure engineering experience, particularly within high-performance or distributed computing environments. A strong grasp of GPU technologies, including CUDA and NCCL, and familiarity with large-scale training and inference frameworks and libraries such as PyTorch, DeepSpeed, JAX, Megatron-LM, SGLang, and vLLM, is essential.
- Hands-on experience with containerization and orchestration technologies like Kubernetes and cluster management systems such as Slurm is highly desirable. You should also be comfortable with modern observability stacks and performance profiling tools to ensure the health and efficiency of our systems.
- The ideal candidate has high agency: a proactive, self-driven approach to problem-solving. You should thrive in a fast-paced, high-ownership startup environment where initiative and impact are paramount.
- We are particularly excited about candidates motivated by the prospect of building cutting-edge RL infrastructure from the ground up. Defining how frontier-scale training infrastructure is architected and operated is a core part of this role. Your passion for enabling researchers and engineers to build the world's most capable open-weight AI systems will be a driving force.
- At ReflectionAI, we believe that building truly open superintelligence requires a strong foundation. Joining us means contributing to the very core of our mission, helping to define our company's future and advance the frontier of open foundational models. We are committed to providing an environment where you can do the most impactful work of your career, with the assurance that you and your loved ones are well-supported.
🎯 Requirements
- Proven experience in deep systems or infrastructure engineering, specifically within high-performance or distributed computing environments.
- Strong understanding of GPU technologies (CUDA, NCCL) and familiarity with large-scale AI training/inference frameworks (PyTorch, DeepSpeed, JAX, Megatron-LM, SGLang, vLLM).
- Hands-on experience with containerization and orchestration (Kubernetes) and cluster management (Slurm or similar).
- Familiarity with modern observability stacks and performance profiling tools.
- High agency and a proactive, self-driven approach to problem-solving in a fast-paced startup environment.
🏖️ Benefits
- Top-tier compensation, including competitive salary and equity designed to attract and retain global talent.
- Comprehensive health and wellness benefits, including medical, dental, vision, life, and disability insurance.
- Generous paid time off and fully paid parental leave for all new parents, supporting life and family needs.
- Opportunities for professional growth and development in a cutting-edge AI research environment.
- Daily provided lunches and dinners, regular off-sites, and team celebrations to foster connection and collaboration.
About ReflectionAI Inc.
ReflectionAI builds autonomous AI agents for enterprise process automation. The platform lets organizations create, deploy, and manage software agents that observe workflows, make decisions, and act across internal systems. Using reinforcement learning and large language models, agents learn from human guidance and adapt to changing environments. Customers use the technology for customer support triage, IT operations, compliance monitoring, and sales process automation, reducing repetitive manual tasks. The company offers cloud-hosted and on-premise deployments, role-based access controls, audit trails, and integrations with common business applications including Salesforce, ServiceNow, Jira, and Slack.