
Job Overview
Location
San Francisco
Job Type
Full-time
Category
Software Engineering
Date Posted
March 7, 2026
Full Job Description
📋 Description
• Prime Intellect builds the open superintelligence stack for creating, training, and deploying frontier agentic models. We aggregate and orchestrate global compute resources into a unified control plane, complemented by a full reinforcement learning (RL) post-training stack: environments, secure sandboxes, verifiable evaluations, and an asynchronous RL trainer. Our mission is to enable researchers, startups, and enterprises to run end-to-end reinforcement learning at unprecedented scale, adapting models to real-world tools, workflows, and deployment contexts.
• As a Solutions Architect for GPU Infrastructure, you will be the technical expert who translates complex customer requirements into robust, production-ready systems, enabling the training of the world's most advanced AI models. The company recently raised $15 million in a round led by Founders Fund, with participation from Menlo Ventures and angel investors including Andrej Karpathy, Tri Dao, Dylan Patel, Clem Delangue, and Emad Mostaque.
• You will partner closely with clients to understand their workload requirements and design optimal GPU cluster architectures tailored to their needs. This involves creating detailed technical proposals and performing capacity planning for clusters ranging from 100 to over 10,000 GPUs, ensuring scalability and efficiency.
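To give a flavor of capacity planning at this scale, here is a back-of-envelope sketch using the common ~6·N·D approximation for dense-transformer training FLOPs. The numbers (H100-class peak throughput, 40% model FLOPs utilization) are illustrative assumptions, not the company's methodology:

```python
def gpus_needed(params, tokens, days, flops_per_gpu=989e12, mfu=0.4):
    """Rough GPU count via the ~6*N*D training-FLOPs rule of thumb.

    params: model parameter count, tokens: training tokens,
    flops_per_gpu: peak dense BF16 FLOP/s per GPU (H100-class assumed),
    mfu: assumed model FLOPs utilization.
    """
    total_flops = 6 * params * tokens
    effective = flops_per_gpu * mfu          # sustained FLOP/s per GPU
    seconds = days * 86400
    return total_flops / (effective * seconds)

# e.g. a 70B-parameter model on 2T tokens in 30 days at 40% MFU
print(round(gpus_needed(70e9, 2e12, 30)))   # -> 819
```

In practice a real proposal layers on interconnect topology, failure-rate headroom, and storage bandwidth, but estimates like this anchor the initial cluster sizing conversation.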
• You will develop deployment strategies for a variety of AI workloads, including large language model (LLM) training, inference, and high-performance computing (HPC) applications, and present your architectural recommendations clearly and persuasively to both technical and executive stakeholders.
• On the infrastructure side, you will deploy and configure orchestration systems such as SLURM and Kubernetes for distributed workloads, and implement high-performance networking with InfiniBand, RoCE, and NVLink interconnects to ensure low-latency, high-bandwidth data flow.
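To illustrate the kind of SLURM work this involves, a minimal GPU-partition configuration fragment might look like the following (node names, GPU model, and counts are placeholders, not an actual deployment):

```conf
# slurm.conf (fragment): enable generic-resource scheduling for GPUs
GresTypes=gpu
NodeName=gpu[001-016] Gres=gpu:h100:8 CPUs=96 RealMemory=1024000 State=UNKNOWN
PartitionName=train Nodes=gpu[001-016] Default=YES MaxTime=INFINITE State=UP

# gres.conf (fragment): map GPUs to device files on each node
Name=gpu Type=h100 File=/dev/nvidia[0-7]
```

With this in place, jobs can request GPUs explicitly, e.g. `srun --gres=gpu:h100:8 ...`, and the scheduler tracks per-node GPU allocation.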
• A critical aspect of the role is optimizing GPU utilization, memory management, and inter-node communication. You will also configure parallel filesystems, including Lustre, BeeGFS, and GPFS, for optimal I/O performance on demanding AI workloads, and tune system performance end to end, from kernel parameters to CUDA configurations.
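As a sketch of host-level tuning on a GPU node, the fragment below shows representative knobs; the specific values and device names are examples to be validated against the actual hardware and workload, not recommendations:

```shell
# /etc/sysctl.d/90-hpc.conf: enlarge kernel network buffers for
# high-throughput interconnects
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456

# Enable GPU persistence mode to avoid driver re-init latency between jobs
nvidia-smi -pm 1

# NCCL settings commonly tuned on InfiniBand fabrics (names are real NCCL
# variables; values depend on the HCA and PCIe topology)
export NCCL_IB_HCA=mlx5
export NCCL_NET_GDR_LEVEL=PHB
```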
• For production operations, you will serve as the primary technical escalation point for customer infrastructure issues, diagnosing and resolving complex problems that span the full stack: hardware, drivers, networking, and software. You will implement robust monitoring, alerting, and automated remediation to manage issues proactively.
• The role includes 24/7 on-call support for critical customer deployments, plus creating comprehensive runbooks and documentation to empower customer operations teams and facilitate knowledge transfer.
• You will work directly with clients pushing the boundaries of AI, from startups training foundation models to enterprises deploying massive inference infrastructure, in close collaboration with our engineering team. We are looking for people who are passionate about building reliable, high-performance GPU infrastructure and have a proven track record of large-scale deployments. If this sounds like you, apply and join us in democratizing access to planetary-scale computing.
Skills & Technologies
Python
Node.js
Docker
Kubernetes
Terraform
DevOps
Senior
Onsite
About Prime Intellect, Inc.
San Francisco–based startup building decentralized AI infrastructure that lets researchers pool compute and data to collaboratively train large models. Founded in 2023, the company offers open-source protocols and cloud orchestration tools that aggregate GPUs across providers, coordinate distributed training, and cryptographically verify contributions so participants share ownership and future rewards of the resulting models.



