This job has expired

This position was posted on April 1, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

Machine Learning Engineer - ML Training Platform

Pluralis Research Ltd

Job Overview

Location: San Francisco

Job Type: Full-time

Full Job Description

📋 Description

• As a Machine Learning Engineer on the ML Training Platform team at Pluralis Research Ltd, you will play a pivotal role in advancing Protocol Learning—a decentralized approach to AI training that democratizes access to frontier-scale model development by enabling individuals worldwide to contribute compute resources. Your work will directly support the company’s mission to prevent AI monopolization by building the open, collaborative infrastructure that allows heterogeneous, consumer-grade nodes to participate in large-scale model training without centralized control.
• You will own the end-to-end design, implementation, and scaling of the foundational systems that power Pluralis’ decentralized ML training platform, ensuring reliability, security, and efficiency across a globally distributed network of volunteer compute contributors.
• Day to day, you will design and implement multi-cloud infrastructure orchestration using Pulumi and Terraform to provision, manage, and scale compute resources across AWS, GCP, and Azure, including dynamic node scheduling, state synchronization, and handling concurrent operations across hundreds of heterogeneous nodes with varying capabilities and connectivity.
• You will architect fault-tolerant distributed training systems optimized for real-world conditions, including GPU cluster management, NVIDIA runtime integration, S3-based checkpointing, large-scale dataset streaming, and resilient retry mechanisms that gracefully handle node failures, network partitions, and intermittent connectivity.
• You will build network simulation and traffic shaping systems that emulate bandwidth limitations, latency injection, and packet loss to ensure robust data flow across worker nodes operating under consumer-grade internet conditions, while maintaining efficient communication and synchronization in a peer-to-peer, decentralized topology.
• You will develop and maintain observability tooling using Prometheus, Grafana, and cloud SDKs to monitor system health, performance bottlenecks, and job progress, implementing alerting, logging, and tracing to support rapid incident response and continuous system improvement.
• You will collaborate closely with ML researchers to translate experimental protocols into production-ready pipelines, enabling seamless iteration on model architectures, training strategies, and decentralized coordination mechanisms.
• You will contribute to a culture of technical excellence by writing clean, testable, and well-documented Python code leveraging asyncio and concurrency patterns, participating in code reviews, and driving improvements in system reliability, security, and scalability.
• Pluralis Research is a deeply technical, mission-driven team backed by Union Square Ventures and other tier-1 investors, composed of ML researchers and engineers committed to ideological openness in AI development. The company operates at the intersection of cutting-edge systems engineering and decentralized protocols, fostering an environment where innovation is guided by both scientific rigor and a commitment to equitable access to AI technology.
• In this role, you will gain deep expertise in decentralized systems, multi-cloud orchestration, and large-scale ML infrastructure—skills that are increasingly critical as the industry shifts toward more open, resilient, and community-driven AI development. You will have the opportunity to shape the architecture of a novel protocol with potential long-term impact on how AI models are trained, governed, and accessed globally.

🎯 Requirements

• 5+ years of professional experience in infrastructure/platform engineering, distributed systems, or ML platform development
• Production experience with infrastructure-as-code tools (Pulumi, Terraform, or CloudFormation) managing multi-cloud deployments (AWS, GCP, Azure)
• Hands-on experience with Docker, Kubernetes (EKS), GPU workload orchestration, and heterogeneous cluster management at scale
• Strong Python engineering skills including asyncio, concurrency, cloud SDKs, CLI tooling, and experience with observability stacks (Prometheus/Grafana)
• Deep understanding of distributed training workflows, checkpointing, data sharding, model versioning, and long-running job orchestration
• Familiarity with decentralized networking concepts (P2P, NAT traversal, traffic shaping) and real-world bandwidth/latency constraints

🏖️ Benefits

• Opportunity to work on a groundbreaking, ideologically driven project aimed at democratizing AI development and preventing corporate monopolization of frontier models
• Fully remote position with flexible hours, enabling collaboration across time zones while maintaining work-life balance
• Backed by top-tier venture capital (Union Square Ventures), offering stability and resources for ambitious technical execution
• Collaborate with a world-class team of ML researchers and engineers passionate about open, ethical AI innovation
• Engage in meaningful, high-impact work that combines cutting-edge systems engineering with real-world decentralized protocol design
• Continuous learning environment with exposure to novel research in Protocol Learning and decentralized AI training methodologies

Skills & Technologies

Python

Node.js

AWS

Azure

GCP

Onsite


About Pluralis Research Ltd

Pluralis Research develops an approach to training large AI models called “Protocol Learning.” Rather than following the traditional centralized or open-source paths, the method enables decentralized, multi-participant model training in which no single party ever holds a full copy of the model weights. This makes models “unextractable” and supports collaborative ownership, allowing value from model usage to flow back to contributors. The company aims to democratize access and innovation in AI, reduce dependency on large tech firms, and create a sustainable, open ecosystem for foundation model development.
