Machine Learning Engineer - Distributed ML Systems

Job Overview

Location: San Francisco
Job Type: Full-time
Category: Machine Learning Engineer
Date Posted: April 1, 2026

Full Job Description

📋 Description

  • Machine Learning Engineer - Distributed ML Systems at Pluralis Research Ltd is a senior technical role focused on building the foundational infrastructure for Protocol Learning, a novel approach to decentralized, community-owned training of frontier AI models. The role is critical to enabling scalable, resilient, and efficient distributed training under real-world constraints such as low bandwidth and high latency, directly advancing the company’s mission to prevent AI monopolization by large corporations.
  • You will design and implement large-scale distributed training systems optimized for heterogeneous hardware operating under consumer-grade internet conditions. This includes developing parallelism strategies (data, tensor, and pipeline parallelism) with custom sharding to minimize communication overhead, and optimizing GPU utilization, memory efficiency, and compute performance across geographically dispersed nodes.
  • You will build robust checkpointing, state synchronization, and recovery mechanisms for long-running, fault-prone training jobs, and implement monitoring and metrics systems to track training progress, model quality, and system bottlenecks in real time.
  • You will architect resilient decentralized networking systems in which nodes can fail, networks can partition, and participants can join or leave dynamically. This includes designing peer-to-peer topologies for coordination; implementing NAT traversal, peer discovery, dynamic routing, and connection lifecycle management; and profiling communication patterns to reduce latency and bandwidth overhead.
  • You will collaborate with a world-class team of ML researchers and engineers from Google, Amazon, Microsoft, and leading startups, contributing to a mission-driven effort backed by Union Square Ventures and other tier-1 investors to create community-trained, community-owned AI models with self-sustaining economics.
  • In this role, you will deepen your expertise in cutting-edge distributed ML systems, gain hands-on experience with distributed training frameworks (FSDP, DeepSpeed, Megatron), and pioneer novel solutions for decentralized AI training, positioning yourself at the forefront of the next generation of resilient, equitable AI infrastructure.

🎯 Requirements

  • Strong experience building and operating distributed systems in production environments
  • Hands-on expertise with distributed training frameworks such as FSDP, DeepSpeed, Megatron, or similar
  • Deep understanding of parallelism strategies, including data, tensor, and pipeline parallelism
  • Expert-level Python proficiency with production experience in concurrency, error handling, retry logic, and clean architecture
  • Strong networking fundamentals: P2P systems, gRPC, routing, NAT traversal, and distributed coordination
  • Experience optimizing GPU workloads, memory management, and large-scale compute efficiency

🏖️ Benefits

  • Equity-heavy compensation with meaningful ownership in a mission-driven company
  • Competitive base salary for senior engineering roles in Australia
  • Visa sponsorship available for exceptional candidates
  • Remote-first work model with optional access to the Melbourne hub
  • Opportunity to work with a world-class team from Google, Amazon, Microsoft, and top startups
  • Backed by Union Square Ventures and other tier-1 investors

Skills & Technologies

Python
gRPC
Remote

About Pluralis Research Ltd

Pluralis Research develops a novel approach to training large AI models called “Protocol Learning.” Instead of traditional centralized or open-source models, their method enables decentralized, multi-participant model training where no single party ever holds a full copy of the model weights. This makes models “unextractable” and supports collaborative ownership, allowing value from model usage to flow back to contributors. They aim to democratize access and innovation in AI, reduce dependency on large tech firms, and create a sustainable, open ecosystem for foundation model development.
