
Job Overview
Location
Melbourne, Victoria, Australia
Job Type
Full-time
Category
Machine Learning Engineer
Date Posted
February 25, 2026
Full Job Description
📋 Description
- Join Pluralis Research, a company at the forefront of Protocol Learning, a new approach to training foundation models in a decentralized, multi-participant environment. We are building community-trained and community-owned frontier models, with self-sustaining economic models that democratize access and prevent monopolization by a few large corporations.
- We are seeking experienced Machine Learning Engineers with a strong background in distributed systems and large-scale ML training to play a pivotal role in implementing our novel training substrate. This role suits engineers who thrive on complex technical challenges and want to build the future of AI infrastructure.
- As a Machine Learning Engineer on our ML Training Platform team, you will design, develop, and optimize systems that enable distributed ML model training under challenging network conditions, including consumer-grade internet with low bandwidth and high latency.
- Your core responsibilities center on architecting and implementing large-scale distributed training systems, optimizing for heterogeneous hardware environments and maintaining robust performance under network constraints.
- You will work deeply with model parallelism strategies, including data, tensor, and pipeline parallelism, and develop custom sharding techniques to minimize communication overhead between distributed nodes.
- You will optimize GPU utilization, memory efficiency, and compute performance across all distributed nodes, which demands meticulous resource management and performance tuning.
- You will implement and maintain robust checkpointing, state synchronization, and recovery mechanisms, essential for the reliability and continuity of long-running, fault-prone training jobs.
- You will build comprehensive monitoring and metrics systems to track training progress, evaluate model quality, and identify system bottlenecks, enabling proactive problem-solving and continuous improvement.
- Beyond the core training infrastructure, you will architect resilient training systems that withstand node failures, network partitions, and dynamic participant churn, drawing on a deep understanding of fault tolerance and distributed system design.
- You will design and optimize peer-to-peer topologies for efficient, decentralized coordination among non-co-located nodes.
- You will implement networking components such as NAT traversal, peer discovery, dynamic routing, and connection lifecycle management to establish and maintain stable connections in a decentralized network.
- You will profile and optimize communication patterns to reduce latency and bandwidth overhead, keeping training efficient even in challenging multi-participant environments.
- This is an opportunity to work on cutting-edge research and engineering problems with a world-class team comprising members previously at Google, Amazon, Microsoft, and leading startups, backed by top-tier investors such as Union Square Ventures, on a mission to democratize AI development and prevent corporate monopolization.
- If you are driven by the idea of building a more open and equitable future for AI and have the technical depth to tackle these challenges, we encourage you to apply and join Pluralis Research's journey.
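To give a flavor of the parallelism work described above, here is a minimal, illustrative sketch of column-wise tensor parallelism in plain Python: a weight matrix is sharded across hypothetical workers, each computes a partial matrix product independently, and the output shards are concatenated. All function names are illustrative assumptions for this sketch, not Pluralis or framework APIs.

```python
# Toy sketch of column-wise tensor parallelism (illustrative only):
# shard a weight matrix across "workers", compute each partial matmul
# independently, then concatenate the output shards.

def matmul(x, w):
    """Multiply row-vector x (list) by matrix w (list of rows)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]

def shard_columns(w, num_workers):
    """Split matrix w into num_workers equal column blocks."""
    step = len(w[0]) // num_workers
    return [[row[k * step:(k + 1) * step] for row in w]
            for k in range(num_workers)]

def tensor_parallel_matmul(x, w, num_workers):
    shards = shard_columns(w, num_workers)
    partials = [matmul(x, shard) for shard in shards]  # one per "worker"
    out = []
    for p in partials:  # concatenation plays the role of an all-gather
        out.extend(p)
    return out

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
# Sharded computation matches the single-node result: [11, 14, 17, 20]
assert tensor_parallel_matmul(x, w, 2) == matmul(x, w)
```

In a real system each shard would live on a different node and the concatenation step would be a network collective, which is exactly where the communication-overhead optimization described above comes in.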
🎯 Requirements
- 5+ years of experience building and operating distributed systems in production.
- Hands-on expertise with distributed training frameworks such as FSDP, DeepSpeed, Megatron, or similar.
- Deep understanding of model parallelism concepts (data, tensor, and pipeline parallelism).
- Expert-level proficiency in Python, including experience with concurrency, error handling, retry logic, and clean architectural patterns in a production environment.
- Strong networking fundamentals, including P2P systems, gRPC, routing, NAT traversal, and distributed coordination.
- Proven experience optimizing GPU workloads, memory management, and large-scale compute efficiency.
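As one concrete instance of the production-Python patterns listed above (error handling and retry logic), a minimal retry decorator with exponential backoff might look like the sketch below. The names, delays, and exception choices are illustrative assumptions, not a specific Pluralis convention.

```python
import functools
import time

def retry(max_attempts=3, base_delay=0.1, exceptions=(ConnectionError,)):
    """Retry a flaky call with exponential backoff (illustrative sketch)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts - 1:
                        raise  # attempts exhausted: surface the error
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator

calls = {"n": 0}

@retry(max_attempts=4, base_delay=0.0)
def flaky_fetch():
    # Hypothetical peer call that fails twice before succeeding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("peer unreachable")
    return "ok"

assert flaky_fetch() == "ok"
assert calls["n"] == 3
```

In an unreliable peer-to-peer network, patterns like this (often with jitter and a retry budget added) keep transient connection failures from aborting long-running training jobs.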
🏖️ Benefits
- Equity-heavy compensation package offering meaningful ownership in a mission-driven company.
- Competitive base salary commensurate with senior engineering roles in Australia.
- Visa sponsorship available for exceptional candidates requiring relocation.
- Remote-first work environment, with optional access to our Melbourne hub.
- Opportunity to collaborate with a world-class team of ML researchers and engineers with prior experience at leading tech companies and startups.
Skills & Technologies
Python
gRPC
Remote
About Pluralis Research Ltd
Pluralis Research develops a novel approach to training large AI models called “Protocol Learning.” Instead of traditional centralized or open-source models, their method enables decentralized, multi-participant model training where no single party ever holds a full copy of the model weights. This makes models “unextractable” and supports collaborative ownership, allowing value from model usage to flow back to contributors. They aim to democratize access and innovation in AI, reduce dependency on large tech firms, and create a sustainable, open ecosystem for foundation model development.
