
Job Overview
Location
Melbourne, Australia
Job Type
Full-time
Category
Machine Learning Engineer
Date Posted
February 25, 2026
Full Job Description
📋 Description
- Pluralis Research is pioneering Protocol Learning – a fully decentralized approach to training and deploying AI models. Our mission is to democratize AI, opening it up to individuals rather than concentrating power within well-resourced corporations. By pooling compute from a diverse network of participants and incentivizing their contributions, we are building a genuinely open, collaborative ecosystem for frontier-scale AI development. We are seeking a highly skilled and motivated Machine Learning Training Platform Engineer to architect, build, and scale the foundational infrastructure that powers this decentralized ML training platform.
- In this pivotal role, you will architect and build core systems spanning infrastructure orchestration, distributed compute management, and seamless service integration. Your work will enable continuous experimentation and large-scale model training while ensuring the robustness and scalability of our unique platform. You will have the opportunity to shape the future of decentralized AI infrastructure from the ground up.
- Your responsibilities will include designing and implementing multi-cloud infrastructure management systems: provisioning and orchestrating compute resources across major cloud providers such as AWS, GCP, and Azure using infrastructure-as-code tools like Pulumi or Terraform. You will handle dynamic scaling requirements, keep state synchronized across distributed components, and manage concurrent operations across hundreds of heterogeneous nodes, all while maintaining high availability and efficiency.
- A significant part of the role involves architecting fault-tolerant infrastructure for distributed machine learning training: setting up and managing GPU clusters, integrating with NVIDIA runtimes, implementing robust S3 checkpointing for model and state persistence, streaming large datasets efficiently, and establishing comprehensive health monitoring with resilient retry strategies to keep training uninterrupted.
- You will also build and optimize systems that simulate and handle real-world networking conditions. This is a critical differentiator for our platform: training occurs on consumer nodes and non-co-located infrastructure, not within a traditional data center. You will implement bandwidth shaping, latency injection, and packet loss simulation while adeptly managing dynamic node churn and ensuring efficient data flow across workers with heterogeneous connectivity. This requires a deep understanding of the nuances of decentralized networking.
- The ideal candidate brings 5+ years of proven infrastructure and platform engineering experience, including extensive production use of infrastructure-as-code tools (Pulumi, Terraform, CloudFormation) for multi-cloud deployments, orchestrating complex lifecycles, building self-healing systems, and deploying containerized applications with Docker and Kubernetes (EKS). Experience with GPU workloads and managing heterogeneous clusters at scale is essential.
- Furthermore, you will have a deep understanding of distributed systems and ML infrastructure: distributed training workflows, effective checkpointing strategies, data sharding techniques, model versioning, long-running job orchestration, and decentralized networking concepts such as P2P communication, NAT traversal, and traffic shaping. A keen awareness of real-world bandwidth constraints and their impact on distributed training is also crucial.
- Strong systems programming skills in Python are a must, including asynchronous programming (asyncio), concurrency patterns, robust retry logic, effective use of cloud SDKs, and sophisticated CLI tooling. Hands-on experience with observability practices, SRE principles, monitoring tools like Prometheus and Grafana, performance profiling, and incident response is highly valued.
- We are particularly interested in candidates with startup experience and a strong emphasis on microservices orchestration, or those coming from a big-tech background with relevant experience. A deep understanding of multi-cloud infrastructure and distributed training systems is paramount. You should be a collaborative team player with exceptional attention to detail and a genuine passion for our mission to democratize AI and prevent monopolization of model development and access.
- Pluralis Research is backed by leading investors such as Union Square Ventures and built by a world-class, deeply technical team of ML researchers. We are unapologetically ideological, driven by the belief that Protocol Learning is the only viable path to prevent a few massive corporations from monopolizing AI model development, access, and release, and to avoid significant economic capture. If this vision resonates with you and you are eager to contribute to a truly transformative project, we encourage you to apply.
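To make the fault-tolerant checkpointing mentioned above concrete, here is a minimal local-filesystem sketch, not Pluralis code: `save_checkpoint` is a hypothetical helper illustrating the write-temp-then-rename pattern that guarantees a reader never observes a half-written checkpoint; in production the analogous all-or-nothing property would come from a single S3 PUT.

```python
import json
import os
import tempfile

def save_checkpoint(state, path):
    """Write `state` as JSON atomically: temp file + fsync + rename, so a
    reader never sees a partially written checkpoint even if we crash mid-save."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())   # force bytes to disk before publishing
        os.replace(tmp, path)      # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp)             # clean up the temp file on failure
        raise

# Usage: persist one training step's state and read it back.
path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
save_checkpoint({"step": 100, "shard": 3}, path)
with open(path) as f:
    loaded = json.load(f)
```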
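Bandwidth shaping of the kind described above is often prototyped with a token bucket. A minimal sketch, with a clock value injected for determinism (`TokenBucket` and its parameters are illustrative names, not part of any named library):

```python
import time

class TokenBucket:
    """Toy token-bucket shaper: admits bytes at `rate` bytes/sec, up to `burst`."""

    def __init__(self, rate, burst):
        self.rate = rate           # refill rate, bytes per second
        self.capacity = burst      # maximum burst size, bytes
        self.tokens = burst        # start with a full bucket
        self.last = time.monotonic()

    def consume(self, nbytes, now=None):
        """Return True if `nbytes` may be sent now, else False (caller waits)."""
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

# Usage with an injected clock so the outcome is deterministic.
bucket = TokenBucket(rate=1000, burst=500)    # 1000 bytes/s, 500-byte bursts
t0 = bucket.last
ok_burst = bucket.consume(400, now=t0)        # fits in the initial burst
ok_over = bucket.consume(400, now=t0)         # only 100 tokens remain
ok_refill = bucket.consume(400, now=t0 + 0.5) # 0.5 s refills 500 tokens
```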
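The resilient retry strategies and asyncio skills called out above typically combine exponential backoff with jitter. A minimal sketch (`retry_async` and `flaky` are illustrative names, not an existing API):

```python
import asyncio
import random

async def retry_async(fn, *, attempts=5, base_delay=0.5, max_delay=30.0):
    """Await fn(); on failure, retry with full-jitter exponential backoff."""
    for attempt in range(attempts):
        try:
            return await fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: sleep a random amount up to the current backoff cap.
            cap = min(max_delay, base_delay * 2 ** attempt)
            await asyncio.sleep(random.uniform(0, cap))

# Usage: a call that fails twice before succeeding, e.g. a flaky cloud API.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = asyncio.run(retry_async(flaky, base_delay=0.01))
```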
Skills & Technologies
Python
Node.js
AWS
Azure
GCP
Onsite
About Pluralis Research Ltd
Pluralis Research develops a novel approach to training large AI models called “Protocol Learning.” Instead of traditional centralized or open-source models, their method enables decentralized, multi-participant model training where no single party ever holds a full copy of the model weights. This makes models “unextractable” and supports collaborative ownership, allowing value from model usage to flow back to contributors. They aim to democratize access and innovation in AI, reduce dependency on large tech firms, and create a sustainable, open ecosystem for foundation model development.
