
Job Overview
Location
Remote
Job Type
Full-time
Category
Machine Learning Engineer
Date Posted
February 12, 2026
Full Job Description
📋 Description
- Perplexity is at the forefront of AI innovation, and we are seeking a highly skilled and motivated AI Infrastructure Engineer to join our rapidly expanding team. In this role, you will architect, deploy, and optimize the large-scale AI training and inference clusters that power our products. You will collaborate closely with our Inference and Research teams, acting as a bridge between AI development and the infrastructure required to bring it to life. Your expertise will directly impact our ability to train state-of-the-art models and serve them efficiently to our users.
- Your core responsibilities will encompass the design, deployment, and maintenance of scalable Kubernetes clusters. These clusters will be the foundation for our AI model inference and training workloads, demanding a deep understanding of Kubernetes capabilities and best practices. You will ensure these environments are highly available and performant under the demands of modern AI workloads.
- A significant part of your role will involve managing and optimizing our Slurm-based High-Performance Computing (HPC) environments. This includes tuning Slurm for the efficient distributed training of large language models, with attention to resource allocation, job scheduling, and overall cluster health. You will make sure our researchers have the computational power they need, when they need it.
- You will develop and maintain robust APIs and orchestration systems that streamline both our training pipelines and our inference services. This involves creating intuitive interfaces and automated workflows that simplify complex AI infrastructure management.
- You will implement advanced resource scheduling and job management systems for heterogeneous compute environments, ensuring that workloads are efficiently distributed across available resources, whether they are GPUs, CPUs, or specialized AI accelerators. This requires a nuanced understanding of how to maximize utilization and performance.
- You will benchmark system performance, diagnose bottlenecks, and implement targeted improvements across both our training and inference infrastructure. You will proactively identify areas for optimization so that our systems operate efficiently.
- You will build comprehensive monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm, giving the team deep insight into system behavior and enabling rapid issue detection and resolution.
- You will respond swiftly and effectively to system outages, collaborating closely with cross-functional teams to minimize downtime and maintain high uptime for our critical training runs and inference services. Your ability to troubleshoot under pressure will be vital.
- Finally, you will continuously optimize cluster utilization and implement autoscaling strategies so that our infrastructure adapts dynamically to fluctuating workload demands while remaining cost-efficient and scalable.
🎯 Requirements
- Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management.
- Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization.
- Proficiency with Python and C++ programming, with a focus on systems and infrastructure automation.
- Deep understanding of container orchestration and distributed systems architecture.
🏖️ Benefits
- Competitive salary and equity.
- Comprehensive health, dental, and vision insurance.
- Generous paid time off and holidays.
- Opportunity to work with a world-class team on cutting-edge AI technology.
Skills & Technologies
Python
AWS
Kubernetes
Terraform
TensorFlow
Remote
About Perplexity AI, Inc.
Perplexity AI operates an AI-powered conversational search engine that answers queries by synthesizing live web information. The platform combines large language models with real-time retrieval, citing sources for transparency. Founded in 2022, the San Francisco-based company offers free and subscription tiers, mobile apps, and browser extensions, targeting consumers and enterprises seeking accurate, verifiable answers instead of traditional link lists.


