This job has expired

This position was posted on February 26, 2026 and is likely no longer accepting applications.

Performance Engineer - AI Infrastructure

Andromeda Technologies Inc.

Job Overview

Location: Remote

Job Type: Full-time

Full Job Description

📋 Description

• Join Andromeda Technologies Inc., a pioneering company founded by Nat Friedman and Daniel Gross, dedicated to democratizing access to scaled AI infrastructure for early-stage startups. We are building the essential systems, network, and orchestration layer that makes global AI infrastructure more accessible, working with leading AI labs, data centers, and cloud providers to deliver compute where and when it's needed most.
• Our mission is to route training and inference jobs across a global supply chain, unlocking unparalleled flexibility and efficiency in the rapidly expanding AI market. Our long-term vision is to establish the liquidity layer for global AI compute, and we are actively seeking the brightest minds in AI infrastructure, research, and engineering to join us on this journey.
• As a Performance Engineer on our Growth team, your primary product will be the efficiency and throughput of our massive-scale AI clusters. In a field where optimization can translate to millions of dollars in value and weeks of saved research time for our customers, your role is critical. You will operate at the crucial intersection of systems engineering and cutting-edge research, meticulously profiling end-to-end training runs to pinpoint and eliminate bottlenecks across compute, communication, and storage.
• **Profile and Optimize:** Conduct comprehensive, end-to-end profiling of diverse AI training workloads. This involves analyzing GPU kernels, optimizing NCCL (NVIDIA Collective Communications Library) communication patterns, and resolving storage I/O limitations that hinder performance. Your analysis will be the bedrock for performance improvements.
• **System Refinement:** Collaborate closely with our talented systems engineering team to enhance the efficiency of our job schedulers, boost the performance of collective communication operations, and fine-tune kernel execution for maximum throughput. You will be instrumental in shaping the core performance characteristics of our infrastructure.
• **Observability and Monitoring:** Develop, implement, and maintain high-fidelity tooling and dashboards to provide real-time monitoring and visualization of key performance indicators such as GPU utilization as measured by model FLOPs utilization (MFU), overall throughput, and cluster uptime. This ensures we have a clear, data-driven understanding of our system's health and performance.
• **Process Design and Improvement:** Design and implement robust technical processes, including structured postmortem reviews for incidents and efficient incident response protocols. These processes are vital for enabling the team to operate effectively, learn from challenges, and proactively prevent the recurrence of performance regressions.
• **Cross-functional Collaboration:** Work hand-in-hand with research scientists, ML engineers, and other infrastructure teams to understand workload characteristics and translate performance findings into actionable engineering tasks. Your insights will directly influence the development roadmap and feature prioritization.
• **Performance Benchmarking:** Establish and maintain rigorous benchmarking methodologies to quantify the impact of optimizations and track performance trends over time. This data-driven approach ensures accountability and continuous improvement.
• **Root Cause Analysis:** Investigate and diagnose complex performance issues that may span multiple layers of the stack, from application-level code to the underlying hardware and network fabric. Develop a deep understanding of the interdependencies within our distributed systems.
• **Tooling Development:** Contribute to the development of internal tools and frameworks that automate performance analysis, testing, and reporting, thereby increasing the team's efficiency and scalability.
• **Knowledge Sharing:** Document findings, best practices, and optimization techniques, and share this knowledge across the engineering organization to foster a culture of performance excellence.
• **Customer Focus:** Understand the performance needs and challenges of our AI startup customers and advocate for improvements that directly enhance their training and inference efficiency and reduce their operational costs.
• **Scalability Engineering:** Contribute to the design and implementation of performance strategies that ensure our AI infrastructure can scale seamlessly to meet the growing demands of our customer base and the evolving landscape of AI workloads.
• **Innovation:** Stay abreast of the latest advancements in AI hardware, distributed systems, and performance analysis techniques, and proactively explore their application to our platform.

Skills & Technologies

Python

Rust

Kubernetes

Linux

TensorFlow

DevOps

Remote

Degree Required

About Andromeda Technologies Inc.

Andromeda is a technology company focused on developing advanced AI solutions for the space industry. Their core business revolves around creating sophisticated software and hardware that enhances space exploration, satellite operations, and data analysis. Andromeda's platform leverages machine learning and computer vision to automate complex tasks, improve mission efficiency, and provide actionable insights from vast amounts of space-derived data. They aim to be a leader in the burgeoning space tech sector, offering innovative tools that empower researchers, commercial entities, and government agencies to better understand and utilize the space environment. Their work supports a range of applications from Earth observation to deep space missions.
