ML Infrastructure Engineer

Nebius Group N.V.

Job Overview

Location

Amsterdam, Netherlands; Remote - Europe; Remote - United States

Job Type

Full-time

Full Job Description

• **Pioneering AI Cloud Infrastructure at Nebius**
• This role is at the heart of Nebius’s mission to revolutionize cloud infrastructure for the global AI economy. As an ML Infrastructure Engineer, you will be instrumental in shaping the performance and efficiency of GPU platforms that power cutting-edge machine learning and AI workloads. Your work will directly influence the development of next-generation hardware and software stacks, ensuring that Nebius remains a leader in AI cloud platforms. This is a unique opportunity to work on high-impact projects that bridge the gap between hardware innovation and AI advancement, enabling developers and enterprises to deploy AI solutions without the complexity of building in-house infrastructure.
• **Day-to-Day Responsibilities: Driving Performance and Innovation**
• **GPU Performance Benchmarking and Analysis**: Collaborate closely with hardware and development teams to profile and analyze GPU performance at both the system and kernel levels. Your findings will inform decisions on optimizing GPU platforms so that they meet the demands of modern AI workloads.
• **Cross-Platform GPU Evaluation**: Assess and compare GPU performance across diverse platforms, architectures, and software stacks (e.g., CUDA, ROCm). Your evaluations will help identify the best-performing configurations for training and inference tasks, ensuring Nebius’s infrastructure remains at the forefront of AI innovation.
• **Debugging and Optimization**: Debug and optimize ML workloads to run efficiently on GPU hardware. You will identify and resolve performance bottlenecks so that models train and serve inference efficiently and with low latency.
• **Acceptance Testing for GPU Clusters**: Conduct rigorous acceptance testing for new GPU clusters, verifying that hardware and software meet stringent performance, stability, and compatibility requirements. Your work will ensure that Nebius’s infrastructure is reliable and ready for production-scale AI workloads.
• **Experimentation and Scalability Assessment**: Design and execute experiments across various GPU system configurations to evaluate the impact of interconnect strategies and system-level optimizations on performance and scalability. Your findings will inform the development of future-proof AI infrastructure.
• **Tooling and Visualization Development**: Build tools and dashboards to visualize performance metrics, bottlenecks, and trends. These tools will empower teams across Nebius to make informed decisions about hardware and software optimizations, fostering a culture of data-driven innovation.
• **Contributing to Internal Frameworks**: Play a key role in developing internal tooling, frameworks, and best practices. Your contributions will help standardize performance benchmarking and optimization processes, ensuring consistency and excellence across Nebius’s AI cloud platform.
• **About Nebius: A Global Leader in AI Cloud Infrastructure**
• Nebius is a Nasdaq-listed company headquartered in Amsterdam, with a global footprint spanning R&D hubs across Europe, the UK, North America, and Israel. Our team of 1,500+ includes hundreds of engineers with deep expertise in hardware, software, and AI research and development. Nebius is building a full-stack AI cloud platform that supports developers and enterprises from data and model training through to production deployment. By joining Nebius, you will be part of a bold, fast-moving team that values trust, ownership, and the opportunity to shape the future of AI.
• **What You’ll Learn and Achieve**
• **Mastery of GPU Performance Optimization**: Gain deep expertise in GPU performance benchmarking, debugging, and optimization, positioning yourself as a go-to expert in AI infrastructure.
• **Impact on AI Advancements**: Your work will directly contribute to the development of next-generation GPU platforms, enabling breakthroughs in AI model training and inference. You will have the opportunity to work on projects that push the boundaries of what’s possible in AI.
• **Cross-Functional Collaboration**: Work alongside hardware engineers, software developers, and AI researchers, broadening your technical knowledge and expanding your professional network.
• **Career Growth in a High-Impact Field**: Nebius offers a collaborative and innovative culture with ample opportunities for career growth. You will have the flexibility to explore new challenges and the support to achieve meaningful impact in the AI industry.

Skills & Technologies

Python

AWS

Azure

GCP

Docker

DevOps

Remote

About Nebius Group N.V.

Nebius Group N.V. is a Netherlands-based technology company that operates a full-stack cloud platform designed for AI and machine learning workloads. It provides scalable GPU and CPU infrastructure, managed Kubernetes, object storage, and specialized AI services to enterprises and research organizations worldwide. The company was formed from the restructuring of Yandex N.V. and continues to serve global markets with data centers across Europe and North America.
