
Job Overview
Location: San Francisco
Job Type: Full-time
Category: Backend Engineer
Date Posted: April 3, 2026
Full Job Description
📋 Description
- Engineer, Supercomputing & Distributed Systems at Krea Inc. is a pivotal role focused on building and operating the core infrastructure that powers next-generation AI creative tools. This position is critical to enabling researchers and engineers to train, deploy, and scale massive AI models across petabyte-scale data pipelines and 1000+ GPU Kubernetes clusters, directly supporting Krea’s mission to make AI intuitive and controllable for creatives.
- The role involves designing, implementing, and maintaining distributed systems that handle extreme-scale data processing, GPU orchestration, and low-latency training workloads, replacing or augmenting existing tools like Kafka and Ray with custom-built solutions tailored for modern AI demands.
- Day-to-day responsibilities include:
  - Designing and implementing multi-stage data pipelines to transform petabytes of raw data into clean, annotated datasets using tools like DuckDB, Arrow, and PyTorch.
  - Managing and optimizing distributed training and inference workloads across 1000+ GPU Kubernetes clusters, including job scheduling, resource allocation, and scaling across multiple datacenters.
  - Profiling and debugging dataloaders to achieve throughput of thousands of images per second, identifying bottlenecks in I/O, CPU, and GPU utilization.
  - Diagnosing and resolving InfiniBand and RDMA networking issues in large-scale pretraining runs to ensure stable, high-bandwidth communication between nodes.
  - Building fault-tolerant systems for distributed training, including checkpointing, job recovery, and automated failure detection mechanisms.
  - Developing custom orchestration layers and streaming pipelines to replace or enhance existing frameworks like Kafka and Ray for AI-specific workloads.
  - Collaborating with ML researchers to evolve reinforcement learning (RL) infrastructure and support experimental training paradigms.
  - Implementing and tuning custom dataloaders and data preprocessing pipelines in PyTorch to maximize GPU utilization during training.
  - Contributing to the design of scalable, reliable distributed datastores that support low-latency access to massive multimedia datasets.
- The Supercomputing & AI Infra team at Krea is a small, highly skilled group of systems engineers who build foundational infrastructure from the ground up. They operate at the intersection of systems programming, distributed computing, and machine learning, prioritizing deep technical understanding over specific tool familiarity. The team values curiosity, systems thinking, and the ability to reason about complex interactions under extreme scale.
- In this role, you will gain deep expertise in large-scale distributed systems, high-performance computing (HPC), and AI infrastructure, working on problems few companies tackle at this scale. You will have the opportunity to architect and deploy systems that directly impact the performance and reliability of cutting-edge AI research, with visibility into how infrastructure enables scientific breakthroughs in generative AI.
About Krea Inc.
Krea is a company focused on revolutionizing the way businesses manage and leverage their data. They offer a comprehensive platform designed to streamline data operations, enhance data quality, and unlock actionable insights. Their solution caters to a wide range of industries, enabling organizations to make data-driven decisions more effectively. Krea's technology aims to simplify complex data challenges, providing tools for data integration, analysis, and visualization. By empowering businesses with better data management capabilities, Krea helps them improve efficiency, reduce costs, and gain a competitive edge in today's data-intensive market.