Engineer, Supercomputing & Distributed Systems

Job Overview

Location: San Francisco
Job Type: Full-time
Category: Backend Engineer
Date Posted: April 3, 2026

Full Job Description

Description

The Engineer, Supercomputing & Distributed Systems role at Krea Inc. focuses on building and operating the core infrastructure that powers next-generation AI creative tools. The position enables researchers and engineers to train, deploy, and scale massive AI models across petabyte-scale data pipelines and Kubernetes clusters of 1,000+ GPUs, directly supporting Krea’s mission to make AI intuitive and controllable for creatives.

The role involves designing, implementing, and maintaining distributed systems that handle extreme-scale data processing, GPU orchestration, and low-latency training workloads, replacing or augmenting existing tools like Kafka and Ray with custom-built solutions tailored to modern AI demands.

Day-to-day responsibilities include:
  • Designing and implementing multi-stage data pipelines that transform petabytes of raw data into clean, annotated datasets using tools like DuckDB, Arrow, and PyTorch.
  • Managing and optimizing distributed training and inference workloads across Kubernetes clusters of 1,000+ GPUs, including job scheduling, resource allocation, and scaling across multiple datacenters.
  • Profiling and debugging dataloaders to reach a throughput of thousands of images per second, identifying bottlenecks in I/O, CPU, and GPU utilization.
  • Diagnosing and resolving InfiniBand and RDMA networking issues in large-scale pretraining runs to ensure stable, high-bandwidth communication between nodes.
  • Building fault-tolerant systems for distributed training, including checkpointing, job recovery, and automated failure detection.
  • Developing custom orchestration layers and streaming pipelines to replace or enhance existing frameworks like Kafka and Ray for AI-specific workloads.
  • Collaborating with ML researchers to evolve reinforcement learning (RL) infrastructure and support experimental training paradigms.
  • Implementing and tuning custom dataloaders and data preprocessing pipelines in PyTorch to maximize GPU utilization during training.
  • Contributing to the design of scalable, reliable distributed datastores that support low-latency access to massive multimedia datasets.

The Supercomputing & AI Infra team at Krea is a small, highly skilled group of systems engineers who build foundational infrastructure from the ground up. They work at the intersection of systems programming, distributed computing, and machine learning, prioritizing deep technical understanding over familiarity with specific tools. The team values curiosity, systems thinking, and the ability to reason about complex interactions at extreme scale.

In this role, you will gain deep expertise in large-scale distributed systems, high-performance computing (HPC), and AI infrastructure, working on problems few companies tackle at this scale. You will architect and deploy systems that directly affect the performance and reliability of cutting-edge AI research, with visibility into how infrastructure enables scientific breakthroughs in generative AI.

Skills & Technologies

Python
Express
Kubernetes
Kafka
PyTorch
Onsite

About Krea Inc.

Krea builds AI creative tools, with a mission to make AI intuitive and controllable for creatives. Its Supercomputing & AI Infra team builds the infrastructure used to train, deploy, and scale the models behind those tools.
