CUDA Kernel Engineer (Remote US)

Job Overview

Location: Remote
Job Type: Full-time
Category: Software Engineering
Date Posted: March 4, 2026

Full Job Description

📋 Description

  • Join a rapidly expanding AI startup as a CUDA Kernel Engineer. The company was founded by researchers from MIT CSAIL and is recognized among the Top 10 GenAI companies by GTM Capital.
  • This remote US-based role offers a unique opportunity to architect and refine the foundational GPU performance layer that underpins large-scale, high-throughput AI systems deployed by Fortune 500 clientele.
  • You will be instrumental in driving the efficiency, throughput, and scalability of mission-critical AI solutions, directly impacting their performance on NVIDIA GPUs.
  • The ideal candidate possesses a profound and practical understanding of NVIDIA GPU architecture, including its intricate memory hierarchy, warp-level execution dynamics, and sophisticated profiling workflows.
  • This position is specifically tailored for engineers with direct, hands-on experience in developing and optimizing CUDA kernels from the ground up, rather than those with backgrounds in generic hardware, FPGAs, or non-NVIDIA compute platforms.
  • Your primary responsibility will involve the design, implementation, and meticulous optimization of custom CUDA kernels, with an unwavering focus on maximizing GPU occupancy, achieving peak memory throughput, and enhancing warp execution efficiency.
  • You will leverage advanced profiling tools such as Nsight Compute, Nsight Systems, nvprof, and CUDA-MEMCHECK to rigorously diagnose and address performance bottlenecks.
  • Key areas of analysis and optimization will include mitigating warp divergence, ensuring memory coalescing, reducing register pressure, and minimizing PCIe transfer overhead.
  • A significant aspect of the role involves enhancing GPU memory pipelines, encompassing global, shared, L2, and texture memory, ensuring optimal data access patterns.
  • You will collaborate closely with cross-functional teams, including AI systems engineers, model acceleration specialists, and backend distributed systems developers, to integrate and improve GPU performance across the entire AI stack.
  • Contribute significantly to strategic GPU architecture decisions, the development of robust kernel libraries, and the establishment and dissemination of internal best practices in performance engineering.
  • This role demands a proven track record of building NVIDIA CUDA kernels from scratch, demonstrating an ability to go beyond merely calling existing libraries.
  • Develop and apply advanced kernel optimization techniques, including sophisticated tiling strategies, occupancy tuning, efficient shared memory design, and nuanced warp scheduling.
  • Bring a deep understanding of CUDA's fundamental constructs (threads, warps, blocks, and grids), the GPU memory hierarchy, memory coalescing principles, and the detection, analysis, and mitigation of warp divergence.
  • Bring practical experience diagnosing PCIe bottlenecks and optimizing host-device data transfers through techniques such as pinned memory allocation, stream synchronization, batching, and overlapping asynchronous data transfers with kernel execution.
  • Be proficient in C++, the CUDA runtime APIs, and the GPU debugging and profiling tools essential for high-performance computing.
  • Bonus: experience with multi-GPU configurations or distributed GPU systems using technologies such as NCCL, NVLink, and MIG.
  • A background in GPU acceleration for machine learning frameworks or high-performance computing (HPC) workloads will be highly valued.
  • Familiarity with model inference optimization tools and techniques such as TensorRT, CUDA Graphs, and CUTLASS is advantageous.
  • Exposure to compiler-level optimization strategies or the analysis of PTX/SASS code will be considered a strong plus.
  • Experience within a startup environment or a demonstrated comfort level working in fast-paced, dynamic, and sometimes ambiguous settings is beneficial.
  • This role offers exposure to cutting-edge AI research from the company's MIT CSAIL founders, direct customer impact with Fortune 500 clients, and the opportunity to join a company with a strong track record of successful exits and significant growth potential.
  • You will own critical systems, collaborate with world-class engineers, and tackle GPU/AI performance challenges at an unprecedented scale.

About Pragmatike Soluciones Tecnológicas S.L.

Spanish technology firm founded in 2014, delivering custom software, mobile apps, cloud migration, and data analytics. Combines agile development, AI, and DevOps practices to serve finance, healthcare, retail, and public sectors across Europe and Latin America. Core services include UX/UI design, QA automation, and 24/7 managed support, with ISO 27001-certified processes and multilingual teams in Madrid, Barcelona, and remote hubs.
