
Job Overview
Location
Remote
Job Type
Full-time
Category
Software Engineering
Date Posted
March 4, 2026
Full Job Description
📋 Description
- Join a rapidly growing AI startup, recognized among the Top 10 GenAI companies by GTM Capital and founded by researchers from MIT CSAIL, as a CUDA Kernel Engineer.
- This remote, US-based role is an opportunity to architect and refine the GPU performance layer that underpins large-scale, high-throughput AI systems deployed by Fortune 500 clients.
- You will drive the efficiency, throughput, and scalability of mission-critical AI solutions, directly shaping their performance on NVIDIA GPUs.
- The ideal candidate has a deep, practical understanding of NVIDIA GPU architecture, including its memory hierarchy, warp-level execution, and profiling workflows.
- This position is for engineers with hands-on experience writing and optimizing CUDA kernels from scratch, rather than backgrounds in generic hardware, FPGAs, or non-NVIDIA compute platforms.
- Your primary responsibility is the design, implementation, and optimization of custom CUDA kernels, focused on maximizing GPU occupancy, memory throughput, and warp execution efficiency.
- You will use profiling tools such as Nsight Compute, Nsight Systems, nvprof, and CUDA-MEMCHECK to diagnose and resolve performance bottlenecks.
- Key optimization targets include mitigating warp divergence, ensuring memory coalescing, reducing register pressure, and minimizing PCIe transfer overhead.
- A significant part of the role involves improving GPU memory pipelines across global, shared, L2, and texture memory to ensure optimal data access patterns.
- You will collaborate with cross-functional teams, including AI systems engineers, model acceleration specialists, and backend distributed systems developers, to improve GPU performance across the entire AI stack.
- Contribute to strategic GPU architecture decisions, the development of robust kernel libraries, and internal best practices in performance engineering.
- This role demands a proven track record of building NVIDIA CUDA kernels from scratch, going beyond merely calling existing libraries.
- Develop and apply advanced kernel optimization techniques, including tiling strategies, occupancy tuning, efficient shared memory design, and warp scheduling.
- Apply a deep understanding of CUDA's fundamental constructs (threads, warps, blocks, and grids), the GPU memory hierarchy, memory coalescing, and the detection, analysis, and mitigation of warp divergence.
- Diagnose PCIe bottlenecks and optimize host-device data transfers using pinned memory allocation, stream synchronization, batching, and asynchronous transfer overlap.
- Proficiency in C++, the CUDA runtime APIs, and GPU debugging and profiling tools is essential.
- Experience with multi-GPU configurations or distributed GPU systems using NCCL, NVLink, or MIG is a bonus.
- A background in GPU acceleration for machine learning frameworks or high-performance computing (HPC) workloads is highly valued.
- Familiarity with model inference optimization tools such as TensorRT, CUDA Graphs, and CUTLASS is advantageous.
- Exposure to compiler-level optimization or analysis of PTX/SASS code is a strong plus.
- Startup experience, or comfort working in fast-paced, dynamic, and sometimes ambiguous settings, is beneficial.
- The role offers exposure to cutting-edge AI research from the MIT CSAIL founders, direct customer impact with Fortune 500 clients, and a company with a track record of successful exits and strong growth potential.
- You will own critical systems, collaborate with world-class engineers, and tackle GPU/AI performance challenges at scale.
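To illustrate the warp-divergence and memory-coalescing work described above, here is a minimal sketch (kernel names and the ReLU example are illustrative, not from the posting): two versions of the same elementwise operation, one where a data-dependent branch can split a warp, and a branch-free rewrite where every lane executes the same instruction and consecutive threads read consecutive addresses, so each warp's loads coalesce.

```cuda
#include <cuda_runtime.h>

// Divergent version: lanes in the same warp may take different branches
// when input signs are mixed, serializing the two paths.
__global__ void reluDivergent(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (in[i] > 0.0f)   // warp splits here on mixed-sign inputs
            out[i] = in[i];
        else
            out[i] = 0.0f;
    }
}

// Branch-free version: every lane runs the same fmaxf, and thread i
// touches element i, so a warp's 32 loads form contiguous, coalesced
// memory transactions.
__global__ void reluUniform(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fmaxf(in[i], 0.0f);
}
```

In Nsight Compute, the difference shows up in branch-efficiency and memory-transaction metrics rather than in the source.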
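The tiling and shared-memory design the posting mentions is classically demonstrated by a tiled matrix multiply. This is a sketch under the assumption that N is a multiple of the tile size; the kernel name and TILE value are illustrative.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // illustrative tile size; tune per architecture/occupancy

// Each block computes one TILE x TILE patch of C = A * B (N x N, row-major),
// staging tiles of A and B through shared memory so each global value is
// loaded once per block instead of once per thread.
__global__ void matmulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Coalesced loads: adjacent threadIdx.x reads adjacent addresses.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();              // tile fully staged before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();              // all reads done before next load
    }
    C[row * N + col] = acc;
}
```

The choice of TILE trades shared-memory footprint and register pressure against occupancy, which is exactly the tuning loop the role describes.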
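The host-device transfer techniques listed (pinned memory, streams, batching, asynchronous overlap) compose as in this sketch; the chunk sizes, kernel, and two-stream layout are illustrative assumptions, not details from the posting.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* d, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= a;
}

int main() {
    const int N = 1 << 22, CHUNK = 1 << 20;  // illustrative sizes
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));   // pinned host memory: lets the
                                             // DMA engine copy asynchronously
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t s[2];
    for (int k = 0; k < 2; ++k) cudaStreamCreate(&s[k]);

    // Alternate chunks between two streams so chunk k+1's H2D copy
    // overlaps chunk k's kernel and D2H copy.
    for (int off = 0, k = 0; off < N; off += CHUNK, k ^= 1) {
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        scale<<<(CHUNK + 255) / 256, 256, 0, s[k]>>>(d + off, CHUNK, 2.0f);
        cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();

    for (int k = 0; k < 2; ++k) cudaStreamDestroy(s[k]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

Nsight Systems timelines make the copy/compute overlap (or its absence, e.g. with pageable memory) directly visible.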
About Pragmatike Soluciones Tecnológicas S.L.
Pragmatike Soluciones Tecnológicas S.L. is a Spanish technology firm founded in 2014 that delivers custom software, mobile apps, cloud migration, and data analytics. It combines agile development, AI, and DevOps practices to serve the finance, healthcare, retail, and public sectors across Europe and Latin America. Core services include UX/UI design, QA automation, and 24/7 managed support, with ISO 27001-certified processes and multilingual teams in Madrid, Barcelona, and remote hubs.


