
CUDA Kernel Engineer

Job Overview

Location: Remote

Job Type: Full-time

Category: Software Engineering

Date Posted: February 26, 2026

Full Job Description

📋 Description

  • Join a pioneering AI startup, recognized as a Top 10 GenAI company, founded by esteemed researchers from MIT CSAIL, and contribute to the core GPU performance layer of cutting-edge AI systems.
  • This is a unique opportunity to work on the foundational elements that power large-scale, high-throughput AI solutions for Fortune 500 clients, directly impacting the efficiency and scalability of mission-critical applications.
  • You will be instrumental in designing, implementing, and meticulously optimizing custom CUDA kernels from the ground up, specifically tailored for NVIDIA GPUs.
  • The primary focus will be on maximizing GPU performance metrics, including achieving peak occupancy, enhancing memory throughput, and ensuring optimal warp execution efficiency.
  • Engage in in-depth performance analysis of GPU workloads, leveraging industry-standard profiling tools such as NVIDIA Nsight Compute, Nsight Systems, nvprof, and CUDA-MEMCHECK to identify and diagnose performance bottlenecks.
  • Take ownership of analyzing and systematically eliminating performance limitations, including but not limited to warp divergence, uncoalesced memory access patterns, excessive register pressure, and PCIe transfer overhead.
  • Play a key role in refining and optimizing GPU memory pipelines across global memory, shared memory, the L2 cache, and texture memory, ensuring that memory access patterns are coalesced for maximum efficiency.
  • Collaborate dynamically with cross-functional teams, including AI systems engineers, model acceleration specialists, and backend distributed systems developers, to integrate and enhance GPU performance across the entire AI stack.
  • Contribute significantly to strategic GPU architecture decisions, the development of robust kernel libraries, and the establishment and enforcement of internal best practices in performance engineering.
  • This role demands a deep, practical understanding of NVIDIA GPU architecture, including its intricate memory hierarchy, warp-level execution models, and comprehensive profiling workflows.
  • You will be expected to go beyond simply utilizing existing libraries and demonstrate a proven track record of building and optimizing CUDA kernels from scratch.
  • Develop a nuanced understanding of CUDA's execution model, including threads, warps, blocks, and grids, and how they interact with the GPU's memory hierarchy.
  • Gain expertise in diagnosing and mitigating performance issues related to memory coalescing and warp divergence, understanding how to detect, analyze, and resolve these common bottlenecks.
  • Become proficient in identifying and addressing PCIe bottlenecks, optimizing data transfers between the host and device through techniques like pinned memory, asynchronous streams, efficient batching, and overlapping computation with communication.
  • Work with modern C++ and CUDA runtime APIs, utilizing advanced GPU debugging and profiling tools to ensure code correctness and performance.
  • This position offers a direct line of sight into the impact of your work, as the AI solutions you help optimize are deployed by leading Fortune 500 companies.
  • You will be part of an environment that fosters innovation and growth, with a strong research pedigree and a history of successful exits by its alumni, including acquisitions by Databricks, NVIDIA, and CoreWeave.
  • The company has secured significant funding and is poised for further growth, offering a stable yet dynamic environment for career advancement.
  • This role provides an exceptional opportunity for career growth and influence, allowing you to lead AI initiatives, refine critical performance pipelines, and make a tangible impact on production AI systems at a massive scale.
  • Embrace a culture that values autonomy and collaboration, where you can own critical systems while working alongside a team of world-class engineers.
  • Tackle some of the most challenging and aspirational GPU/AI performance problems in the industry, pushing the boundaries of what's possible in AI computation.
  • The company is committed to providing equal employment opportunities and fostering a fair and inclusive hiring process, processing personal data solely for recruitment purposes in accordance with privacy laws.

Skills & Technologies

Remote

About Pragmatike Soluciones Tecnológicas S.L.

Spanish technology firm founded in 2014, delivering custom software, mobile apps, cloud migration, and data analytics. Combines agile development, AI, and DevOps practices to serve finance, healthcare, retail, and public sectors across Europe and Latin America. Core services include UX/UI design, QA automation, and 24/7 managed support, with ISO 27001-certified processes and multilingual teams in Madrid, Barcelona, and remote hubs.
