
Job Overview
Location
Remote
Job Type
Full-time
Category
Software Engineering
Date Posted
February 26, 2026
Full Job Description
- Join a pioneering AI startup, recognized as a Top 10 GenAI company and founded by researchers from MIT CSAIL, and contribute to the core GPU performance layer of cutting-edge AI systems.
- This is a unique opportunity to work on the foundational elements that power large-scale, high-throughput AI solutions for Fortune 500 clients, directly impacting the efficiency and scalability of mission-critical applications.
- You will design, implement, and optimize custom CUDA kernels from the ground up, tailored specifically for NVIDIA GPUs.
- The primary focus is maximizing GPU performance: achieving high occupancy, improving memory throughput, and ensuring efficient warp execution.
- Perform in-depth performance analysis of GPU workloads, using industry-standard profiling tools such as NVIDIA Nsight Compute, Nsight Systems, nvprof, and CUDA-MEMCHECK to identify and diagnose performance bottlenecks.
- Take ownership of analyzing and systematically eliminating performance limitations, including warp divergence, uncoalesced memory access patterns, excessive register pressure, and PCIe transfer overhead.
- Refine and optimize GPU memory pipelines across global, shared, L2, and texture memory, ensuring that memory access patterns are coalesced for maximum efficiency.
- Collaborate closely with cross-functional teams, including AI systems engineers, model acceleration specialists, and backend distributed systems developers, to integrate and improve GPU performance across the entire AI stack.
- Contribute to strategic GPU architecture decisions, the development of robust kernel libraries, and the establishment of internal best practices in performance engineering.
- This role demands a deep, practical understanding of NVIDIA GPU architecture, including its memory hierarchy, warp-level execution model, and profiling workflows.
- You will be expected to go beyond using existing libraries and demonstrate a proven track record of building and optimizing CUDA kernels from scratch.
- Develop a nuanced understanding of CUDA's execution model (threads, warps, blocks, and grids) and how it interacts with the GPU's memory hierarchy.
- Gain expertise in diagnosing and mitigating performance issues related to memory coalescing and warp divergence: how to detect, analyze, and resolve these common bottlenecks.
- Become proficient at identifying and addressing PCIe bottlenecks, optimizing host-device transfers through techniques such as pinned memory, asynchronous streams, efficient batching, and overlapping computation with communication.
- Work with modern C++ and the CUDA runtime API, using advanced GPU debugging and profiling tools to ensure code correctness and performance.
- This position offers a direct line of sight into the impact of your work: the AI solutions you help optimize are deployed by leading Fortune 500 companies.
- You will join an environment that fosters innovation and growth, with a strong research pedigree and a history of successful exits by its alumni, including acquisitions by Databricks, Nvidia, and CoreWeave.
- The company has secured significant funding and is poised for further growth, offering a stable yet dynamic environment for career advancement.
- This role provides an exceptional opportunity for growth and influence: lead AI initiatives, refine critical performance pipelines, and make a tangible impact on production AI systems at massive scale.
- Embrace a culture that values autonomy and collaboration, where you can own critical systems while working alongside world-class engineers.
- Tackle some of the most challenging GPU/AI performance problems in the industry, pushing the boundaries of what's possible in AI computation.
- The company is committed to equal employment opportunity and a fair, inclusive hiring process, and processes personal data solely for recruitment purposes in accordance with privacy laws.
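The profiling tools named above are typically driven from the command line; as an illustrative sketch (the binary name `my_app` is a placeholder), a first-pass workflow might look like:

```cuda
// Typical profiling invocations, written here as CUDA-style comments:
//
//   nsys profile -o timeline ./my_app
//       System-wide timeline: CPU threads, CUDA API calls, streams, copies.
//
//   ncu --set full -o kernels ./my_app
//       Per-kernel metrics from Nsight Compute: occupancy, memory
//       throughput, warp execution efficiency.
//
//   compute-sanitizer --tool memcheck ./my_app
//       Memory-error checking (compute-sanitizer is the successor to the
//       CUDA-MEMCHECK tool mentioned in the listing).
```

In practice one starts with the Nsight Systems timeline to find where time goes, then drills into individual kernels with Nsight Compute.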
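The coalescing and warp-divergence items above map to concrete kernel-level patterns. A minimal sketch, with a hypothetical elementwise kernel (name and launch configuration are illustrative):

```cuda
#include <cuda_runtime.h>

// Hypothetical elementwise kernel illustrating two of the concerns above:
// - coalesced access: consecutive threads in a warp touch consecutive
//   floats, so each warp's loads and stores collapse into a few wide
//   memory transactions;
// - warp divergence avoided: the conditional is replaced by branch-free
//   arithmetic (fmaxf), so all 32 threads of a warp follow the same path.
__global__ void relu_scale(const float* __restrict__ in,
                           float* __restrict__ out,
                           float scale, int n) {
    // Grid-stride loop: handles any n without retuning the launch config.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        out[i] = scale * fmaxf(in[i], 0.0f);  // branch-free ReLU + scale
    }
}

// Illustrative launch: relu_scale<<<256, 256>>>(d_in, d_out, 2.0f, n);
```

A profiler such as Nsight Compute would confirm the coalescing by reporting near-minimal sectors per request on the global loads and stores.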
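The PCIe-bottleneck bullet names pinned memory, asynchronous streams, and copy/compute overlap together. A minimal sketch of that pattern, assuming a CUDA-capable device (kernel, sizes, and chunk count are hypothetical):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale_kernel(float* d, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= s;
}

int main() {
    const int n = 1 << 22, chunks = 4, chunk = n / chunks;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));  // pinned (page-locked) host buffer:
                                            // required for truly async copies
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t s[chunks];
    for (int c = 0; c < chunks; ++c) cudaStreamCreate(&s[c]);

    for (int c = 0; c < chunks; ++c) {
        size_t off = (size_t)c * chunk, bytes = chunk * sizeof(float);
        // Copy, compute, and copy back per chunk; chunks issued to different
        // streams overlap H2D copies, kernel execution, and D2H copies on
        // hardware with separate copy and compute engines.
        cudaMemcpyAsync(d + off, h + off, bytes, cudaMemcpyHostToDevice, s[c]);
        scale_kernel<<<(chunk + 255) / 256, 256, 0, s[c]>>>(d + off, 2.0f, chunk);
        cudaMemcpyAsync(h + off, d + off, bytes, cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %.1f\n", h[0]);

    for (int c = 0; c < chunks; ++c) cudaStreamDestroy(s[c]);
    cudaFreeHost(h); cudaFree(d);
}
```

The Nsight Systems timeline makes the effect visible: with pageable memory and a single stream the copies and kernels serialize; with pinned buffers and multiple streams they interleave.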
About Pragmatike Soluciones Tecnológicas S.L.
A Spanish technology firm founded in 2014, delivering custom software, mobile apps, cloud migration, and data analytics. It combines agile development, AI, and DevOps practices to serve the finance, healthcare, retail, and public sectors across Europe and Latin America. Core services include UX/UI design, QA automation, and 24/7 managed support, with ISO 27001-certified processes and multilingual teams in Madrid, Barcelona, and remote hubs.
