
Job Overview
Location
Spain
Job Type
Full-time
Category
Software Engineering
Date Posted
May 25, 2026
Full Job Description
📋 Description
- Optimize LLM inference performance by directly improving throughput, latency, and KV cache utilization across production systems, with measurable impact on customer p99 latency and company P&L.
- Lead the technical direction of inference optimization at Kimchi, owning the roadmap for model serving efficiency rather than executing someone else’s plan.
- Tune inference engines including vLLM, SGLang, and TensorRT-LLM at the kernel level to maximize GPU utilization and push performance ceilings for each GPU SKU.
- Attack TTFT (time to first token) and TPOT (time per output token) independently by profiling real bottlenecks (compute, memory bandwidth, scheduling, or networking) and implementing targeted fixes based on empirical data.
- Enhance KV cache efficiency through paged attention, prefix caching, eviction policy optimization, and quantized KV cache implementations to unlock unrealized throughput.
- Design and implement quantization strategies (INT8, INT4, FP8) for weights, activations, and KV cache while empirically measuring and minimizing quality regression on real-world workloads, not just benchmark metrics.
- Reduce cold start times and memory footprint through faster model initialization, smarter weight loading, and precise memory accounting to enable scalable inference deployment.
- Design and deploy distributed inference topologies across multi-node, multi-GPU clusters, optimizing for network-aware placement, checkpointing, and interconnect efficiency.
- Instrument and measure all optimizations rigorously, distinguishing real performance gains from benchmark artifacts, and ensuring every change is data-driven and reproducible.
- Collaborate with a global team of cloud infrastructure experts to integrate inference optimization into Kubernetes-based autonomous infrastructure platforms used by over 2,100 companies.
- Define benchmarking priorities, evaluate new tools and frameworks, and document technical decisions through clear writeups and reproducible experiments to align the team.
- Maintain and extend the production stack including Python, PyTorch, CUDA-adjacent tooling, Kubernetes, gRPC, ClickHouse, PostgreSQL, GCP Pub/Sub, AWS/GCP/Azure, GitLab CI, ArgoCD, Prometheus, Grafana, Loki, and Tempo.
- Balance autonomy with ownership: operate with a wide mandate to innovate, experiment, and drive technical decisions in a fast-paced environment where feature cycles are 1–4 weeks.
🎯 Requirements
- 5+ years building real ML systems with depth in inference or training infrastructure (not just model training notebooks)
- Strong Python skills for production services, not just scripting
- Hands-on experience with at least one of vLLM, SGLang, or TensorRT-LLM and a working mental model of inference engine behavior on GPU hardware
- Fluency with quantization tradeoffs, including measured quality regressions on real workloads
- Comfort with distributed systems: collective communication, sharding strategies, and failure modes in multi-GPU and multi-node setups
- Bias toward measurement: instrument before optimizing, and distinguish real wins from benchmark artifacts
🏖️ Benefits
- Competitive salary based on experience
- Flexible, remote-first global environment
- Equity options
- 10% work time allocated to personal projects or self-improvement
- Learning budget for professional development, including access to international conferences and courses
- Annual hackathon and team-building budget for company events
About Castai Group Inc.
Castai Group Inc. provides specialized investment and strategic advisory services to middle-market companies across North America. The firm focuses on mergers and acquisitions, private placements, restructuring, and growth capital transactions for businesses in manufacturing, consumer goods, and business services sectors. Headquartered in New York, Castai operates a lean, senior-led model that emphasizes direct principal involvement, rigorous financial analysis, and long-term client partnerships. Its principals have executed more than 200 transactions totaling over $10 billion in aggregate value since inception.