Castai Group Inc.

Senior ML Engineer - Kimchi (LLM Inference Optimization)

Job Overview

Location

Poland

Job Type

Full-time

Category

Software Engineering

Date Posted

May 25, 2026

Full Job Description

📋 Description

  • Optimize LLM inference performance by directly improving throughput, latency, and KV cache utilization across production-grade systems, with measurable impact on customer p99 latency and company P&L.
  • Lead the technical direction of inference optimization at Kimchi, owning the roadmap for model-serving efficiency rather than executing predefined tasks.
  • Tune inference engines including vLLM, SGLang, and TensorRT-LLM at the kernel level to maximize GPU utilization and push performance ceilings for each GPU SKU.
  • Reduce latency by profiling and resolving actual bottlenecks in TTFT and TPOT, distinguishing between compute, memory bandwidth, scheduling, and networking constraints.
  • Maximize KV cache efficiency through paged attention, prefix caching, eviction policy optimization, and quantized KV cache implementations to unlock unrealized throughput.
  • Implement and validate quantization strategies (INT8, INT4, FP8) across weights, activations, and KV cache, measuring real-world quality regressions on production workloads, not just perplexity benchmarks.
  • Decrease cold start times and memory footprint through faster model initialization, smarter weight loading, and precise memory accounting to enable scalable deployment.
  • Design and deploy distributed inference topologies across multi-node, multi-GPU environments, optimizing for network-aware placement, checkpointing, and interconnect efficiency.
  • Instrument and measure all optimizations rigorously, ensuring improvements are real and reproducible, not artifacts of benchmarking conditions.
  • Select, adopt, or build core components of the inference stack based on empirical evidence, documenting decisions with clear writeups and reproducible experiments.
  • Collaborate with a global team of cloud infrastructure experts to integrate inference optimization into Kubernetes-based, autonomous cloud-native environments.
  • Work with a full-stack infrastructure toolchain including Python, PyTorch, CUDA-adjacent tooling, Kubernetes, gRPC, ClickHouse, PostgreSQL, GCP Pub/Sub, AWS/GCP/Azure, GitLab CI, ArgoCD, Prometheus, Grafana, Loki, and Tempo.
  • Drive continuous improvement in inference efficiency by adapting to shifting workload patterns including traffic shape, sequence-length distribution, batch dynamics, and hardware variability.
  • Own end-to-end delivery of inference optimizations from hypothesis to production deployment, with most projects completed within 1 to 4 weeks.
  • Allocate 10% of work time to personal projects or skill development to foster innovation and technical growth.
  • Participate in annual company hackathons to generate new ideas and strengthen team collaboration.

🎯 Requirements

  • 5+ years building real ML systems with demonstrated depth in inference or training infrastructure (not just model training notebooks)
  • Strong Python proficiency in production services, not scripting
  • Hands-on experience with at least one of vLLM, SGLang, or TensorRT-LLM and a working mental model of inference engine behavior on GPU hardware
  • Fluency with quantization tradeoffs, including empirical measurement of quality regressions beyond compression ratios
  • Comfort with distributed systems: collective communication, sharding strategies, and failure modes in multi-GPU and multi-node setups
  • Bias toward measurement: instrument before optimizing, distinguish real performance gains from benchmark artifacts

🏖️ Benefits

  • Competitive salary based on experience
  • Flexible, remote-first global work environment
  • Equity options
  • Learning budget for professional development, including access to international conferences and courses
  • Annual hackathon to spark innovation and team bonding
  • Equipment budget to ensure optimal hardware setup
  • Extra days off to support work-life balance
  • Team-building budget and company events

Skills & Technologies

Python
Node.js
PostgreSQL
AWS
Azure
Data Science
Senior
Remote


Castai Group Inc.

About Castai Group Inc.

Castai Group Inc. provides specialized investment and strategic advisory services to middle-market companies across North America. The firm focuses on mergers and acquisitions, private placements, restructuring, and growth capital transactions for businesses in manufacturing, consumer goods, and business services sectors. Headquartered in New York, Castai operates a lean, senior-led model that emphasizes direct principal involvement, rigorous financial analysis, and long-term client partnerships. Its principals have executed more than 200 transactions totaling over $10 billion in aggregate value since inception.
