
Job Overview
Location: Remote
Job Type: Full-time
Category: Machine Learning Engineer
Date Posted: January 26, 2026
Full Job Description
📋 Description
- Featherless AI is at the forefront of AI innovation, and we are seeking a highly skilled and motivated AI Researcher specializing in Inference Optimization to join our dynamic team. In this pivotal role, you will design, evaluate, and deploy cutting-edge, high-performance inference systems tailored for large-scale machine learning models, operating at the critical intersection of advanced model architecture, robust systems engineering, and hardware-aware optimization techniques. Your primary objective will be to reduce latency and cost per inference while increasing throughput, ensuring our AI models perform optimally in demanding real-world production environments.
- Your responsibilities will span a broad spectrum of research and development activities. You will research and develop novel techniques to dramatically improve the inference performance of large neural networks, with a deep focus on key metrics: latency (the time a model takes to produce an output), throughput (the number of inferences a system can handle per unit of time), memory efficiency (minimizing the memory required for inference), and overall cost per inference, making our AI solutions more scalable and economically viable. (A minimal benchmarking sketch for the first two metrics appears after this list.)
- A significant part of your role will involve designing and evaluating model-level optimizations. These include quantization (reducing the precision of model weights and activations; see the PyTorch sketch after this list), pruning (removing redundant model parameters), KV-cache optimization (improving the efficiency of attention mechanisms in transformer models), and architecture-aware simplifications that preserve accuracy while reducing computational overhead. You will also implement systems-level optimizations, including dynamic batching (grouping incoming requests to maximize hardware utilization; also sketched below), kernel fusion (combining multiple operations into a single GPU kernel), multi-GPU inference strategies for distributing workloads, and optimizing the distinct computational phases of prefill (processing the initial prompt) versus decode (generating subsequent tokens).
- You will rigorously benchmark inference workloads across a variety of hardware accelerators, including GPUs and potentially specialized AI chips, to understand performance characteristics and identify bottlenecks. A crucial aspect of this role is close collaboration with our engineering teams: you will work hand in hand with them to integrate and deploy the optimized inference pipelines you develop, ensuring a smooth transition from research to production.
- Your ability to translate complex research insights into practical, production-ready improvements will be highly valued. This involves not just theoretical exploration but the practical application of your findings to real-world challenges, contributing to the continuous improvement of our AI infrastructure and driving performance gains that directly impact our products and services. The ideal candidate has a strong academic or practical background in machine learning, deep learning, or AI systems, coupled with hands-on experience optimizing inference for large-scale models. Proficiency in Python and modern ML frameworks such as PyTorch is essential, as is experience with industry-standard inference tooling like Triton, TensorRT, vLLM, or ONNX Runtime (a short vLLM example follows this list). The ability to design well-controlled experiments, analyze results, and communicate findings clearly to both technical and non-technical audiences is paramount.
- We are particularly interested in candidates who have deployed production inference systems at scale and understand the challenges and best practices of bringing AI models to live environments. Familiarity with distributed and multi-GPU inference techniques is a significant advantage, as is a history of contributing to open-source ML or inference frameworks, showcasing a commitment to the broader AI community. Authorship or co-authorship of peer-reviewed research papers in machine learning, systems, or related fields will be highly regarded. Experience working close to the hardware, including proficiency with CUDA, ROCm, and profiling tools, will allow you to dig into performance bottlenecks at the lowest levels.
- Success in this role will be measured by tangible outcomes: demonstrable gains in latency, throughput, and cost efficiency; optimized inference systems running reliably and efficiently in production; research ideas translated into deployable systems; and clear benchmarks and comprehensive documentation that inform critical product and engineering decisions. Bonus points for experience or interest in advanced research areas such as long-context inference optimization, speculative decoding (sketched after this list), KV-cache compression and paging, efficient decoding strategies, and hardware-aware inference design, which are key areas for future breakthroughs.
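To make the quantization technique above concrete, here is a minimal PyTorch sketch of post-training dynamic quantization. The toy two-layer model is an assumption for illustration only; work at this level targets much larger transformer stacks and more aggressive schemes (fp8, 4-bit):

```python
import torch
import torch.nn as nn

# Toy model standing in for a much larger network (illustrative only).
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).eval()

# Post-training dynamic quantization: Linear weights are stored as int8 and
# activations are quantized on the fly, cutting memory use and often speeding
# up CPU inference at a small accuracy cost.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
with torch.no_grad():
    err = (model(x) - quantized(x)).abs().max()
print(f"max abs error vs fp32: {err:.4f}")
```

The research question this exposes is exactly the trade-off named in the bullet: how far precision can drop before task accuracy degrades measurably.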
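Dynamic batching can likewise be sketched in a few dozen lines. The `DynamicBatcher` class and its parameters are hypothetical simplifications; production servers such as Triton or vLLM implement far more sophisticated continuous batching:

```python
import queue
import threading
import time

class DynamicBatcher:
    """Group incoming requests into batches: wait up to max_wait_s for up
    to max_batch requests, then run them through the model in one call,
    trading a little latency for much better accelerator utilization."""

    def __init__(self, model_fn, max_batch=8, max_wait_s=0.01):
        self.model_fn = model_fn   # callable: list of inputs -> list of outputs
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, item):
        """Called by request handlers; blocks until the batched result is ready."""
        done = threading.Event()
        box = {"done": done}
        self.requests.put((item, box))
        done.wait()
        return box["result"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]        # block for the first request
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            results = self.model_fn([item for item, _ in batch])  # one batched call
            for (_, box), result in zip(batch, results):
                box["result"] = result
                box["done"].set()

# Usage: batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs])
#        print(batcher.submit(21))  # -> 42
```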
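The latency and throughput metrics defined earlier only mean something with careful measurement. A minimal benchmarking harness, assuming a PyTorch model; the key detail is synchronizing around the timed region, since CUDA kernels launch asynchronously:

```python
import time
import torch

@torch.no_grad()
def benchmark(model, inputs, warmup=10, iters=100):
    """Return (mean latency in seconds, throughput in samples/second)."""
    device = next(model.parameters()).device
    for _ in range(warmup):          # warm up caches, autotuners, JIT
        model(inputs)
    if device.type == "cuda":
        torch.cuda.synchronize()     # wait for queued kernels to finish
    start = time.perf_counter()
    for _ in range(iters):
        model(inputs)
    if device.type == "cuda":
        torch.cuda.synchronize()     # measure device time, not launch time
    latency = (time.perf_counter() - start) / iters
    return latency, inputs.shape[0] / latency
```

For LLM serving, one would additionally separate time-to-first-token (dominated by prefill) from per-token decode latency, since the two phases stress the hardware very differently.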
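Of the inference tooling named above, vLLM has a particularly compact entry point. A minimal usage sketch (the model name is illustrative, not a statement about Featherless's deployment):

```python
from vllm import LLM, SamplingParams

# Load an open-source model; vLLM handles paged KV-cache management
# and continuous batching internally.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain KV-cache paging in one paragraph."], params)
print(outputs[0].outputs[0].text)
```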
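Finally, among the bonus research areas, speculative decoding is straightforward to sketch in its greedy form. This is a simplified illustration assuming batch size 1 and HuggingFace-style models whose forward pass returns `.logits`; production implementations verify sampled (not just greedy) tokens with an acceptance-rejection rule:

```python
import torch

@torch.no_grad()
def speculative_decode_greedy(target, draft, ids, k=4, steps=32):
    """Greedy speculative decoding: a cheap draft model proposes k tokens,
    the expensive target model verifies them in ONE forward pass, and we
    keep the longest agreeing prefix. Output matches plain greedy decoding
    with the target, but needs fewer target passes when the draft agrees."""
    for _ in range(steps):
        n = ids.shape[1]
        # 1) Draft proposes k tokens autoregressively (cheap).
        proposal = ids
        for _ in range(k):
            nxt = draft(proposal).logits[:, -1].argmax(-1, keepdim=True)
            proposal = torch.cat([proposal, nxt], dim=-1)
        # 2) Target scores the whole proposal in a single pass (expensive).
        logits = target(proposal).logits
        target_pick = logits[:, n - 1 : n + k - 1].argmax(-1)  # target's choice per slot
        drafted = proposal[:, n : n + k]
        # 3) Accept the longest prefix where draft and target agree (batch=1).
        mismatch = (target_pick != drafted)[0].int().cumsum(0)
        accept = int((mismatch == 0).sum())
        # 4) Append accepted tokens, plus one token from the target itself.
        if accept == k:   # full agreement: bonus token from the last position
            bonus = logits[:, n + k - 1].argmax(-1, keepdim=True)
        else:             # first disagreement: take the target's own pick
            bonus = target_pick[:, accept : accept + 1]
        ids = torch.cat([ids, drafted[:, :accept], bonus], dim=-1)
    return ids
```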
About Featherless AI Inc.
Featherless AI Inc. provides serverless LLM hosting, offering developers and AI teams worldwide instant access to a continually expanding library of over 17,300 open-source models. Their platform facilitates seamless deployment for fine-tuning, testing, and production, empowering diverse applications from AI software development to creative writing platforms. Featherless distinguishes itself by eliminating the burden of server management and significantly reducing inference costs, providing transparent, flat-rate pricing with unlimited tokens. As an AI research lab, they pioneer open-source, post-transformer model research and aim to make advanced AI more accessible and affordable for a global customer base, supporting innovation across various industries.