
Job Overview
Location: Remote
Job Type: Full-time
Category: Machine Learning Engineer
Date Posted: February 12, 2026
Full Job Description
📋 Description
- Join Perplexity, a rapidly growing AI company, as an AI Inference Engineer and play a pivotal role in deploying cutting-edge machine learning models at scale for real-time inference.
- Build and optimize the infrastructure that powers our advanced AI capabilities, ensuring seamless, efficient operation for both internal development teams and external users.
- Work with a modern, high-performance technology stack, including Python, Rust, C++, PyTorch, Triton, CUDA, and Kubernetes, pushing the boundaries of what's possible in AI deployment.
- Develop robust, scalable APIs for AI inference. These APIs are the backbone for various applications, letting other Perplexity teams leverage our AI models and giving external customers direct access to our AI services.
- Tune the performance of our inference systems, including rigorous benchmarking to identify bottlenecks across the entire inference stack, from model execution to network latency.
- Improve the reliability and observability of our production systems: implement comprehensive monitoring, set up alerting, and actively respond to and resolve outages to maintain high availability.
- Explore and implement novel research in LLM inference optimization, such as new techniques for improving inference speed, reducing memory footprint, or enhancing the overall efficiency of large language models.
- Focus areas may include continuous batching, quantization, model pruning, efficient attention mechanisms, and kernel fusion, all aimed at maximizing throughput and minimizing latency.
- Collaborate closely with machine learning researchers and engineers to understand the unique inference requirements of various models and translate those needs into efficient, production-ready solutions.
- Take a proactive approach to system design, ensuring our inference infrastructure is not only performant but also maintainable, scalable, and secure.
- Contribute to the evolution of our deployment strategies, potentially exploring new hardware accelerators or distributed inference techniques.
- Directly impact the user experience of Perplexity products by optimizing our inference pipelines, enabling faster response times and the deployment of more sophisticated AI features.
- Weigh the trade-offs among different optimization techniques and select and implement the most appropriate solution for each use case.
- Help keep Perplexity at the forefront of AI innovation by providing the underlying infrastructure for our state-of-the-art models.
- Work on large-scale deployments where your contributions have a significant, measurable impact on the performance and availability of services used by millions.
- Gain hands-on experience with the latest advances in AI infrastructure and MLOps practices.
- Ideal for someone passionate about performance optimization, distributed systems, and the practical application of machine learning.
- Be empowered to make architectural decisions and drive technical initiatives within the inference domain.
- Keep learning: staying abreast of new research and technologies in LLM inference is crucial for success.
- Join a dynamic, collaborative team that values innovation and technical excellence.
- Directly support Perplexity's mission to make information accessible and useful through advanced AI.
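One of the techniques named above, continuous batching, can be illustrated with a dependency-free toy simulation. This is a sketch of the scheduling idea only, not any particular serving framework's implementation; all names here are illustrative. It contrasts a static batcher, where a whole batch waits for its longest sequence, with a continuous batcher that backfills a freed slot as soon as any sequence finishes decoding.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    tokens_left: int  # decode steps remaining for this sequence

def static_batching(requests, max_batch: int) -> int:
    """Baseline: a new batch starts only after the previous one fully drains.
    Returns the total number of decode steps to serve all requests."""
    queue = deque(requests)
    steps = 0
    while queue:
        batch = [queue.popleft() for _ in range(min(max_batch, len(queue)))]
        # The whole batch is held until its longest sequence completes.
        steps += max(r.tokens_left for r in batch)
    return steps

def continuous_batching(requests, max_batch: int) -> int:
    """Continuous (in-flight) batching: freed slots are backfilled from the
    waiting queue on every decode step, so short requests don't stall behind
    long ones. Returns the total number of decode steps."""
    queue = deque(requests)
    active: list[Request] = []
    steps = 0
    while queue or active:
        # Admit waiting requests into any free batch slots.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        # One decode step advances every active sequence by one token.
        for r in active:
            r.tokens_left -= 1
        active = [r for r in active if r.tokens_left > 0]
        steps += 1
    return steps
```

For example, one long request (8 tokens) queued with three short ones (2 tokens each) at batch size 2 drains in 8 steps under continuous batching versus 10 under static batching, since the short requests slipstream into the slot the finished ones vacate instead of waiting for the long sequence.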
🎯 Requirements
- Proven experience with machine learning systems and deep learning frameworks such as PyTorch, TensorFlow, or ONNX.
- Familiarity with common large language model (LLM) architectures and inference optimization techniques (e.g., continuous batching, quantization, model pruning, efficient attention mechanisms).
- Solid understanding of GPU architectures and/or experience with GPU kernel programming using CUDA.
- Proficiency in Python and experience with systems programming languages like Rust or C++.
🏖️ Benefits
- Competitive salary and equity package.
- Comprehensive health, dental, and vision insurance.
- Generous paid time off and holidays.
- Remote work flexibility.
- Opportunity to work on groundbreaking AI technology with a talented team.
Skills & Technologies
Python
Rust
Kubernetes
TensorFlow
PyTorch
About Perplexity AI, Inc.
Perplexity AI operates an AI-powered conversational search engine that answers queries by synthesizing live web information. The platform combines large language models with real-time retrieval, citing sources for transparency. Founded in 2022, the San Francisco-based company offers free and subscription tiers, mobile apps, and browser extensions, targeting consumers and enterprises seeking accurate, verifiable answers instead of traditional link lists.