
Applied AI Evaluation Scientist

Job Overview

Location

Indiana, USA

Job Type

Full-time

Category

Data Scientist

Date Posted

February 25, 2026

Full Job Description

📋 Description

Jump Technologies is at the forefront of empowering financial advisors with AI-driven solutions, automating critical tasks such as meeting preparation, note-taking, compliance documentation, CRM updates, client recaps, and follow-up actions. Since our launch in January 2024, we have rapidly expanded to serve over 30,000 users across a spectrum of firms, from solo practitioners to large enterprise Registered Investment Advisors (RIAs) and independent broker-dealers. Our growth is further solidified by strategic partnerships with industry leaders like LPL Financial, Sanctuary Wealth, and Osaic.

As a Series A company that has raised $30 million in venture capital from prominent investors including Battery Ventures, Citi Ventures, Sorenson Capital, and Pelion Venture Partners, Jump is built on a robust financial foundation. Our team of over 100 dedicated professionals includes seasoned leaders from organizations such as Google, Stripe, JP Morgan, Snowflake, Fidelity, BILL, Apple, Harvard, and Stanford, bringing a wealth of expertise and innovation to our mission.

We are seeking a highly skilled and motivated Applied AI Evaluation Scientist to join our AI/ML Quality team, reporting to Engineering leadership. This pivotal role sits at the intersection of data science, information retrieval, machine learning, and product strategy. You will define, build, and execute rigorous evaluation frameworks that ensure the quality, reliability, and trustworthiness of our AI/ML systems.

The primary focus of this role is optimizing our Agentic Retrieval-Augmented Generation (RAG) pipelines. This involves a deep dive into every stage of the RAG process: how data is chunked, how information is embedded, how effective the retrieval mechanisms are, the quality of generated responses, and the overall user experience. Your expertise will be crucial in improving how we retrieve and synthesize information to deliver accurate and relevant outputs for our users.

Beyond RAG, your responsibilities will extend to evaluating other AI/ML systems deployed across the company, ensuring a consistently high standard of performance and reliability throughout our AI offerings. We are looking for an individual with exceptional judgment, capable of discerning which aspects of our AI systems require rigorous evaluation and which do not, with the statistical acumen to ensure all evaluations are sound, realistic, and yield actionable insights.

A key aspect of this role involves balancing resource constraints with the imperative for rapid iteration and improvement. Your ability to identify the most critical metrics and measurement methodologies will directly drive enhancements that benefit our customers. You will collaborate closely with Product and Engineering teams, translating qualitative product needs into quantitative evaluation criteria and ensuring our AI systems align with user expectations and business objectives.

While your code does not need to be production-hardened, it must be effective in achieving its intended outcomes. We value research-quality Python code, clear and reproducible experimental notebooks, and a methodical approach to experimentation, rather than the development of bulletproof microservices.
**Key Responsibilities include:**

**Agentic RAG Pipeline Evaluation & Optimization:**

• Designing and curating comprehensive evaluation datasets for retrieval quality, encompassing synthetically generated query-answer-context triplets, adversarial test cases, and meticulously selected gold sets derived from actual user queries.
• Measuring retrieval effectiveness using established metrics such as Recall@k, Precision@k, Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG@k), with a keen understanding of their applicability and limitations for specific use cases (a sketch of these metrics follows this list).
• Recommending and implementing data cleaning and normalization strategies to mitigate the impact of noise in real-world data, thereby enhancing the discriminative power of retrieval algorithms, optimizing LLM context windows, and reducing irrelevant downstream responses.
• Evaluating and optimizing chunking strategies through systematic grid searches over chunk size, overlap, and segmentation methods, understanding the cascading effects of these decisions on retrieval and generation quality (see the grid-search sketch after this list).
• Assessing embedding and re-ranking strategies by benchmarking various embedding models, evaluating the performance of re-rankers, and quantifying their downstream impact on the quality of generated content.
• Evaluating generation quality within the context of retrieved information, measuring faithfulness, relevance, hallucination rates, and omissions through a combination of automated checks, LLM-as-judge methodologies, and targeted human review.
• Attributing failures across the entire RAG pipeline, accurately diagnosing whether a suboptimal answer stems from data cleanliness issues, retrieval inaccuracies, ineffective chunking, generation errors, or complex inter-component interactions, and developing diagnostic tooling to isolate root causes.
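For concreteness, here is a minimal pure-Python sketch of the four retrieval metrics named above, assuming each query yields a ranked list of retrieved chunk IDs and a human-labeled gold set of relevant IDs (the variable names and example data are illustrative, not from any Jump system):

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant item; 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: gold set {"c3", "c7"}; the retriever returned c1, c3, c9, c7, c2.
retrieved = ["c1", "c3", "c9", "c7", "c2"]
relevant = {"c3", "c7"}
print(recall_at_k(retrieved, relevant, 3))     # 0.5   (only c3 in the top 3)
print(precision_at_k(retrieved, relevant, 3))  # 0.333 (1 of 3 is relevant)
print(mrr(retrieved, relevant))                # 0.5   (first hit at rank 2)
print(ndcg_at_k(retrieved, relevant, 5))       # ~0.651
```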
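And a sketch of the chunking grid search mentioned above. The sweep structure is the point; `build_index` and `evaluate_retrieval` are hypothetical caller-supplied hooks standing in for whatever indexing and metric code the team already has:

```python
from itertools import product

def chunk_text(text, chunk_size, overlap):
    """Fixed-size character chunking with overlap; sentence- or
    heading-aware segmentation would be swept the same way."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

def grid_search_chunking(corpus, gold_set, chunk_sizes, overlaps,
                         build_index, evaluate_retrieval):
    """Sweep chunk size and overlap, scoring each configuration against a
    gold set. build_index and evaluate_retrieval are hypothetical hooks:
    one builds an index over chunks, the other returns a scalar score
    (e.g., mean Recall@10 over the gold queries)."""
    results = []
    for size, overlap in product(chunk_sizes, overlaps):
        if overlap >= size:
            continue  # degenerate configuration: step would be <= 0
        chunks = [c for doc in corpus for c in chunk_text(doc, size, overlap)]
        index = build_index(chunks)
        score = evaluate_retrieval(index, gold_set)
        results.append({"chunk_size": size, "overlap": overlap, "score": score})
    return sorted(results, key=lambda r: r["score"], reverse=True)

# Example sweep (hooks omitted):
# best = grid_search_chunking(docs, gold, [256, 512, 1024], [0, 64, 128],
#                             build_index, evaluate_retrieval)[0]
```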
**Broader AI/ML Evaluation:**

• Conducting systematic error analysis on AI/ML system outputs, including detailed trace reading, identification of failure modes using open and axial coding techniques, and the construction of structured failure taxonomies.
• Designing and validating LLM-as-Judge evaluators where appropriate, including prompt engineering, data splitting for train/dev/test sets, iterative refinement, and performance measurement against human-labeled ground truth (see the validation sketch after this list).
• Estimating true success rates using imperfect judges by applying bias-correction techniques (e.g., Rogan-Gladen) and bootstrap confidence intervals to provide statistically robust performance estimates (a worked sketch also follows this list).
• Building and maintaining golden datasets to facilitate CI regression testing for AI pipelines, ensuring continuous quality assurance.
• Ruthlessly prioritizing evaluation efforts by assessing which failure modes warrant significant investment versus those that can be resolved through prompt clarification or tool description adjustments.
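A minimal sketch of validating an LLM judge against human labels, assuming `judge` is a hypothetical callable returning True/False per example and each labeled item is an (example, human_label) pair. Sensitivity and specificity measured here feed directly into the bias correction shown next:

```python
import random

def split_labeled(examples, seed=0, frac=(0.6, 0.2, 0.2)):
    """Shuffle human-labeled examples into train/dev/test splits.
    The train split drives prompt iteration; test is touched once at the end."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    a, b = int(frac[0] * n), int((frac[0] + frac[1]) * n)
    return shuffled[:a], shuffled[a:b], shuffled[b:]

def judge_sensitivity_specificity(judge, labeled):
    """Compare judge verdicts with human labels on a held-out split.
    `judge` is a hypothetical hook; swap in the real evaluator."""
    tp = fp = tn = fn = 0
    for example, human_label in labeled:
        verdict = judge(example)
        if human_label:
            tp += int(verdict)
            fn += int(not verdict)
        else:
            fp += int(verdict)
            tn += int(not verdict)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # TPR on true passes
    specificity = tn / (tn + fp) if tn + fp else 0.0  # TNR on true failures
    return sensitivity, specificity
```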
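And a worked sketch of the Rogan-Gladen correction with a percentile bootstrap, assuming judge sensitivity and specificity were measured as above and treated as fixed (a fuller treatment would also resample the judge-validation set):

```python
import random

def rogan_gladen(p_obs, sensitivity, specificity):
    """Correct an observed judge pass rate for judge error:
    theta = (p_obs + specificity - 1) / (sensitivity + specificity - 1),
    clipped to [0, 1]. Meaningful when sensitivity + specificity > 1."""
    denom = sensitivity + specificity - 1.0
    theta = (p_obs + specificity - 1.0) / denom
    return min(max(theta, 0.0), 1.0)

def bootstrap_ci(verdicts, sensitivity, specificity,
                 n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap over judge verdicts (a list of 0/1 values),
    returning an approximate (1 - alpha) interval for the corrected rate."""
    rng = random.Random(seed)
    n = len(verdicts)
    estimates = []
    for _ in range(n_boot):
        sample = [verdicts[rng.randrange(n)] for _ in range(n)]
        estimates.append(rogan_gladen(sum(sample) / n, sensitivity, specificity))
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Worked example: the judge passes 72% of outputs, and on human-labeled data
# it showed sensitivity 0.90 and specificity 0.85. Corrected point estimate:
# (0.72 + 0.85 - 1) / (0.90 + 0.85 - 1) = 0.57 / 0.75 = 0.76.
print(rogan_gladen(0.72, 0.90, 0.85))  # 0.76
```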
**Collaboration & Data Review:**

• Partnering closely with Product Management to deeply understand desired outcomes for specific use cases and translating qualitative product requirements into precise, measurable evaluation criteria.
• Collaborating with Engineering to instrument pipelines for enhanced observability, design effective trace logging mechanisms, and integrate evaluation checks seamlessly into CI/CD workflows.
• Designing and building lightweight review interfaces, or working with engineers to develop them, that streamline the process for domain experts to review traces, label data, and provide structured feedback efficiently.
• Leading or facilitating annotation workflows, including defining clear rubrics, measuring inter-annotator agreement (e.g., Cohen's Kappa; see the sketch after this list), conducting alignment sessions, and producing consensus-labeled datasets.
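A minimal sketch of two-rater Cohen's Kappa over categorical labels, with illustrative example data (not real annotations):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the chance agreement implied by each rater's marginals."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Example: two annotators label 10 traces pass/fail, agreeing on 8.
a = ["pass", "pass", "fail", "pass", "fail",
     "pass", "pass", "fail", "pass", "pass"]
b = ["pass", "fail", "fail", "pass", "fail",
     "pass", "pass", "pass", "pass", "pass"]
print(round(cohens_kappa(a, b), 3))  # 0.474: moderate agreement
```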
Our team values Velocity, World Class performance, and a Direct and Kind approach with No Drama.

Skills & Technologies

Python
Redis
Elasticsearch
Remote
Degree Required


