Research Scientist, Benchmarks & Evaluations

Protege Inc.

Job Overview

Location: Remote

Job Type: Full-time

Full Job Description

📋 Description

• Design benchmarks and evaluations that meaningfully distinguish capability levels across frontier AI models, including agentic, reasoning-heavy, and domain-specific systems in healthcare, finance, and scientific domains.
• Validate evaluations rigorously by running human baselines, analyzing inter-rater reliability, studying the impact of elicitation and scaffolding on model performance, and quantifying signal versus noise in evaluation results.
• Develop the scientific foundation for evaluation at Protege, applying item response theory, contamination analysis, predictive validity studies, and statistical frameworks that account for uncertainty in model comparisons.
• Conduct evaluations on current frontier models in collaboration with AI labs, enterprises, and government partners to ensure real-world relevance and practical utility.
• Publish research that establishes Protege as the standard-setter for trustworthy evaluation data and contributes to the broader AI community’s understanding of high-quality evaluation design.
• Translate research findings into production evaluation datasets by partnering closely with data and engineering teams to ensure scalability, usability, and integration into Protege’s platform.
• Own the statistical machinery for annotator trustworthiness, determining which annotators are reliable on which tasks, and translating this into calibrated trust scores that customers can rely on.
• Design and oversee labeling protocols for outsourced annotation vendors, ensuring high-quality, bias-aware, and statistically valid human-generated evaluation data.
• Analyze annotator bias, calibration, and performance using agreement statistics and quality control metrics to maintain the integrity of evaluation datasets.
• Communicate complex technical findings to diverse audiences including frontier labs, enterprise customers, and policymakers through clear, actionable narratives and documentation.
• Operate with a bias toward velocity, identifying which evaluation pipelines require production-grade rigor and which can be iterated quickly to deliver reliable results under tight timelines.
• Maintain alignment with Protege’s core values: integrity, resilience, velocity, kindness, candor, and shared ownership in all research and operational decisions.

🎯 Requirements

• Degree in a quantitative field such as applied econometrics, quantitative finance, computer science, engineering, statistics, or mathematics (PhD preferred; MS or BS plus equivalent industry experience also considered).
• Hands-on experience evaluating LLMs, agents, or other ML systems, including proficiency in prompting, scaffolding, and using researcher tooling to run evaluations at scale.
• Experience with annotator quality control, inter-rater reliability, labeling protocol design, and reasoning about annotator bias and calibration.
• Excellent scientific writing and communication skills, with a proven ability to synthesize technical findings into narratives usable by frontier labs, enterprises, and policymakers.
• Demonstrated bias toward velocity: the ability to tell which pipelines need production-grade rigor and which can stay scrappy, and to deliver reliable results quickly.
• Experience with statistical models of annotator skill (e.g., Dawid-Skene, MACE, IRT-style approaches) or running large expert-annotator panels in regulated domains.

🏖️ Benefits

• Opportunity to shape the future of AI evaluation standards through publishable research with real-world impact.
• Work with world-class investors and partner with leading AI teams at the frontier of the industry.
• High-trust, fast-moving culture built for individuals who thrive on ambiguity and own outcomes.
• Direct influence on product development through close collaboration with data and engineering teams.
• Competitive compensation and equity package aligned with company growth and market leadership.
• Fully remote work environment with flexibility to operate across time zones.

Skills & Technologies

Remote

Degree Required

About Protege Inc.

Protege is a career development platform that helps early-career talent connect directly with industry mentors and secure paid apprenticeships. The company partners with employers to create short-term, project-based experiences that give participants real work opportunities while companies evaluate candidates for full-time roles. Its marketplace offers mentorship, skill-building projects, and application tools designed to reduce hiring bias and widen access to competitive industries such as tech, finance, and media. Founded in 2020 and headquartered in New York City, Protege has facilitated thousands of placements and aims to replace traditional campus recruiting with scalable experiential hiring programs.
