
Job Overview
Location
Remote Work (USA)
Job Type
Full-time
Category
Data Science
Date Posted
May 17, 2026
Full Job Description
📋 Description
- Design and develop evaluation benchmarks for multimodal foundation models across text-image, text-audio, text-video, or cross-modal retrieval combinations, defining task formats, annotation guidelines, scoring criteria, and coverage dimensions.
- Execute benchmarks against multimodal models, analyze performance patterns, identify failure modes, and synthesize findings into clear, actionable research summaries and recommendations.
- Investigate and compare automated scoring approaches for multimodal outputs, including model-as-judge methods, reference-free metrics, and human alignment studies, assessing tradeoffs in reliability, validity, cost, and scalability.
- Contribute to the collection, filtering, and quality review of multimodal evaluation datasets, including designing annotation schemes and conducting inter-rater reliability analysis.
- Survey the state of the art in multimodal evaluation and benchmarking, identify gaps in existing benchmark coverage, and propose novel evaluation methodologies grounded in academic literature.
- Produce high-quality internal research write-ups, benchmark datasheets, and presentation-ready summaries of findings tailored for both technical and non-technical audiences.
- Focus on one or more primary areas: vision-language evaluation (e.g., image captioning, visual question answering, document understanding, chart reasoning), audio-speech-language benchmarking (e.g., spoken language comprehension, audio captioning), video understanding benchmarks (e.g., temporal reasoning, video QA, video-text retrieval), cross-modal consistency and robustness testing under perturbations or distribution shifts, or automated multimodal scoring via judge-model pipelines.
- Work with multimodal models and datasets using Python, PyTorch, and Hugging Face Transformers for data processing, model inference, and quantitative analysis.
- Apply statistical analysis to interpret benchmark results, including understanding variance, significance, and limitations of evaluation conclusions.
- Collaborate with senior research scientists and ML engineers on frontier AI evaluation problems within an integrated ecosystem of 1.8 million domain experts and 150+ PhDs.
- Engage with enterprise AI workflows and customer-facing research consulting as part of an applied research team focused on reducing GenAI costs and accelerating deployment.
- Document all research activities with precision to support reproducibility, publication potential, and open-source benchmark releases.
- Communicate complex technical findings through structured written reports and presentations to diverse internal stakeholders.
🎯 Requirements
- Currently enrolled in an MS or PhD program in Computer Science, Machine Learning, Statistics, AI, Linguistics, or a closely related quantitative field.
- Coursework, research projects, or hands-on experience with multimodal models, vision-language systems, or NLP, with familiarity with at least one non-text modality (image, audio, or video).
- Exposure to model evaluation concepts such as benchmark design, metric selection, or experimental comparison through academic or internship work.
- Solid Python skills for data processing, model inference, and quantitative analysis; working experience with PyTorch or Hugging Face Transformers.
- Comfort with basic statistical analysis including understanding variance, significance, and limitations of benchmark conclusions.
- Ability to write clearly and present findings in an organized, audience-appropriate manner.
🏖️ Benefits
- Mentorship from senior research scientists and ML engineers working on frontier AI evaluation problems.
- Ownership of a focused, publishable research project with real-world impact on how leading AI models are evaluated.
- Exposure to enterprise AI workflows, customer-facing research consulting, and cross-functional applied research teams.
- Potential co-authorship on publications or open-source benchmark releases upon completion of high-quality work.
- A competitive internship stipend of $40/hr and a flexible hybrid/remote working arrangement.
About Centific Global Technologies Pte. Ltd.
Centific is a data-centric AI services company providing data collection, annotation, and model validation solutions to enterprises and technology vendors. It operates a global crowd platform that combines human intelligence with automation to prepare, curate, and test datasets for computer vision, NLP, and generative AI applications. The company supports full AI lifecycle needs, from training data to reinforcement learning and model safety, serving industries including retail, automotive, healthcare, and technology. Headquartered in Singapore, Centific maintains delivery centers across Asia, Europe, and North America.