This job has expired

This position was posted on February 25, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

AI Benchmark Engineer - Native Language Specialist | Hindi

Lilt Production

Job Overview

Location

India (Remote)

Job Type

Contract

Full Job Description

📋 Description

• Lilt Production works on AI-driven advances in how the world communicates. We are building a rigorously verifiable evaluation suite of Terminal-Bench tasks designed to test large language models (LLMs) on complex, multilingual software challenges. Our mission is to quantify and improve the multilingual robustness of these models. This means assessing how prompt language variations affect their performance, how well they process non-English data, and how they handle locale and encoding edge cases in typical terminal workflows.
• As an AI Benchmark Engineer - Native Language Specialist for Hindi, you will be an integral part of this effort. We are seeking experienced software engineers who are native Hindi speakers, with a deep understanding of the language and strong technical skills. Your primary role will be to design, develop, and rigorously validate these specialized benchmarks. You will craft high-signal, high-quality tasks that genuinely test an AI model's ability to work in multilingual environments, without relying on English as a translation crutch. Your expertise will directly contribute to building more capable and globally relevant AI.
• Your responsibilities will span the entire lifecycle of benchmark creation and refinement. You will engage in 'Task Engineering,' focusing on evaluating the capabilities of advanced Coding Agents. This involves understanding how these agents interact with code and text in various languages.
• A significant part of your role will be 'Asset Creation.' You will construct realistic and challenging task environments, using and generating datasets and files exclusively in Hindi. Keeping these native-language assets free from English influence is crucial, because it is what makes them an authentic measure of the AI's multilingual handling capabilities.
• You will also be deeply involved in 'Prompting & Translation' analysis, specifically identifying and documenting failure points where AI models falter when operating in your native Hindi. This requires a keen eye for linguistic nuances and an understanding of how AI interprets and generates text in non-English contexts.
• Your role also covers 'Implementation & Verification.' You will support the development of robust reference implementations for these benchmarks and write reliable, deterministic verifier scripts that automatically assess LLM performance on the created tasks. Rubric-based judging will be kept to a minimum in favor of objective, script-driven validation wherever possible.
• 'Calibration & Execution' is another key area. You will meticulously analyze execution logs generated from running benchmarks against various LLM tiers (such as Haiku, Sonnet, and Opus). Based on this analysis, you will calibrate the difficulty of tasks, ranging from 'Easy' to 'Very Hard,' ensuring a comprehensive and graded evaluation of model capabilities using standard Terminal-Bench run configurations.
• Finally, you will take part in our 'Quality Assurance' process, a multi-layered human quality control system comprising creation review, human review, calibration review, and final audit. It works alongside automated LLM-based checks to ensure the fairness, grammatical accuracy, and overall integrity of our benchmarks. Your native language expertise is essential to keeping these benchmarks culturally and linguistically sound.
• This is a remote, freelance opportunity, offering flexibility and the chance to work from anywhere in India. You will be joining a dynamic team dedicated to advancing AI's global reach and applicability. Lilt's mission is to democratize information, making it accessible to everyone, regardless of their spoken language. By joining us, you become part of a global community that values innovation, excellence, and the unique contributions of each member. Our collective expertise in multilingual AI and human-verified services empowers enterprises, governments, and AI developers worldwide. Embrace the opportunity to earn, learn, and contribute to the advancement of human knowledge while working on diverse and impactful projects.
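
As a rough illustration of the deterministic, script-driven verification described under 'Implementation & Verification,' the sketch below checks a model's output bytes against expected Hindi text. It is only an assumption about what such a verifier might look like, not Lilt's or Terminal-Bench's actual format; the function names normalize and verify are hypothetical. It rejects output that is not valid UTF-8 and compares text after Unicode NFC normalization, so that canonically equivalent Devanagari sequences are treated as equal.

```python
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical helper: NFC-normalize and strip trailing whitespace per line."""
    lines = unicodedata.normalize("NFC", text).splitlines()
    return "\n".join(line.rstrip() for line in lines).strip()


def verify(output_bytes: bytes, expected: str) -> bool:
    """Hypothetical verifier: pass only if the output is valid UTF-8
    and equals the expected text after normalization."""
    try:
        decoded = output_bytes.decode("utf-8")  # strict decoding by default
    except UnicodeDecodeError:
        # Wrong encoding (for example UTF-16 bytes) is a deterministic failure.
        return False
    return normalize(decoded) == normalize(expected)
```

For example, verify("नमस्ते दुनिया\n".encode("utf-8"), "नमस्ते दुनिया") passes, while the same text encoded as UTF-16 fails at the decoding step. Because the check involves no randomness or model-based judging, repeated runs on the same output always give the same result.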

Skills & Technologies

Python

Remote

About Lilt Production

Lilt Production is a full-service video production studio based in Paris, France, creating commercial, corporate, and branded content for agencies and global brands. Services span concept development, live-action filming, motion graphics, post-production, color grading, and localized adaptations. The company operates a bilingual French-English team and works across Europe, the Middle East, and Africa, emphasizing cinematic storytelling and contemporary visual aesthetics for broadcast, digital, and social media distribution.