This job has expired

This position was posted on February 25, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

AI Benchmark Engineer - Native Language Specialist | Marathi

Lilt Production

Job Overview

Location

India (Remote)

Job Type

Contract

Full Job Description

📋 Description

• Lilt Production works to make information accessible to everyone, regardless of the language they speak. We are developing a verifiable evaluation suite of Terminal-Bench tasks designed to test large language models (LLMs) on complex multilingual software challenges. Our objective is to assess the multilingual robustness of these models, focusing on prompt language effects, non-English data processing, and locale and encoding edge cases within terminal workflows.
• To achieve this, we are seeking experienced native-speaking software engineers with a deep understanding of their respective languages and strong technical skills. You will design, build, and validate these benchmarks. Your primary focus will be creating high-signal, high-quality tasks that genuinely test a model's ability to work in multilingual environments, without relying on English translation as a crutch.
• This role involves significant "Task Engineering" for evaluating the capabilities of coding agents. You will conceptualize and define the specific challenges these AI agents face, ensuring they reflect real-world multilingual software development scenarios.
• A core component of your work will be "Asset Creation." You will build realistic and challenging task environments using datasets and files exclusively in your native language. These assets must remain untranslated, so that the tasks measure how well the AI handles native-language inputs and contexts without defaulting to English.
• "Prompting & Translation" will be another crucial area. You will identify failure points where AI models falter when operating in your native language. This calls for creative prompting and a keen eye for linguistic nuances that might trip up an LLM.
• Your role will extend to "Implementation & Verification." You will support the development of robust solutions, including reference implementations for tasks. You will also write reliable, deterministic verifier scripts that objectively assess the performance of the AI models. Rubric-based judging will be used only when strictly necessary; the emphasis is on objective, script-driven verification.
• "Calibration & Execution" involves careful analysis of execution logs. You will calibrate the difficulty of tasks, ranging from Easy to Very Hard, using standard Terminal-Bench run configurations. Calibration is performed against several model tiers, including Haiku, Sonnet, and Opus, to understand performance across different LLM capabilities.
• "Quality Assurance" is a non-negotiable aspect of this role. You will participate in a rigorous, multi-layered human quality control process. This includes creation review, human review of generated outputs, calibration review, and final audit. This process, combined with automated LLM-based checks, ensures the fairness, grammatical accuracy, and overall integrity of the benchmarks.
• This is a remote, freelance opportunity, offering flexibility and the chance to contribute to groundbreaking AI research from anywhere in India. You will be working with a global community that thrives on innovation and excellence, contributing to LILT's mission to deliver multilingual AI and human-verified services to a diverse range of clients, including Enterprises, Governments, and AI Developers worldwide.
• By joining LILT, you will have the opportunity to earn money, have fun, and advance human knowledge. You will work on diverse projects, build your professional network, and get paid quickly and fairly, all through a streamlined application process tailored to your unique expertise. We are committed to a fair, inclusive, and transparent hiring process, and while we may use AI tools to assist in evaluation, all final hiring decisions are made by people.
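
For candidates wondering what a "deterministic verifier script" might look like in practice, here is a minimal Python sketch. It is only an illustration under stated assumptions, not Lilt's or Terminal-Bench's actual verifier: the function names `normalize` and `verify_output`, the line-by-line output format, and the sample Marathi strings are all hypothetical. It shows one way to check an agent's output without depending on the system locale, by requiring strict UTF-8 and comparing Unicode-normalized (NFC) lines.

```python
import unicodedata


def normalize(text: str) -> str:
    """Return text in NFC form with surrounding whitespace removed."""
    return unicodedata.normalize("NFC", text).strip()


def verify_output(raw: bytes, expected_lines: list[str]) -> bool:
    """Return True if raw is valid UTF-8 and its non-empty lines match.

    Hypothetical check: lines are compared after NFC normalization, so the
    result does not depend on the system locale or on how the characters
    were composed by the tool that wrote them.
    """
    try:
        text = raw.decode("utf-8")  # strict decoding: invalid bytes fail
    except UnicodeDecodeError:
        return False
    actual = [normalize(line) for line in text.splitlines() if line.strip()]
    expected = [normalize(line) for line in expected_lines]
    return actual == expected


if __name__ == "__main__":
    expected = ["नमस्कार", "धन्यवाद"]  # sample data, assumed for illustration
    good = "नमस्कार\nधन्यवाद\n".encode("utf-8")
    wrong_encoding = "नमस्कार\nधन्यवाद\n".encode("utf-16")
    print(verify_output(good, expected))
    print(verify_output(wrong_encoding, expected))
```

Because the check uses only byte decoding and normalized string equality, the same input always gives the same verdict, which is the property the role description asks for. A real verifier would likely also check exit codes, file locations, and task-specific content.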

Skills & Technologies

Python

Remote


About Lilt Production

Lilt Production is a full-service video production studio based in Paris, France, creating commercial, corporate, and branded content for agencies and global brands. Services span concept development, live-action filming, motion graphics, post-production, color grading, and localized adaptations. The company operates a bilingual French-English team and works across Europe, the Middle East, and Africa, emphasizing cinematic storytelling and contemporary visual aesthetics for broadcast, digital, and social media distribution.
