This job has expired

This position was posted on February 25, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

AI Benchmark Engineer - Native Language Specialist | Japanese

Lilt Production

Job Overview

Location

Japan (Remote)

Job Type

Contract

Full Job Description

📋 Description

• Lilt Production's mission is to make all global knowledge accessible to everyone, regardless of their native language. We are building a rigorously verifiable evaluation suite of Terminal-Bench tasks designed to probe the limits of large language models (LLMs) on complex multilingual software challenges. Our core objective is to measure model robustness across multilingual scenarios, focusing on prompt-language effects, the processing of non-English data, and locale and encoding edge cases within terminal workflows.
• In this role, you will design, build, and validate these benchmarks. As an experienced software engineer and native speaker, your primary focus will be creating high-signal, high-quality tasks. These tasks must genuinely challenge an AI model's capacity to operate within multilingual environments without relying on English translation as a crutch, so that the benchmarks reflect real-world multilingual performance.
• Your responsibilities span several areas. In 'Task Engineering,' you will focus on the evaluation of coding agents, which involves understanding how AI models interact with and execute code-based tasks in various linguistic contexts.
• In 'Asset Creation,' you will build realistic and challenging task environments using datasets and files exclusively in your native language. These assets must remain untranslated to provide a true test of the AI's multilingual handling capabilities.
• In 'Prompting & Translation' analysis, you will seek out and identify failure points where AI systems falter when operating in your native language. This work is crucial for uncovering the limitations of current LLMs.
• In 'Implementation & Verification,' you will support the development of robust solutions, often in the form of reference implementations, and author reliable, deterministic verifier scripts. The emphasis is on algorithmic verification, with rubric-based judging reserved for situations where it is strictly necessary, to ensure objectivity and reproducibility.
• In 'Calibration & Execution,' you will analyze execution logs from benchmark runs and calibrate task difficulty from 'Easy' to 'Very Hard.' Calibration uses standard Terminal-Bench run configurations against various model tiers, such as Haiku, Sonnet, and Opus, to provide a comprehensive performance profile.
• Finally, in 'Quality Assurance,' you will participate in a stringent, multi-layered human quality control framework. This framework comprises creation review, human review, calibration review, and audit stages, working in tandem with automated LLM-based checks. Your involvement will help ensure the fairness, grammatical accuracy, and overall integrity of our benchmarks.
• This is an opportunity to apply your linguistic and technical expertise to shape the future of AI evaluation. You will work remotely, contributing to a global effort to make information universally accessible. Join a community that values innovation, excellence, and the advancement of human knowledge while earning money, having fun, and building your professional network.
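To give a sense of what a deterministic, algorithmic verifier for an encoding edge case might look like, here is a minimal Python sketch. It is an illustration only, not Lilt's actual tooling: the task (converting a Shift_JIS file to UTF-8), the function name `verify_utf8_output`, and the sample text are all hypothetical. The check decodes strictly as UTF-8 and compares after NFC normalization, so canonically equivalent forms of the same Japanese text pass while half-width and full-width katakana stay distinct.

```python
import unicodedata


def verify_utf8_output(actual: bytes, expected: str) -> bool:
    """Hypothetical verifier: pass only if `actual` is valid UTF-8 and
    equals `expected` after NFC normalization (trailing newlines ignored)."""
    try:
        text = actual.decode("utf-8")  # strict decoding; no replacement characters
    except UnicodeDecodeError:
        return False  # e.g. the agent left the file in a legacy encoding
    normalized_actual = unicodedata.normalize("NFC", text.rstrip("\n"))
    normalized_expected = unicodedata.normalize("NFC", expected.rstrip("\n"))
    return normalized_actual == normalized_expected


if __name__ == "__main__":
    expected = "東京都,売上"  # sample data, assumed for illustration
    legacy = expected.encode("shift_jis")

    # A correct conversion from Shift_JIS to UTF-8.
    print(verify_utf8_output(legacy.decode("shift_jis").encode("utf-8"), expected))

    # Unconverted Shift_JIS bytes are not valid UTF-8.
    print(verify_utf8_output(legacy, expected))
```

Because the check has no randomness and no model in the loop, the same output always yields the same verdict, which is the property the posting asks of verifier scripts.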

Skills & Technologies

Python

Remote

About Lilt Production

Lilt Production is a full-service video production studio based in Paris, France, creating commercial, corporate, and branded content for agencies and global brands. Services span concept development, live-action filming, motion graphics, post-production, color grading, and localized adaptations. The company operates a bilingual French-English team and works across Europe, the Middle East, and Africa, emphasizing cinematic storytelling and contemporary visual aesthetics for broadcast, digital, and social media distribution.
