AI Benchmark Engineer - Native Language Specialist | Japanese

Job Overview

Location: Japan (Remote)
Job Type: Contract
Category: Software Engineer
Date Posted: February 25, 2026

Full Job Description

Description

  • Lilt Production is at the forefront of revolutionizing how the world interacts with information, driven by a mission to make all global knowledge accessible to everyone, regardless of their native tongue. We are building a sophisticated and rigorously verifiable evaluation suite of Terminal-Bench tasks specifically engineered to probe the absolute limits of large language models (LLMs) when confronted with complex multilingual software challenges. Our core objective is to meticulously measure the robustness of these models across a spectrum of multilingual scenarios, focusing on nuanced prompt language effects, the intricate processing of non-English data, and the identification of complex locale and encoding edge cases within terminal workflows.
  • In this pivotal role, you will be instrumental in designing, constructing, and validating these critical benchmarks. As an experienced native-speaking software engineer, your primary focus will be on creating high-signal, high-quality tasks. These tasks must genuinely challenge an AI model's capacity to navigate and operate effectively within multilingual environments, deliberately avoiding reliance on English translation as a crutch. This ensures that the benchmarks accurately reflect real-world multilingual performance.
  • Your responsibilities will encompass a broad range of critical activities. You will engage in 'Task Engineering,' specifically focusing on the evaluation of coding agents. This involves understanding how AI models interact with and execute code-based tasks in various linguistic contexts.
  • A significant part of your contribution will be 'Asset Creation.' You will be responsible for building realistic and challenging task environments. This will involve utilizing datasets and files exclusively in your native language. It is paramount that these assets remain untranslated to provide a true test of the AI's multilingual handling capabilities.
  • You will also be deeply involved in 'Prompting & Translation' analysis, actively seeking out and identifying failure points where AI systems falter when operating in your native language. This detective work is crucial for uncovering the limitations of current LLMs.
  • Furthermore, you will contribute to 'Implementation & Verification.' This includes supporting the development of robust solutions, often in the form of reference implementations, and authoring highly reliable, deterministic verifier scripts. The emphasis will be on algorithmic verification, with rubric-based judging reserved only for situations where it is strictly necessary, ensuring objectivity and reproducibility.
  • 'Calibration & Execution' is another key area. You will meticulously analyze execution logs from benchmark runs and calibrate the difficulty of tasks, ranging from 'Easy' to 'Very Hard.' This calibration will be performed using standard Terminal-Bench run configurations against various model tiers, such as Haiku, Sonnet, and Opus, to provide a comprehensive performance profile.
  • Finally, you will be a vital part of our 'Quality Assurance' process. This involves participating in a stringent, multi-layered human quality control framework. This framework includes creation review, human review, calibration review, and audit stages, working in tandem with automated LLM-based checks. Your involvement will ensure the utmost fairness, grammatical accuracy, and overall integrity of our benchmarks.
  • This is a unique opportunity to leverage your deep linguistic and technical expertise to shape the future of AI evaluation. You will work remotely, contributing to a global effort that is making information universally accessible. Join a community that values innovation, excellence, and the advancement of human knowledge, where you can earn money, enjoy the work, and build your professional network.
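
For candidates wondering what a "deterministic verifier script" for a Japanese-language task might look like, here is a minimal Python sketch. It is an illustration only, not Lilt Production's actual tooling: the function name verify_output, the choice of strict UTF-8 decoding, and the NFC normalization step are all assumptions made for this example.

```python
import unicodedata


def verify_output(raw: bytes, expected_lines: list[str]) -> tuple[bool, str]:
    """Hypothetical verifier: compare an agent's output bytes to expected text.

    Assumptions for this sketch: the task requires UTF-8 output, and
    composed/decomposed forms of the same character count as equal.
    """
    try:
        # Strict decoding rejects Shift_JIS or other legacy encodings.
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return False, f"not valid UTF-8: {exc.reason}"

    # Normalize both sides to NFC so that, e.g., "か" + combining dakuten
    # compares equal to the precomposed "が".
    actual = [unicodedata.normalize("NFC", line) for line in text.splitlines()]
    expected = [unicodedata.normalize("NFC", line) for line in expected_lines]

    if actual != expected:
        return False, "content mismatch"
    return True, "ok"
```

The check involves no model judgment and no randomness, so the same output always receives the same verdict, which is the property the 'Implementation & Verification' responsibility emphasizes.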

Skills & Technologies

Python
Remote

About Lilt Production

Lilt Production is a full-service video production studio based in Paris, France, creating commercial, corporate, and branded content for agencies and global brands. Services span concept development, live-action filming, motion graphics, post-production, color grading, and localized adaptations. The company operates a bilingual French-English team and works across Europe, the Middle East, and Africa, emphasizing cinematic storytelling and contemporary visual aesthetics for broadcast, digital, and social media distribution.
