
Job Overview
Location: Japan (Remote)
Job Type: Contract
Category: Software Engineer
Date Posted: February 25, 2026
Full Job Description
📋 Description
- Lilt Production's mission is to make all global knowledge accessible to everyone, regardless of native language. We are building a rigorously verifiable evaluation suite of Terminal-Bench tasks designed to probe the limits of large language models (LLMs) on complex multilingual software challenges. The goal is to measure model robustness across multilingual scenarios: prompt-language effects, processing of non-English data, and locale and encoding edge cases in terminal workflows.
- In this role, you will design, build, and validate these benchmarks. As an experienced software engineer and native speaker, you will focus on creating high-signal, high-quality tasks that genuinely test an AI model's ability to operate in multilingual environments without falling back on English translation, so that the benchmarks reflect real-world multilingual performance.
- 'Task Engineering': You will design tasks for evaluating coding agents, which requires understanding how AI models interact with and execute code-based tasks in different linguistic contexts.
- 'Asset Creation': You will build realistic, challenging task environments using datasets and files exclusively in your native language. These assets must remain untranslated so that they truly test the AI's multilingual handling.
- 'Prompting & Translation': You will analyze prompts and translations to identify the points where AI systems fail when operating in your native language, uncovering the limitations of current LLMs.
- 'Implementation & Verification': You will support the development of reference implementations and author reliable, deterministic verifier scripts. Verification should be algorithmic; rubric-based judging is reserved for cases where it is strictly necessary, to keep results objective and reproducible.
- 'Calibration & Execution': You will analyze execution logs from benchmark runs and calibrate task difficulty from 'Easy' to 'Very Hard', using standard Terminal-Bench run configurations against several model tiers, such as Haiku, Sonnet, and Opus.
- 'Quality Assurance': You will take part in a multi-layered human quality control framework comprising creation review, human review, calibration review, and audit stages, which works alongside automated LLM-based checks to ensure the fairness, grammatical accuracy, and overall integrity of the benchmarks.
- This is an opportunity to apply your linguistic and technical expertise to shaping how AI is evaluated. You will work remotely as part of a global effort to make information universally accessible, earning money and building your professional network in a community that values innovation and excellence.
Skills & Technologies
Python
Remote
About Lilt Production
Lilt Production is a full-service video production studio based in Paris, France, creating commercial, corporate, and branded content for agencies and global brands. Services span concept development, live-action filming, motion graphics, post-production, color grading, and localized adaptations. The company operates a bilingual French-English team and works across Europe, the Middle East, and Africa, emphasizing cinematic storytelling and contemporary visual aesthetics for broadcast, digital, and social media distribution.
