
Job Overview
Location
Germany (Remote)
Job Type
Contract
Category
Data Scientist
Date Posted
February 25, 2026
Full Job Description
📋 Description
- Lilt is at the forefront of transforming global communication through AI, and we are seeking a highly skilled and experienced AI Benchmark Engineer, specializing as a Native Language Specialist for German, to join our innovative team. This is a unique, remote, freelance opportunity to play a pivotal role in building a cutting-edge evaluation suite for Large Language Models (LLMs). You will be instrumental in designing, developing, and validating a rigorous, verifiable set of Terminal-Bench tasks specifically engineered to push the boundaries of LLM capabilities in multilingual software challenges.
- Our core objective is to meticulously measure the multilingual robustness of LLMs. This involves assessing their performance across various dimensions, including the subtle effects of prompt language, their ability to process non-English data effectively, and their resilience against complex locale and encoding edge cases encountered in terminal workflows. As a Native Language Specialist, your deep linguistic and cultural understanding of German will be paramount in creating tasks that truly reflect real-world multilingual complexities, moving beyond simple English-centric evaluations.
- Your primary responsibility will be **Task Engineering**, focusing on the evaluation of Coding Agents. This involves a sophisticated understanding of how LLMs interact with and generate code within diverse linguistic contexts. You will be tasked with identifying and exploiting the limitations of these agents when faced with German language inputs and outputs, ensuring our benchmarks are not easily circumvented by English-based training data.
- A significant part of your role will be **Asset Creation**. This entails building realistic and challenging task environments. You will leverage datasets, files, and other resources exclusively in German. The critical requirement here is that these assets must remain untranslated, forcing the LLMs to operate natively within the German language ecosystem. This direct engagement with German-language assets is key to genuinely measuring multilingual handling capabilities.
- You will also be deeply involved in **Prompting & Translation** analysis, specifically focusing on uncovering failure points where AI models falter when operating in German. This requires a keen eye for linguistic nuances, idiomatic expressions, and the subtle ways in which meaning can be lost or distorted when models are not truly proficient in the target language.
- Furthermore, your role extends to **Implementation & Verification**. You will support the development of robust reference implementations for the benchmark tasks. Crucially, you will write highly reliable and deterministic verifier scripts. These scripts will objectively assess the LLM's performance, minimizing reliance on subjective rubric-based judging, ensuring consistency and accuracy in our evaluations. Your ability to translate complex linguistic requirements into precise, executable code will be vital.
- **Calibration & Execution** is another key area. You will meticulously analyze execution logs from benchmark runs. This analysis will inform the calibration of task difficulty, ranging from 'Easy' to 'Very Hard', using standard Terminal-Bench run configurations. You will execute these benchmarks against various LLM tiers, such as Haiku, Sonnet, and Opus, to provide a comprehensive performance profile.
- **Quality Assurance** is non-negotiable. You will be an integral part of a rigorous, four-layer human quality control process. This process includes creation review, human review of benchmark outputs, calibration review, and final audit. This is complemented by automated LLM-based checks, all designed to guarantee the fairness, grammatical accuracy, and overall integrity of our benchmarks. Your native German expertise will be essential in upholding these high standards.
- This role demands a deep technical understanding of the pitfalls inherent in multilingual text processing. This includes, but is not limited to, robustness in encoding/decoding, Unicode normalization, locale-dependent conventions (such as collation, casing, and non-Gregorian dates), text I/O, toolchain interoperability, and safe string operations. For German, specific considerations might include handling of special characters, umlauts, and ß, ensuring accurate processing and rendering in various contexts.
- By contributing to this project, you will be directly impacting the future of AI development, enabling the creation of more robust, reliable, and truly multilingual AI systems. You will work with a passionate team dedicated to advancing human knowledge and making information accessible to everyone, regardless of the language they speak. This is an opportunity to earn money, have fun, and contribute to a significant technological transformation from the comfort of your remote workspace.
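
To make the German-specific pitfalls named above (Unicode normalization, casing, umlauts, and ß) more concrete, here is a minimal Python sketch of the kind of deterministic comparison a verifier script might perform. It is an illustration only, not Lilt's actual verifier: the function names `canonical` and `verify_output` are hypothetical, and a real task may deliberately require stricter, byte-exact matching instead.

```python
import unicodedata


def canonical(text: str) -> str:
    """Return a comparison key: casefolded, then NFC-normalized.

    Assumption for this sketch: the task treats case variants and
    composed/decomposed umlauts as equivalent answers.
    """
    # casefold() maps the German sharp s (U+00DF) to "ss";
    # NFC merges a base letter plus combining diaeresis into one code point.
    return unicodedata.normalize("NFC", text.casefold())


def verify_output(expected: str, actual: str) -> bool:
    """Deterministic check with no locale or environment dependence."""
    return canonical(expected) == canonical(actual)


# The same word, once with a precomposed o-umlaut (U+00F6) and once
# with a plain "o" followed by a combining diaeresis (U+0308).
composed = "Gr\u00f6\u00dfe"
decomposed = "Gro\u0308\u00dfe"

same_raw = composed == decomposed                      # naive comparison
same_canonical = verify_output(composed, decomposed)   # normalized comparison
sharp_s_match = verify_output("Stra\u00dfe", "STRASSE")
umlaut_dropped = verify_output(composed, "Grosse")
```

The naive comparison fails even though both strings render identically, which is the sort of edge case such tasks are meant to surface. The normalized comparison accepts both spellings and the uppercase "STRASSE" form, while still rejecting "Grosse", where the umlaut was dropped rather than re-encoded.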
Skills & Technologies
Python
Remote
About Lilt Production
Lilt Production is a full-service video production studio based in Paris, France, creating commercial, corporate, and branded content for agencies and global brands. Services span concept development, live-action filming, motion graphics, post-production, color grading, and localized adaptations. The company operates a bilingual French-English team and works across Europe, the Middle East, and Africa, emphasizing cinematic storytelling and contemporary visual aesthetics for broadcast, digital, and social media distribution.
