This job has expired

This position was posted on February 25, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

AI Benchmark Engineer - Native Language Specialist | German

Lilt Production

Job Overview

Location

Germany (Remote)

Job Type

Contract

Full Job Description

📋 Description

• Lilt is at the forefront of transforming global communication through AI, and we are seeking a highly skilled and experienced AI Benchmark Engineer, specializing as a Native Language Specialist for German, to join our innovative team. This is a unique, remote, freelance opportunity to play a pivotal role in building a cutting-edge evaluation suite for Large Language Models (LLMs). You will be instrumental in designing, developing, and validating a rigorous, verifiable set of Terminal-Bench tasks specifically engineered to push the boundaries of LLM capabilities in multilingual software challenges.
• Our core objective is to meticulously measure the multilingual robustness of LLMs. This involves assessing their performance across various dimensions, including the subtle effects of prompt language, their ability to process non-English data effectively, and their resilience against complex locale and encoding edge cases encountered in terminal workflows. As a Native Language Specialist, your deep linguistic and cultural understanding of German will be paramount in creating tasks that truly reflect real-world multilingual complexities, moving beyond simple English-centric evaluations.
• Your primary responsibility will be **Task Engineering**, focusing on the evaluation of Coding Agents. This involves a sophisticated understanding of how LLMs interact with and generate code within diverse linguistic contexts. You will be tasked with identifying and exploiting the limitations of these agents when faced with German language inputs and outputs, ensuring our benchmarks are not easily circumvented by English-based training data.
• A significant part of your role will be **Asset Creation**. This entails building realistic and challenging task environments. You will leverage datasets, files, and other resources exclusively in German. The critical requirement here is that these assets must remain untranslated, forcing the LLMs to operate natively within the German language ecosystem. This direct engagement with German-language assets is key to genuinely measuring multilingual handling capabilities.
• You will also be deeply involved in **Prompting & Translation** analysis, specifically focusing on uncovering failure points where AI models falter when operating in German. This requires a keen eye for linguistic nuances, idiomatic expressions, and the subtle ways in which meaning can be lost or distorted when models are not truly proficient in the target language.
• Furthermore, your role extends to **Implementation & Verification**. You will support the development of robust reference implementations for the benchmark tasks. Crucially, you will write reliable, deterministic verifier scripts that objectively assess the LLM's performance. Minimizing reliance on subjective rubric-based judging keeps our evaluations consistent and accurate. Your ability to translate complex linguistic requirements into precise, executable code will be vital.
• **Calibration & Execution** is another key area. You will meticulously analyze execution logs from benchmark runs. This analysis will inform the calibration of task difficulty, ranging from 'Easy' to 'Very Hard', using standard Terminal-Bench run configurations. You will execute these benchmarks against various LLM tiers, such as Haiku, Sonnet, and Opus, to provide a comprehensive performance profile.
• **Quality Assurance** is non-negotiable. You will be an integral part of a rigorous, four-layer human quality control process. This process includes creation review, human review of benchmark outputs, calibration review, and final audit. This is complemented by automated LLM-based checks, all designed to guarantee the fairness, grammatical accuracy, and overall integrity of our benchmarks. Your native German expertise will be essential in upholding these high standards.
• This role demands a deep technical understanding of the pitfalls inherent in multilingual text processing. These include, but are not limited to, robust encoding and decoding, Unicode normalization, locale-dependent conventions (such as collation, casing, and non-Gregorian dates), text I/O, toolchain interoperability, and safe string operations. For German, specific considerations might include the handling of umlauts (ä, ö, ü) and the sharp s (ß), ensuring accurate processing and rendering in various contexts.
• By contributing to this project, you will be directly impacting the future of AI development, enabling the creation of more robust, reliable, and truly multilingual AI systems. You will work with a passionate team dedicated to advancing human knowledge and making information accessible to everyone, regardless of the language they speak. This is an opportunity to earn money, have fun, and contribute to a significant technological transformation from the comfort of your remote workspace.
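As an illustration of the German-specific pitfalls and the deterministic verification described above, here is a minimal Python sketch. It is not part of Terminal-Bench or of any Lilt tooling: the function names `normalize_german` and `verify_output` are hypothetical, and the choice to treat "Straße" and "STRASSE" as equal is an assumption that a real task might deliberately reject.

```python
import unicodedata


def normalize_german(text: str) -> str:
    """Return a comparison key: NFC-normalized, case-folded, NFC again.

    Assumption for this sketch: case and ß/ss differences are ignored.
    casefold() maps both "ß" and "ẞ" to "ss".
    """
    composed = unicodedata.normalize("NFC", text)
    return unicodedata.normalize("NFC", composed.casefold())


def verify_output(actual: bytes, expected: str) -> bool:
    """Hypothetical deterministic verifier for a single text answer.

    Fails closed on anything that is not valid UTF-8 (for example
    Latin-1 bytes), then compares the normalized strings exactly.
    """
    try:
        decoded = actual.decode("utf-8")  # strict decoding, no replacement
    except UnicodeDecodeError:
        return False
    return normalize_german(decoded.strip()) == normalize_german(expected.strip())


if __name__ == "__main__":
    # Uppercasing is not length-preserving for ß.
    print("ß".upper())
    # Decomposed "u" + combining diaeresis matches precomposed "ü".
    print(verify_output("Mu\u0308nchen".encode("utf-8"), "München"))
    # Latin-1 bytes are rejected instead of being silently misread.
    print(verify_output("Größe".encode("latin-1"), "Größe"))
```

A verifier along these lines produces the same verdict on every run because it relies only on strict decoding and exact comparison of normalized strings, with no rubric or model-based judgment involved.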

Skills & Technologies

Python

Remote


About Lilt Production

Lilt Production is a full-service video production studio based in Paris, France, creating commercial, corporate, and branded content for agencies and global brands. Services span concept development, live-action filming, motion graphics, post-production, color grading, and localized adaptations. The company operates a bilingual French-English team and works across Europe, the Middle East, and Africa, emphasizing cinematic storytelling and contemporary visual aesthetics for broadcast, digital, and social media distribution.
