This job has expired

This position was posted on February 25, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

AI Benchmark Engineer - Native Language Specialist | Turkish

Lilt Production

Job Overview

Location

Turkey (Remote)

Job Type

Contract

Full Job Description

Description

• Lilt Production is at the forefront of revolutionizing how the world interacts with information, breaking down language barriers through cutting-edge AI and human-verified services. We are seeking a highly skilled and experienced AI Benchmark Engineer, specializing as a Native Language Specialist for Turkish, to join our innovative team on a remote, freelance basis. This pivotal role is designed for seasoned software engineers who possess a deep understanding of their native language and a strong technical background, enabling them to contribute to the development of a sophisticated evaluation suite for large language models (LLMs).
• Your primary mission will be to engineer, build, and validate rigorous, verifiable benchmark tasks within the Terminal-Bench framework. These tasks are meticulously designed to push the boundaries of LLMs, specifically focusing on their multilingual capabilities. The goal is to measure and understand the robustness of these models when faced with challenges such as prompt language effects, processing non-English data, and navigating complex locale and encoding edge cases inherent in terminal workflows. You will be instrumental in creating high-signal, high-quality tasks that serve as genuine tests of a model's ability to operate effectively in multilingual environments, without the crutch of English translation.
• As a Task Engineer, your responsibilities will extend to evaluating Coding Agents. This involves a creative and analytical approach to designing tasks that expose potential failure points in AI models. You will be responsible for Asset Creation, which entails building realistic task environments. This will involve utilizing datasets and files exclusively in your native language, Turkish. It is crucial that these assets remain untranslated to ensure an authentic assessment of the AI's multilingual handling capabilities. This requires a nuanced understanding of how language is used in practical, technical contexts.
• Prompting and Translation will be a key area of focus, where you will actively seek out and identify failure points in AI performance within the Turkish language context. This involves understanding the subtleties of Turkish grammar, idiomatic expressions, and common phrasing that might trip up an AI.
• You will also be involved in Implementation and Verification: supporting the development of robust solutions, including reference implementations, and writing highly reliable, deterministic verifier scripts. The emphasis is on automated verification, with rubric-based judging reserved for situations where it is strictly necessary, to keep results objective and scalable.
• Calibration and Execution will involve analyzing execution logs from benchmark runs. You will be responsible for calibrating task difficulty, ranging from Easy to Very Hard, using standard Terminal-Bench run configurations. This calibration will be performed against various model tiers, such as Haiku, Sonnet, and Opus, allowing for a granular understanding of LLM performance across different capabilities and price points. This analytical work is critical for refining the benchmark's effectiveness.
• Quality Assurance is paramount in this role. You will participate in a rigorous, four-layer human quality control process. This process includes creation review, human review of generated tasks, calibration review, and final audit. This is complemented by automated LLM-based checks, all designed to ensure the utmost fairness, grammatical accuracy, and overall integrity of the benchmarks. Your native language expertise will be invaluable in ensuring that the benchmarks are not only technically sound but also linguistically accurate and culturally relevant.
• This is a unique opportunity to leverage your software engineering expertise and native Turkish language skills to contribute to the advancement of AI technology. You will work on challenging, impactful projects that directly influence the development and evaluation of next-generation AI models. By joining Lilt, you become part of a global community dedicated to making information accessible to everyone, regardless of the language they speak. We foster an environment of innovation, excellence, and continuous learning, where your contributions are valued and recognized. This remote, freelance position offers the flexibility to work on diverse, paid projects from anywhere, at any time.
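
To make the "locale and encoding edge cases" and "deterministic verifier scripts" mentioned above more concrete, here is a minimal Python sketch of the kind of check such a task might rely on. It is an illustration only: the helper names `tr_lower` and `verify_output` are hypothetical, and nothing here reflects the actual Terminal-Bench verifier interface, which the listing does not describe. The sketch assumes the task expects UTF-8 output and a case-insensitive match under Turkish casing rules (I/ı and İ/i).

```python
import unicodedata

# Python's str.lower() is locale-independent: "I" becomes "i", and "İ" becomes
# "i" followed by U+0307 (combining dot above). Turkish expects I -> ı, İ -> i.
# Hypothetical mapping table used only for this illustration.
_TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})


def tr_lower(text: str) -> str:
    """Lowercase text using Turkish casing rules for the two I pairs."""
    # NFC first, so a decomposed "I" + U+0307 is composed into "İ" before mapping.
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(_TR_LOWER).lower()


def verify_output(actual: bytes, expected: str) -> bool:
    """Deterministic check: strict UTF-8 decode, then Turkish-aware comparison."""
    try:
        decoded = actual.decode("utf-8")
    except UnicodeDecodeError:
        # Output written in a legacy encoding such as cp1254 fails here.
        return False
    return tr_lower(decoded.strip()) == tr_lower(expected.strip())


if __name__ == "__main__":
    print(tr_lower("ISPARTA"), tr_lower("İSTANBUL"))
    print(verify_output("ISPARTA\n".encode("utf-8"), "ısparta"))
    print(verify_output("ısparta".encode("cp1254"), "ısparta"))
```

A check like this returns the same verdict on every run and needs no rubric-based judging, which is the property the listing asks of verifier scripts.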

Skills & Technologies

Python

Remote

About Lilt Production

Lilt Production is a full-service video production studio based in Paris, France, creating commercial, corporate, and branded content for agencies and global brands. Services span concept development, live-action filming, motion graphics, post-production, color grading, and localized adaptations. The company operates a bilingual French-English team and works across Europe, the Middle East, and Africa, emphasizing cinematic storytelling and contemporary visual aesthetics for broadcast, digital, and social media distribution.
