This job has expired

This position was posted on February 25, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

AI Benchmark Engineer - Native Language Specialist | Czech

Lilt Production

Job Overview

Location

Czech Republic (Remote)

Job Type

Contract

Full Job Description

• Lilt is at the forefront of transforming global communication through AI, and we are seeking a highly skilled and experienced AI Benchmark Engineer with native Czech language proficiency to join our innovative team. This is a unique opportunity to contribute to the development of a cutting-edge evaluation suite for Large Language Models (LLMs), specifically focusing on their multilingual capabilities within complex terminal environments. You will play a pivotal role in designing, building, and validating rigorous benchmark tasks that push the boundaries of AI's understanding and processing of non-English languages.
• As a Native Language Specialist for Czech, your primary responsibility will be to engineer high-signal, high-quality tasks that genuinely assess an LLM's ability to handle multilingual software challenges without relying on English translations. This involves a deep dive into the nuances of your native language, identifying failure points, and creating realistic test scenarios that expose the limitations of current AI models. You will be instrumental in ensuring our benchmarks are robust, verifiable, and accurately reflect the real-world performance of LLMs in diverse linguistic contexts.
• Your day-to-day will involve a blend of creative task engineering and meticulous technical implementation. You will be responsible for creating realistic task environments using datasets and files exclusively in Czech. This is a critical aspect of the role, as it ensures that the benchmarks truly measure the model's multilingual handling capabilities rather than its English translation prowess. You will also be involved in sophisticated prompting strategies, actively seeking out and documenting instances where AI models falter in understanding or generating Czech text, thereby uncovering crucial insights into their weaknesses.
• A significant part of your role will be dedicated to implementation and verification. You will support the development of robust reference implementations for the benchmark tasks and write highly reliable, deterministic verifier scripts. These scripts will be the backbone of our evaluation process, ensuring consistent and objective assessment. Rubric-based judging will be used only when strictly necessary; the emphasis is on automated, script-driven verification to maintain scalability and accuracy.
• Furthermore, you will be involved in the calibration and execution phase of the benchmarks. This entails analyzing execution logs from various LLM tiers (such as Haiku, Sonnet, and Opus) and calibrating task difficulty, ranging from 'Easy' to 'Very Hard,' using standard Terminal-Bench run configurations. This analytical work is crucial for understanding model performance across different capabilities and complexities.
• Quality assurance is paramount in this role. You will participate in a rigorous, four-layer human quality control process: task creation, human review of generated content, calibration review to ensure consistency, and a final audit to guarantee benchmark integrity. This process is complemented by automated LLM-based checks, ensuring a comprehensive and fair evaluation system that upholds grammatical accuracy and overall benchmark integrity.
• This is a remote, freelance opportunity, offering flexibility and the chance to work from anywhere in the Czech Republic. You will be contributing to Lilt's mission to make the world's information accessible to everyone, regardless of the language they speak, by enhancing the multilingual capabilities of AI.
• You will leverage your deep technical understanding of multilingual text processing pitfalls, including encoding/decoding robustness, Unicode normalization, locale-dependent conventions (like collation, casing, and non-Gregorian dates), text I/O, toolchain interoperability, and safe string operations. For languages with specific complexities, such as bidirectional text handling (RTL), font fallbacks, and rendering/typography in UI or artifacts, your domain expertise will be invaluable.
• By joining Lilt, you become part of a global community that thrives on innovation and excellence. You will work on diverse projects, earn money, advance human knowledge, and build your professional network in a supportive environment. We are committed to a fair, inclusive, and transparent hiring process, and while we may use AI tools to assist in evaluation, all final hiring decisions are made by people.
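
To give a sense of the deterministic verifier scripts and Unicode normalization pitfalls mentioned above, here is a minimal Python sketch. It is illustrative only and not taken from Lilt's benchmark code: the function names `canonical` and `verify_output` and the sample sentence are assumptions made for this example. It shows why a verifier for Czech text would likely normalize both sides before comparing, since the same visible string can be stored in composed (NFC) or decomposed (NFD) form.

```python
import unicodedata


def canonical(text: str) -> str:
    """Return an NFC-normalized, case-folded copy of text.

    Hypothetical helper: composed and decomposed spellings of the same
    Czech string map to one canonical form.
    """
    return unicodedata.normalize("NFC", text).casefold()


def verify_output(actual: str, expected: str) -> bool:
    """Hypothetical deterministic check: same input always gives same verdict."""
    return canonical(actual.strip()) == canonical(expected.strip())


# A Czech pangram fragment with many diacritics (sample data for this sketch).
composed = "Příliš žluťoučký kůň"
# The same text with each accented letter split into base letter + combining mark.
decomposed = unicodedata.normalize("NFD", composed)

# A naive byte-for-byte comparison treats the two spellings as different.
naive_match = composed == decomposed
# The normalizing verifier treats them as the same answer.
robust_match = verify_output(decomposed, composed)
```

In this sketch `naive_match` is false while `robust_match` is true, which is the kind of gap a script-driven verifier has to close to stay objective across toolchains that emit different normalization forms.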

Skills & Technologies

Python

Remote

About Lilt Production

Lilt Production is a full-service video production studio based in Paris, France, creating commercial, corporate, and branded content for agencies and global brands. Services span concept development, live-action filming, motion graphics, post-production, color grading, and localized adaptations. The company operates a bilingual French-English team and works across Europe, the Middle East, and Africa, emphasizing cinematic storytelling and contemporary visual aesthetics for broadcast, digital, and social media distribution.
