This job has expired

This position was posted on February 25, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

AI Benchmark Engineer - Native Language Specialist | Spanish

Lilt Production

Job Overview

Location

Spain (Remote)

Job Type

Contract

Full Job Description

📋 Description

• Are you a seasoned software engineer with a passion for language and a keen eye for detail? Lilt Production is seeking a highly skilled AI Benchmark Engineer, specializing as a Native Language Specialist for Spanish, to join our groundbreaking initiative. This is a unique, remote, freelance opportunity to contribute to the development of a rigorous, verifiable evaluation suite for large language models (LLMs). Our mission is to push the boundaries of multilingual AI by creating sophisticated benchmark tasks that meticulously test LLMs' capabilities in handling complex software challenges across various languages, with a specific focus on Spanish.
• In this pivotal role, you will be instrumental in designing, building, and validating these crucial benchmarks. Your primary objective will be to engineer high-signal, high-quality tasks that genuinely assess an LLM's ability to navigate and perform within multilingual environments. This means creating scenarios that do not rely on English as a crutch, forcing the models to demonstrate true multilingual robustness. You will delve into prompt language effects, non-English data processing, and intricate locale and encoding edge cases inherent in terminal workflows.
• Your responsibilities will span several key areas. As a Task Engineer, you will focus on evaluating Coding Agents, understanding their strengths and weaknesses in practical, language-specific scenarios. A significant part of your work will involve Asset Creation, where you will build realistic task environments. This entails curating and using datasets and files exclusively in Spanish, so that the integrity of the multilingual testing is maintained. You will also be actively involved in Prompting & Translation, specifically identifying the points at which AI models fail when interacting with Spanish-language inputs and commands.
• Furthermore, you will contribute to Implementation & Verification. This involves supporting the development of robust reference implementations for the tasks you design and writing highly reliable, deterministic verifier scripts. The emphasis will be on deterministic, automated verification, with rubric-based judging used only when absolutely necessary, so that the benchmarks are as objective as possible.
• Calibration & Execution will be another critical aspect of your role. You will analyze execution logs from various model tiers (such as Haiku, Sonnet, and Opus) and calibrate task difficulty, ranging from Easy to Very Hard, using standard Terminal-Bench run configurations. This iterative process ensures that our benchmarks provide meaningful and nuanced evaluations of LLM performance.
• Quality Assurance is paramount. You will participate in a stringent, four-layer human quality control process. This includes rigorous checks at the creation, human review, calibration review, and audit stages. This process, complemented by automated LLM-based checks, guarantees the fairness, grammatical accuracy, and overall integrity of our benchmarks. Your expertise as a native Spanish speaker will be invaluable in ensuring these benchmarks are culturally and linguistically sound.
• This role demands a deep technical understanding of the nuances and pitfalls of multilingual text processing. You will need to be adept at handling encoding/decoding robustness and Unicode normalization, and at working with locale-dependent conventions such as collation, casing, and non-Gregorian dates. Familiarity with text I/O, toolchain interoperability, and safe string operations is essential. Spanish is written left to right, so bidirectional text arises only in mixed-script content, but other considerations include font fallbacks and rendering/typography in user interfaces or generated artifacts. Covering these cases ensures a comprehensive test of the models' capabilities.
• LILT is at the forefront of transforming how the world communicates through AI. Our mission is to make global information accessible to everyone, regardless of their spoken language. By joining our team, you become part of a global community that thrives on innovation, excellence, and a shared commitment to advancing human knowledge. We deliver cutting-edge multilingual AI and human-verified services to enterprises, governments, and AI developers worldwide. This freelance opportunity offers the chance to earn money, have fun, and contribute to a significant technological advancement, all while working remotely on diverse projects at your own pace.
• This is a remote, freelance position based in Spain. We are looking for individuals who are passionate about AI, language, and ensuring the quality and fairness of AI systems. If you are ready to apply your software engineering expertise to a challenging and rewarding project, we encourage you to submit your application.
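To illustrate the kind of pitfall the description refers to (Unicode normalization and locale-dependent collation in a deterministic verifier), here is a minimal Python sketch. It is not taken from the benchmark itself: the function names `normalize_text`, `outputs_match`, and `spanish_sort_key` are hypothetical, and the sort key is only a rough approximation of Spanish alphabetical order that uses the standard library alone.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Return text in NFC form with surrounding whitespace stripped."""
    return unicodedata.normalize("NFC", text).strip()


def outputs_match(expected: str, actual: str) -> bool:
    """Deterministically compare two strings after NFC normalization."""
    return normalize_text(expected) == normalize_text(actual)


def spanish_sort_key(word: str) -> list:
    """Approximate Spanish ordering: accents are ignored, ñ sorts after n.

    Hypothetical helper for illustration; a real task might rely on a
    full collation library instead.
    """
    key = []
    for ch in unicodedata.normalize("NFC", word).casefold():
        if ch == "ñ":
            # Treat ñ as its own letter placed immediately after n.
            key.append(("n", 1))
        else:
            # Drop combining accents by keeping only the base character.
            base = unicodedata.normalize("NFD", ch)[0]
            key.append((base, 0))
    return key


# "canción" written with a precomposed ó versus o + combining acute accent.
composed = "canci\u00f3n"
decomposed = "cancio\u0301n"

words = ["ñu", "nube", "niño", "nácar", "nada"]
by_code_point = sorted(words)
by_spanish_key = sorted(words, key=spanish_sort_key)
```

In this sketch the two spellings of "canción" differ under a raw `==` comparison but match after normalization, and plain code-point sorting places "nácar" after "nube", whereas the approximate Spanish key places it first.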

Skills & Technologies

Python

Remote

About Lilt Production

Lilt Production is a full-service video production studio based in Paris, France, creating commercial, corporate, and branded content for agencies and global brands. Services span concept development, live-action filming, motion graphics, post-production, color grading, and localized adaptations. The company operates a bilingual French-English team and works across Europe, the Middle East, and Africa, emphasizing cinematic storytelling and contemporary visual aesthetics for broadcast, digital, and social media distribution.
