This job has expired

This position was posted on February 25, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

AI Benchmark Engineer - Native Language Specialist | Arabic

Lilt Production

Job Overview

Location

Egypt (Remote)

Job Type

Contract

Full Job Description

• Lilt is at the forefront of revolutionizing global communication through AI, and we are seeking a highly skilled AI Benchmark Engineer with native Arabic (Egyptian) fluency to join our innovative team. This is a unique, remote, freelance opportunity to contribute to the development of a cutting-edge evaluation suite for large language models (LLMs), specifically focusing on their multilingual capabilities within complex software challenges.
• Our mission is to create a rigorous, verifiable evaluation suite of Terminal-Bench tasks. These tasks are meticulously designed to push the boundaries of LLMs, assessing their performance in multilingual software environments. We aim to quantify and understand the nuances of multilingual robustness, examining how models handle prompt language variations, process non-English data, and navigate intricate locale and encoding edge cases inherent in terminal workflows.
• As an AI Benchmark Engineer, you will be instrumental in designing, building, and validating these critical benchmarks. Your primary focus will be on crafting high-signal, high-quality tasks that serve as genuine tests of a model's ability to operate effectively in multilingual contexts, without the crutch of English translation. This role requires a deep understanding of both software engineering principles and the specific linguistic and cultural intricacies of the Arabic language.
• Key responsibilities include **Task Engineering**: evaluating the capabilities of coding agents and understanding how they perform complex tasks within a terminal environment.
• A significant part of your role will be **Asset Creation**: building realistic task environments from datasets and files written exclusively in Arabic. These assets must remain in Arabic to provide a true measure of the AI model's multilingual handling capabilities. This requires a keen eye for detail and an understanding of how to create representative and challenging scenarios.
• You will also be involved in **Prompting & Translation** analysis: actively seeking out the points where AI models fail when given tasks or prompts in your native language. Identifying these weaknesses is crucial for improving LLM performance.
• Furthermore, you will contribute to **Implementation & Verification**: supporting the development of robust reference implementations for the benchmark tasks and writing highly reliable, deterministic verifier scripts. The goal is to minimize reliance on subjective rubric-based judging so that evaluation is objective and consistent.
• **Calibration & Execution** is another key area. You will analyze execution logs from benchmark runs to calibrate task difficulty, ranging from Easy to Very Hard. This calibration will be performed using standard Terminal-Bench run configurations against various model tiers, such as Haiku, Sonnet, and Opus, allowing for nuanced performance comparisons.
• Finally, you will participate in **Quality Assurance**, a rigorous 4-layer human quality control process: creation review, human review of generated tasks, calibration review, and final audit. Combined with automated LLM-based checks, this process ensures the fairness, grammatical accuracy, and overall integrity of the benchmarks.
• This role is ideal for experienced software engineers who are native Arabic speakers with a profound understanding of the language's nuances, including grammar, register, and phrasing. A strong command of English is also essential for collaboration and documentation. You should possess extensive experience with Python, standard shell scripting, and data processing, coupled with a deep familiarity with Terminal/CLI-based development workflows and coding agents.
• Domain expertise in the technical challenges of multilingual text processing is highly valued. This includes a solid grasp of encoding/decoding robustness, Unicode normalization, locale-dependent conventions (such as collation, casing, and non-Gregorian dates), text I/O, toolchain interoperability, and safe string operations. For Arabic, specific knowledge of Bidirectional/RTL handling, font fallbacks, and rendering/typography in UI or artifacts is particularly advantageous.
• Join Lilt and be part of a global community dedicated to making the world's information accessible to everyone, regardless of the language they speak. Advance human knowledge, earn money, and have fun working on diverse projects from anywhere, anytime. We offer quick and fair payment, and the opportunity to build your professional network within a supportive community, all through a streamlined application process tailored to your expertise.
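To illustrate the kind of deterministic verifier script and Unicode normalization mentioned above, here is a minimal Python sketch. It is not taken from Lilt or Terminal-Bench tooling: the function names `canonical` and `verify` and the choice of NFC plus whitespace collapsing are assumptions made purely for illustration.

```python
import unicodedata


def canonical(text: str) -> str:
    """Hypothetical helper: NFC-normalize and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def verify(expected: str, actual_bytes: bytes) -> bool:
    """Hypothetical verifier: pass only if the output is valid UTF-8
    and equals the expected text after canonicalization."""
    try:
        actual = actual_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Wrong encoding (for example cp1256 bytes) is a hard failure.
        return False
    return canonical(actual) == canonical(expected)


if __name__ == "__main__":
    expected = "\u0622\u0645\u0646"            # precomposed alef with madda
    decomposed = "\u0627\u0653\u0645\u0646"    # alef + combining madda
    print(verify(expected, decomposed.encode("utf-8")))
    print(verify(expected, expected.encode("cp1256")))
```

The check is a pure function of its inputs, so repeated runs give the same verdict. Whether a real task should treat decomposed and precomposed forms as equal is a per-task design decision; this sketch simply assumes that it should.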

Skills & Technologies

Python

Remote


About Lilt Production

Lilt Production is a full-service video production studio based in Paris, France, creating commercial, corporate, and branded content for agencies and global brands. Services span concept development, live-action filming, motion graphics, post-production, color grading, and localized adaptations. The company operates a bilingual French-English team and works across Europe, the Middle East, and Africa, emphasizing cinematic storytelling and contemporary visual aesthetics for broadcast, digital, and social media distribution.
