This job has expired

This position was posted on February 25, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

AI Benchmark Engineer - Native Language Specialist | Korean

Lilt Production

Job Overview

Location

Korea (Remote)

Job Type

Contract

Full Job Description

📋 Description

• Lilt is at the forefront of transforming global communication through AI, and we are seeking a highly skilled and experienced AI Benchmark Engineer with native Korean language expertise to join our innovative team. This is a unique, remote, freelance opportunity to contribute to a groundbreaking project focused on rigorously evaluating the capabilities of large language models (LLMs) in multilingual software challenges. You will play a pivotal role in designing, building, and validating a comprehensive suite of Terminal-Bench tasks specifically engineered to push the boundaries of LLM performance across diverse linguistic landscapes.
• Our core mission is to measure and enhance the multilingual robustness of AI models. This involves assessing their ability to process non-English data, handle complex locale and encoding edge cases within terminal workflows, and remain robust to the effects of the prompt language. As a Native Language Specialist, your deep understanding of Korean will be instrumental in creating high-signal, high-quality tasks that genuinely test an LLM's capacity to handle multilingual environments without relying on English as a crutch. This ensures that our benchmarks provide authentic insights into true multilingual performance.
• Your primary responsibility will be Task Engineering, specifically focusing on the evaluation of Coding Agents. This involves conceptualizing and developing intricate tasks that challenge an AI agent's ability to understand and execute commands, process information, and generate outputs in Korean. You will also be responsible for Asset Creation, which entails building realistic and complex task environments. This includes sourcing and preparing datasets, files, and other necessary components that are entirely in Korean. The integrity of these assets is paramount; they must remain in the target language to provide a genuine test of multilingual handling capabilities.
• A crucial aspect of your role will involve Prompting & Translation, where you will actively seek out and identify failure points in AI performance within the Korean language context. This requires a keen eye for subtle linguistic nuances, idiomatic expressions, and cultural specificities that might trip up an AI. You will leverage your native fluency to uncover these weaknesses, providing invaluable feedback for model improvement.
• Furthermore, you will be involved in Implementation & Verification. This means supporting the development of robust reference implementations that serve as a baseline for comparison. You will also write reliable, deterministic verifier scripts. These scripts automatically assess the correctness and quality of AI outputs, ensuring objective evaluation. Rubric-based judging will be used only when strictly necessary; the emphasis is on automated, verifiable results.
• Calibration & Execution will be another key area of your work. You will meticulously analyze execution logs from benchmark runs to understand AI performance patterns. Based on this analysis, you will calibrate task difficulty, ranging from Easy to Very Hard, using standard Terminal-Bench run configurations. This calibration will be performed against various model tiers, such as Haiku, Sonnet, and Opus, allowing us to understand how different LLMs perform under varying levels of complexity and linguistic challenge.
• Quality Assurance is paramount to the integrity of our benchmarks. You will actively participate in a rigorous, four-layer human quality control process: creation review, human review of AI outputs, calibration review to ensure consistency, and a final audit. This human oversight, combined with automated LLM-based checks, ensures the fairness, grammatical accuracy, and overall integrity of the benchmark. Your expertise will be vital in upholding these high standards.
• This role offers the chance to work on cutting-edge AI technology, directly impacting the development of more capable and globally relevant language models. You will collaborate with a passionate team dedicated to making information accessible to everyone, regardless of the language they speak. By joining Lilt, you become part of a global community that thrives on innovation, excellence, and the advancement of human knowledge. This is an opportunity to earn money, have fun, and contribute to a significant technological transformation from the comfort of your remote workspace in Korea.
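
To give a sense of the deterministic verifier scripts and encoding edge cases mentioned above, here is a minimal Python sketch. It is an illustration only, not Lilt's actual tooling or a Terminal-Bench API: the function names normalize_korean and verify_output are hypothetical, and the comparison rules (UTF-8 only, NFC normalization, trailing whitespace ignored) are assumptions chosen for the example.

```python
import unicodedata


def normalize_korean(text: str) -> str:
    """Normalize text so visually identical Hangul compares equal.

    Assumption for this sketch: composed (NFC) and decomposed (NFD)
    Hangul are treated as equivalent, and trailing whitespace is ignored.
    """
    composed = unicodedata.normalize("NFC", text)
    return "\n".join(line.rstrip() for line in composed.splitlines()).strip()


def verify_output(actual_bytes: bytes, expected: str) -> bool:
    """Return True only if the output is valid UTF-8 and matches expected.

    Hypothetical verifier: same input always yields the same verdict,
    with no randomness, network access, or model-based judging.
    """
    try:
        actual = actual_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Output in another encoding (for example EUC-KR) fails the check.
        return False
    return normalize_korean(actual) == normalize_korean(expected)


if __name__ == "__main__":
    expected = "한글"
    # Decomposed jamo sequence, as some filesystems store filenames.
    decomposed = unicodedata.normalize("NFD", expected).encode("utf-8")
    print(verify_output(decomposed, expected))                 # True
    print(verify_output(expected.encode("euc-kr"), expected))  # False
```

A check of this kind is deterministic by construction, which is why it is preferred over rubric-based judging wherever the expected output can be stated exactly.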

Skills & Technologies

Python

Remote

About Lilt Production

Lilt Production is a full-service video production studio based in Paris, France, creating commercial, corporate, and branded content for agencies and global brands. Services span concept development, live-action filming, motion graphics, post-production, color grading, and localized adaptations. The company operates a bilingual French-English team and works across Europe, the Middle East, and Africa, emphasizing cinematic storytelling and contemporary visual aesthetics for broadcast, digital, and social media distribution.
