
Job Overview
Location
Berlin
Job Type
Full-time
Category
Data Engineer
Date Posted
March 4, 2026
Full Job Description
📋 Description
- Join DeepL's pioneering research department, a cross-functional group of research scientists and data engineers dedicated to advancing the frontiers of Artificial Intelligence.
- As a Senior Research Data Engineer, you will play a pivotal role in our Foundation Model track, contributing to the cutting-edge foundation models that power DeepL's AI products.
- Your primary focus will be creating, refining, and managing multi-modal training corpora, owning the entire lifecycle of data collection and preparation pipelines.
- You will handle unstructured data at immense scale, processing petabytes of information and leveraging tens of thousands of CPU cores in a hybrid cloud environment to fuel our most ambitious AI projects.
- This role offers a unique opportunity to work on challenging frontier research projects, collaborating closely with world-class research scientists and fellow research data engineers.
- You will architect, design, and build robust data pipelines capable of efficiently managing petabytes of multi-modal unstructured data, including text, code, images, and audio.
- Contribute to a modern data engineering stack built on state-of-the-art technologies for orchestration and parallel computation, with a strong emphasis on using and contributing to actively developed open-source solutions.
- Proactively identify performance bottlenecks at every level of the system, from individual components to the overall architecture, and apply effective debugging strategies to keep pipelines stable and efficient.
- Use DeepL's extensive on-premise data centers and AWS cloud infrastructure to achieve fast data processing, enabling rapid iteration and experimentation.
- Go beyond traditional "Big Data" and ETL paradigms to engineer and operate complex, production-ready Python data solutions for real-world unstructured data across modalities.
- Build strong collaborative relationships with a diverse range of stakeholders, including research scientists, other research data engineers, and dedicated data tooling and platform teams, ensuring seamless integration and alignment.
- Raise the standard for data engineering excellence within the team, acting as a key owner and champion of the quality, integrity, and availability of our foundation model training data.
- Ensure mission-critical reliability for all data pipeline jobs, maintaining rigorous standards for code quality and operational stability.
- Bring your strengths, including creativity, thoroughness, pragmatism, foresight, ingenuity, and persistence, to elevate the team's collective capabilities and drive innovation.
- This role is central to DeepL's mission to become the global leader in trusted, intelligent AI technology, directly shaping AI products that enhance communication, foster connections, and create meaningful impact.
- You will be at the forefront of AI research, working with data at a scale and complexity few organizations can match, pushing the boundaries of machine learning and natural language processing.
- Working with petabyte-scale, multi-modal unstructured data offers a unique learning and development experience, exposing you to diverse data types and advanced processing techniques.
- By engineering and operating complex Python data solutions, you will gain deep expertise in building scalable, reliable data infrastructure for AI model training.
- Collaborating with researchers and engineers will give you invaluable insight into the AI product development lifecycle and the data challenges of cutting-edge AI research.
- Your role in ensuring the quality and availability of training data directly influences the performance and reliability of DeepL's AI models, making your contribution highly impactful.
- The emphasis on open-source solutions and modern data engineering practices means you will work with the latest tools and methodologies in the field.
- Debugging performance bottlenecks and optimizing code for highly scalable, parallel compute workloads will hone your skills in high-performance computing and distributed systems.
- The hybrid cloud environment, combining on-premise data centers with AWS, gives you a comprehensive understanding of managing and optimizing large-scale data infrastructure across platforms.
- This position is ideal for someone passionate about the intersection of data engineering and AI research, eager to contribute to groundbreaking projects in a fast-paced, innovative environment.
- You will be empowered to take ownership and drive initiatives, contributing to the strategic direction of data infrastructure for foundation models.
- The role demands deep technical expertise, strong problem-solving skills, and effective communication to translate complex data needs into tangible engineering solutions.
- Working with multi-modal data will expose you to a wide array of data types, enhancing your versatility and understanding of diverse AI applications.
- The continuous evolution of AI research means you will constantly learn and adapt, staying at the cutting edge of the field.
- Your work will directly contribute to DeepL's mission of making work simpler, smarter, and more connected through advanced AI technologies.
- This is a chance to join a company that is not only a leader in AI but also deeply committed to ethical AI development and user trust.
- The collaborative environment encourages knowledge sharing and continuous improvement, fostering personal and professional growth.
- You will help build the data foundations on which future AI breakthroughs at DeepL will rest.
Skills & Technologies
Python
Go
Rust
AWS
Kubernetes
Senior
Hybrid
About DeepL SE
DeepL SE develops and operates neural machine translation technology. Founded in 2017 as Linguee GmbH, the Cologne-based company rebranded in 2019 after creating DeepL Translator, a service noted for outperforming Google Translate in blind tests. The platform supports 33 languages and offers browser, desktop, mobile, and API access, serving individuals, businesses, and developers worldwide. Revenue comes from tiered subscription plans for pro features and volume-based API usage. DeepL continues to invest in AI research to maintain translation quality and expand language coverage.