
Job Overview
Location: Remote
Job Type: Full-time
Category: Data Engineer
Date Posted: February 17, 2026
Full Job Description
📋 Description
- Join ReflectionAI Inc. as a Member of Technical Staff specializing in Data Ingestion Engineering, a pivotal role in our mission to build and democratize open superintelligence. At Reflection, we are at the forefront of developing open-weight models designed for a diverse range of users, from individuals and agents to enterprises and even nation-states. Our distinguished team comprises leading AI researchers and seasoned company builders, with individuals hailing from prestigious institutions such as DeepMind, OpenAI, Google Brain, Meta, Character.AI, and Anthropic.
- In the rapidly evolving landscape of Artificial Intelligence, data has emerged as a cornerstone of innovation. Many of the most significant breakthroughs in recent years have stemmed not from novel architectures, but from the strategic enhancement and utilization of data. As an integral member of our Data Team, your primary objective will be to architect, construct, and maintain the sophisticated ingestion systems responsible for transforming vast datasets from the open web and other large-scale sources into reliable, meticulously structured corpora. These corpora are the lifeblood of our frontier model training pipelines.
- You will take ownership of the entire data acquisition machinery, encompassing the processes of acquiring, extracting, normalizing, versioning, and delivering data. This includes developing and operating large-scale data ingestion systems specifically designed for pre-training, covering aspects like web crawling, data extraction, and the seamless delivery of datasets. Your work will directly influence the performance and capabilities of our cutting-edge AI models.
- A key aspect of this role involves working in close collaboration with our world-class AI researchers. You will be instrumental in closing the feedback loop between the data we collect and its tangible impact on model performance. This symbiotic relationship allows for rapid iteration and continuous improvement, ensuring our data strategies are directly aligned with our research objectives.
- This position is ideally suited for engineers who possess a passion for building robust, distributed systems, but who also thrive in an environment that encourages experimentation, critical reasoning about data acquisition tradeoffs, and swift iteration based on measurable outcomes. You will be empowered to explore new strategies and optimize existing ones.
- Your responsibilities will extend to running experiments aimed at evaluating various crawling strategies, assessing different extraction methodologies, and understanding the complex tradeoffs inherent in the ingestion process. You will analyze the ingested data to proactively identify gaps, detect redundancies, and pinpoint areas ripe for improvement, ensuring the highest quality of training data.
- You will be tasked with building ingestion pipelines that are designed for reliable scalability, capable of handling massive data campaigns efficiently and effectively. Furthermore, you will develop specialized crawlers tailored for high-priority data sources, ensuring we capture the most relevant and valuable information for our models.
- The role also involves a commitment to code quality and system reliability. You will be expected to review code, meticulously debug production issues as they arise, and continuously enhance the ingestion infrastructure to maintain optimal performance and stability. This includes fostering a culture of proactive maintenance and preventative problem-solving.
- Success in this role requires a deep curiosity about how training data fundamentally influences model capabilities. You must possess the agility to iterate quickly, guided by measurable downstream impact, and demonstrate an ability to collaborate effectively across diverse functional teams, including researchers, infrastructure engineers, operations specialists, and external partners. A genuine enjoyment of working in a hybrid research-engineering capacity, where theoretical understanding meets practical application, is essential.
- You will contribute to the development of robust, observable, testable, and maintainable systems capable of handling datasets ranging from multi-terabyte to petabyte scales. This involves a strong understanding of distributed systems principles and best practices for large-scale data processing. Your ability to design experiments and leverage data-driven insights to guide system improvements will be crucial for optimizing our data pipelines and ensuring the quality and relevance of our training corpora.
About ReflectionAI Inc.
ReflectionAI builds autonomous AI agents for enterprise process automation. The platform lets organizations create, deploy, and manage software agents that observe workflows, make decisions, and act across internal systems. Using reinforcement learning and large language models, agents learn from human guidance and adapt to changing environments. Customers use the technology for customer support triage, IT operations, compliance monitoring, and sales process automation, reducing repetitive manual tasks. The company offers cloud-hosted and on-premise deployments, role-based access controls, audit trails, and integrations with common business applications including Salesforce, ServiceNow, Jira, and Slack.



