
Job Overview
Location
Remote (EMEA/East Coast)
Job Type
Full-time
Category
Data Engineer
Date Posted
May 19, 2026
Full Job Description
📋 Description
- Design, build, and operate a large-scale web crawler to acquire all openly accessible data on the internet for pre-training frontier LLMs.
- Develop specialized deep crawlers targeting high-value sources to improve recall and coverage of data critical for software development models.
- Own the long-term roadmap for data acquisition in collaboration with pre-training data researchers and engineers to align data sourcing with model training objectives.
- Build observability, monitoring, and debugging tooling to ensure reliability, transparency, and performance across the crawl infrastructure.
- Construct high-throughput ingestion pipelines to rapidly onboard and evaluate partner-sourced data for quality and relevance.
- Collaborate closely with pre-training, post-training, and evaluations teams to prioritize data acquisition efforts based on model training needs and performance feedback.
- Ensure all data acquisition systems adhere to responsible crawl practices, including strict compliance with robots.txt, HTTP protocols, and data privacy standards.
- Optimize Python-based systems for performance and scalability under production conditions, debugging complex distributed data pipelines.
- Deploy and manage high-throughput workloads using cloud platforms (AWS) and container orchestration tools (Kubernetes, Docker).
- Maintain a focus on delivering the highest-quality, most diverse, and most comprehensive data corpus to fuel the pre-training of frontier AI models for software development.
- Work within a distributed team across EMEA and East Coast, participating in monthly in-person collaboration days in Paris (Monday-Wednesday) and annual off-sites.
- Contribute to a culture of intellectual curiosity, low ego, and intentional collaboration centered on building AGI through intelligence systems for software development.
- Operate as the first dedicated data acquisition engineer, shaping foundational systems that directly determine the capability of models trained at Poolside AI.
- Translate research objectives into scalable, production-grade infrastructure for trillion-scale data acquisition and processing.
- Continuously improve data acquisition systems to maximize recall from high-value sources while minimizing resource waste and compliance risk.
- Engage in a structured hiring process including intro calls with founding engineers, technical interviews with members of engineering, a team fit call with the People team, and a final interview with a founding engineer.
🎯 Requirements
- Strong distributed systems background with proven experience building and operating large-scale infrastructure such as data pipelines or web crawlers
- Proficiency in Python and experience optimizing performance and debugging complex systems under production conditions
- Hands-on experience with web crawling or large-scale data extraction, including understanding of HTTP protocols, distributed job queues, and data parsing at scale
- Familiarity with AWS and container orchestration tools (Kubernetes, Docker) for deploying and managing high-throughput workloads
- Awareness of data privacy, robots.txt adherence, and responsible crawl practices at internet scale
🏖️ Benefits
- Fully remote work & flexible hours
- 37 days/year of vacation & holidays
- 16 weeks of flexible, full-pay parental leave
- Health insurance allowance for you & dependents
- Company-provided equipment
- Well-being, always-be-learning & home office allowances
- Frequent team get-togethers
- Diverse & inclusive people-first culture
About Poolside AI, Inc.
Poolside AI develops and operates a cloud-based platform that turns natural-language prompts into functioning software. Using large-scale language models trained on public and proprietary code, the system autonomously writes, tests, and refines programs, enabling users to create applications, scripts, and data workflows without traditional coding. Founded in 2023 and headquartered in Paris with offices in New York, the company serves individual developers, startups, and enterprise teams seeking faster, more accessible software creation.