Member of Engineering (Pre-training / Data Acquisition)

Poolside AI, Inc.

Job Overview

Location

Remote (EMEA/East Coast)

Job Type

Full-time

Full Job Description

📋 Description

• Design, build, and operate a large-scale web crawler to acquire all openly accessible data on the internet for pre-training frontier LLMs.
• Develop specialized deep crawlers targeting high-value sources to improve recall and coverage of data critical for software development models.
• Own the long-term roadmap for data acquisition in collaboration with pre-training data researchers and engineers to align data sourcing with model training objectives.
• Build observability, monitoring, and debugging tooling to ensure reliability, transparency, and performance across the crawl infrastructure.
• Construct high-throughput ingestion pipelines to rapidly onboard and evaluate partner-sourced data for quality and relevance.
• Collaborate closely with pre-training, post-training, and evaluations teams to prioritize data acquisition efforts based on model training needs and performance feedback.
• Ensure all data acquisition systems adhere to responsible crawl practices, including strict compliance with robots.txt, HTTP protocols, and data privacy standards.
• Optimize Python-based systems for performance and scalability under production conditions, debugging complex distributed data pipelines.
• Deploy and manage high-throughput workloads using cloud platforms (AWS) and container orchestration tools (Kubernetes, Docker).
• Maintain a focus on delivering the highest-quality, most diverse, and most comprehensive data corpus to fuel the pre-training of frontier AI models for software development.
• Work within a distributed team across EMEA and the East Coast, participating in monthly in-person collaboration days in Paris (Monday-Wednesday) and annual off-sites.
• Contribute to a culture of intellectual curiosity, low ego, and intentional collaboration centered on building AGI through intelligence systems for software development.
• Operate as the first dedicated data acquisition engineer, shaping foundational systems that directly determine the capability of models trained at Poolside AI.
• Translate research objectives into scalable, production-grade infrastructure for trillion-scale data acquisition and processing.
• Continuously improve data acquisition systems to maximize recall from high-value sources while minimizing resource waste and compliance risk.
• Engage in a structured hiring process, including intro calls with founding engineers, technical interviews with members of engineering, a team fit call with the People team, and a final interview with a founding engineer.

🎯 Requirements

• Strong distributed systems background with proven experience building and operating large-scale infrastructure such as data pipelines or web crawlers
• Proficiency in Python and experience optimizing performance and debugging complex systems under production conditions
• Hands-on experience with web crawling or large-scale data extraction, including understanding of HTTP protocols, distributed job queues, and data parsing at scale
• Familiarity with AWS and container orchestration tools (Kubernetes, Docker) for deploying and managing high-throughput workloads
• Awareness of data privacy, robots.txt adherence, and responsible crawl practices at internet scale

🏖️ Benefits

• Fully remote work & flexible hours
• 37 days/year of vacation & holidays
• 16 weeks of flexible, full-pay parental leave
• Health insurance allowance for you & dependents
• Company-provided equipment
• Well-being, always-be-learning & home office allowances
• Frequent team get-togethers
• Diverse & inclusive people-first culture

Skills & Technologies

• Python
• AWS
• Docker
• Kubernetes
• Design
• Remote

About Poolside AI, Inc.

Poolside AI develops and operates a cloud-based platform that turns natural-language prompts into functioning software. Using large-scale language models trained on public and proprietary code, the system autonomously writes, tests, and refines programs, enabling users to create applications, scripts, and data workflows without traditional coding. Founded in 2023 and headquartered in Paris with offices in New York, the company serves individual developers, startups, and enterprise teams seeking faster, more accessible software creation.
