Member of Engineering (Pre-training / Data Research)

Poolside AI, Inc.

Job Overview

Location: Remote (EMEA/East Coast)

Job Type: Full-time

Full Job Description

📋 Description

• Work on the data team to enhance the quality of pretraining datasets used for training Poolside’s language and coding agent models, with a primary focus on synthetic data generation and data mix optimization.
• Design and implement complex, scalable data pipelines capable of generating vast volumes of high-quality, diverse natural language and source code data while optimizing resource usage across distributed systems.
• Collaborate directly with Pretraining, Posttraining, Evals, and Product teams to align dataset requirements with missing model capabilities and real-world downstream use cases, ensuring rapid feedback loops.
• Stay current with the latest research in LLMs, dataset design, and pretraining techniques, including transformer architectures, scaling laws, and training dynamics for reasoning and agentic models.
• Conduct and analyze data ablation studies and training experiments to derive quantitative insights that improve dataset quality and model performance.
• Leverage extensive GPU clusters and distributed data infrastructure to process and curate trillion-scale datasets, applying best practices in deduplication, tokenization, data mixing, curriculum design, and mitigation of data-repetition effects.
• Maintain deep familiarity with leading open-source LLMs and datasets, using this knowledge to inform data selection, filtering, and augmentation strategies.
• Apply advanced prompt engineering skills to guide synthetic data generation and evaluate output quality across domains including general knowledge, reasoning, math, coding, and long-context understanding.
• Own original research initiatives through short, time-bounded experiments that directly feed into production data systems and model training cycles.
• Contribute to the development of a high-performance data infrastructure stack that supports continuous scaling of training data for next-generation AI models.
• Communicate research findings and technical decisions clearly to cross-functional teams, ensuring alignment on dataset goals and model performance thresholds.
• Actively participate in the company’s monthly in-person collaboration weeks in Paris (Monday–Wednesday), with optional extended stays, and contribute to annual off-site events.
• Maintain an obsession with data quality, rigorously evaluating datasets for bias, redundancy, relevance, and alignment with model training objectives.
• Engage with the broader AI research community by reading, discussing, and applying insights from the latest papers on applied deep learning, source code generation, and LLM training.
• Authoring or co-authoring scientific publications on relevant topics is encouraged but not required; active discussion of fine-grained technical details in research papers is expected.
• Work in a fully remote, distributed environment spanning EMEA and East Coast time zones, with flexible hours and autonomy over workflow design.

🎯 Requirements

• Strong machine learning and engineering background
• Experience with Large Language Models (LLMs), including understanding of transformer architectures, scaling laws, and training reasoning/agentic models
• Experience building trillion-scale pretraining datasets, with knowledge of data curation, deduplication, mixing, tokenization, and curriculum design
• Excellent programming skills in Python
• Strong prompt engineering skills and experience working with large-scale GPU clusters and distributed data pipelines
• Experience with evals tracking model capabilities (general knowledge, reasoning, math, coding, long-context, etc.)

🏖️ Benefits

• Fully remote work & flexible hours
• 37 days/year of vacation & holidays
• Health insurance allowance for you & dependents
• Company-provided equipment
• Well-being, always-be-learning & home office allowances
• Frequent team get-togethers, including monthly in-person weeks in Paris and annual off-sites

Skills & Technologies

Python

Remote

About Poolside AI, Inc.

Poolside AI develops and operates a cloud-based platform that turns natural-language prompts into functioning software. Using large-scale language models trained on public and proprietary code, the system autonomously writes, tests, and refines programs, enabling users to create applications, scripts, and data workflows without traditional coding. Founded in 2023 and headquartered in Paris with offices in New York, the company serves individual developers, startups, and enterprise teams seeking faster, more accessible software creation.
