This job has expired

This position was posted on October 3, 2025 and is likely no longer accepting applications.

Data Engineer

Jasper AI, Inc.

Job Overview

Location

Paris

Job Type

Full-time

Full Job Description

📋 Description

• Architect and own the end-to-end lifecycle of petabyte-scale data pipelines that ingest, transform, and load multi-modal datasets (images, text, video) into our cloud data warehouse and ML training clusters. You will design fault-tolerant, idempotent workflows that run on Kubernetes and leverage distributed frameworks such as Ray, Spark, or Dask to process millions of assets per day with sub-second latency guarantees.
• Partner daily with research scientists to translate experimental requirements into production-grade data contracts. You will profile raw corpora, identify coverage gaps, and implement automated quality gates that detect label noise, duplication, and demographic bias before data ever reaches a GPU. Your work directly determines the fidelity and fairness of the next generation of Jasper’s generative models.
• Build reusable, versioned datasets optimized for vision-language pre-training. This includes writing deterministic extract-transform-load (ETL) jobs that apply classical computer-vision filters (edge detection, color-space normalization, object detection) and modern foundation-model–based captioning and tagging. You will maintain a feature store that enables researchers to slice data by domain, resolution, or metadata in seconds rather than hours.
• Continuously optimize I/O throughput and memory footprint for distributed training. You will benchmark serialization formats (Parquet, WebDataset, MDS), tune prefetching and caching layers, and implement dynamic batching strategies that keep A100 clusters at 95%+ utilization. Your profiling dashboards will surface pipeline bottlenecks and guide investment in faster storage tiers or smarter sharding schemes.
• Establish rigorous data governance and reproducibility standards. Every transformation will be codified in declarative DAGs (Airflow, Prefect, or Dagster), tracked with Git-based version control, and documented in an internal data catalog. You will champion unit tests for data schemas, enforce SLAs for freshness, and publish lineage graphs so any experiment can be rerun months later with identical inputs.
• Proactively source and license new multi-modal corpora from public repositories, academic datasets, and strategic partners. You will negotiate data-sharing agreements, ensure GDPR/CCPA compliance, and build ingestion connectors that normalize metadata, de-duplicate near-identical assets, and flag restricted content. Your pipeline will automatically tag assets with provenance and usage rights to keep legal risk near zero.
• Foster a culture of observability and continuous improvement. You will set up real-time alerts on data drift, schema evolution, and pipeline failures; run weekly blameless post-mortems; and iterate on SLIs that balance cost, latency, and accuracy. By instrumenting everything from GPU wait-times to token-level label entropy, you will give stakeholders transparent insight into the health of our data platform.
• Mentor junior engineers and data scientists on best practices for scalable data engineering. You will lead internal workshops on PyTorch DataLoader internals, Delta Lake optimization, and cost-aware cloud resource scheduling. Your code reviews will raise the bar for readability, test coverage, and performance, ensuring that every PR moves the platform closer to exabyte readiness.

Skills & Technologies

Remote

Degree Required

About Jasper AI, Inc.

Jasper AI, Inc. provides a generative artificial intelligence platform that helps marketing and content teams create, edit, and optimize written and visual assets at scale. Founded in 2021, the company offers browser extensions, API integrations, and team collaboration tools that use large language models to generate blog posts, emails, ad copy, and social media content while maintaining brand voice consistency. Customers include Fortune 500 enterprises, agencies, and freelance creators seeking to accelerate production workflows and improve conversion performance across channels.
