📋 Description
• Build and maintain async scrapers using Python and Playwright to extract tender data from European public procurement portals, starting with Italian platforms like Maggioli PortaleAppalti, ANAC, and MePA, with expansion across Europe.
• Handle complex anti-bot protections including FriendlyCaptcha, Mosparo, Cloudflare WAF, and session management (JSESSIONID), implementing IP rotation, rate limit backoff, and retry logic to ensure resilient data collection.
• Parse diverse Italian data formats such as monetary values (€ 1.234.567,89), dates (DD/MM/YYYY, textual), and identifiers (CIG/CUP), with placeholder detection and validation logic.
• Extract and process documents in multiple formats (PDF, PKCS#7-signed .p7m, ZIP/7Z), applying OCR fallback when text extraction fails.
• Integrate scrapers into a Prefect orchestration pipeline with monitoring, alerting, and anomaly detection to ensure data quality and pipeline reliability.
• Store data using a dual-sink architecture with PostgreSQL, Supabase, ClickHouse, and AWS S3, implementing upsert and idempotency patterns for consistency.
• Collaborate in a mission-driven environment focused on creating the data backbone for European public procurement, enabling transparency and access to tender opportunities across 100+ e-procurement systems.
• Continuously adapt scraper strategies to handle varying HTML layouts, SPAs, and dynamic content on portals whose structure differs between pages or regions.
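
To give a flavour of the format-parsing work mentioned above, here is a minimal sketch in Python. It is illustrative only and not taken from our codebase: the function names are hypothetical, the CIG check assumes a 10-character alphanumeric code, and the "placeholder" rule (one character repeated throughout) is a simple heuristic chosen for the example. Textual dates are not covered.

```python
import re
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional

# Assumption: a CIG is a 10-character alphanumeric code.
CIG_PATTERN = re.compile(r"[A-Z0-9]{10}")


def parse_italian_amount(text: str) -> Optional[Decimal]:
    """Parse an Italian-formatted amount such as '€ 1.234.567,89'."""
    # Keep only digits, separators and a minus sign.
    cleaned = re.sub(r"[^\d.,-]", "", text)
    if not any(ch.isdigit() for ch in cleaned):
        return None  # e.g. 'n.d.' or an empty cell
    # '.' is the thousands separator and ',' the decimal separator.
    normalized = cleaned.replace(".", "").replace(",", ".")
    try:
        return Decimal(normalized)
    except InvalidOperation:
        return None


def parse_italian_date(text: str) -> Optional[date]:
    """Parse a numeric DD/MM/YYYY date; return None if it is invalid."""
    try:
        return datetime.strptime(text.strip(), "%d/%m/%Y").date()
    except ValueError:
        return None


def looks_like_cig(text: str) -> bool:
    """Return True for a plausible CIG, False for malformed or placeholder values."""
    candidate = text.strip().upper()
    if not CIG_PATTERN.fullmatch(candidate):
        return False
    # Hypothetical placeholder heuristic: reject values like '0000000000'.
    return len(set(candidate)) > 1
```

Returning None instead of raising keeps a single malformed cell from aborting a whole scrape run; the caller can then decide whether to log, flag, or skip the record.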
🎯 Requirements
• Strong proficiency in async Python (asyncio), with the ability to write efficient, non-blocking scraper logic that does not rely on time.sleep().
• Hands-on experience with Playwright or Selenium, including interception of XHR requests, handling of SPAs, and debugging timing and rendering issues.
• Expertise in handling real-world anti-bot measures such as CAPTCHAs (FriendlyCaptcha, Mosparo), session cookies, IP rotation, and rate limiting with exponential backoff.
• Skill in parsing messy, inconsistent HTML using multi-strategy approaches to extract data from