Infrastructure Engineer

HappyRobot AI Inc.

Job Overview

Location: Spain

Job Type: Full-time

Full Job Description

📋 Description

• Lead the scaling of operational resilience for a high-stakes AI infrastructure platform that autonomously runs enterprise operations across voice, email, and internal systems.
• Own the stability, observability, and debugging workflows that ensure 24/7 system reliability in production environments with real-world consequences.
• Serve as the primary responder for critical production incidents, diagnosing root causes under pressure and coordinating rapid resolutions with engineering teams.
• Design and implement internal tooling to transform chaotic incident response into structured, repeatable reliability workflows.
• Improve system observability by enhancing log pipelines, custom metrics, and distributed tracing to enable proactive detection of system anomalies.
• Collaborate with backend and AI teams to optimize Kubernetes clusters, service deployments, and infrastructure configurations for performance and fault tolerance.
• Reduce incident load by identifying systemic failure patterns and implementing automation or architectural improvements to prevent recurrence.
• Build and maintain CI/CD pipelines and infrastructure-as-code configurations to ensure consistent, auditable, and scalable deployments.
• Document operational procedures, runbooks, and post-mortems with precision to enable knowledge sharing and team-wide learning.
• Champion Extreme Ownership by stepping in to fix issues outside traditional boundaries, regardless of team or system ownership.
• Apply Craftsmanship to every task — from log analysis to deployment scripts — ensuring high-quality, intentional, and maintainable solutions.
• Practice Urgency with Focus by prioritizing high-impact reliability work that directly improves developer productivity and system uptime.
• Work in a meritocratic environment where technical contribution, not title or tenure, determines ownership and impact.
• Embody the company’s cultural principle of being “majo”: approachable, kind, and collaborative while maintaining high standards.
• Work from Spain with global visibility, supporting a platform used by enterprises worldwide that rely on AI systems for critical operations.
• Contribute to a foundation built from scratch — including custom voice stacks, orchestration layers, and distributed AI workflows — with no legacy constraints.
• Drive a cultural shift from reactive firefighting to proactive reliability engineering, influencing how the entire engineering organization approaches system health.
• Participate in on-call rotations and be the go-to expert for deep-dive investigations into complex, multi-system failures.
• Translate business-critical uptime requirements into technical SLOs, SLIs, and monitoring thresholds that reflect real user impact.
• Work closely with security, data, and product teams to ensure infrastructure meets compliance, scale, and performance demands.

🎯 Requirements

• 3+ years of hands-on experience debugging production systems (logs, traces, incidents, etc.)
• Strong Go and Kubernetes experience
• Familiarity with observability and monitoring tools (e.g., Datadog, Prometheus, Sentry)
• Clear, calm communication under pressure — especially during live incidents
• Strong problem-solving skills and ability to dive into unfamiliar backend codebases
• Experience working with distributed systems or services at scale

🏖️ Benefits

• Opportunity to work at a high-growth AI startup, backed by top investors
• Fast Growth — backed by a16z and YC, on track for double-digit ARR
• Top-Tier Compensation — competitive salary + equity in a high-growth startup
• Ownership & Autonomy — take full ownership of projects and ship fast
• Work With the Best — join a world-class team of engineers and builders

Skills & Technologies

Kubernetes

Prometheus

Datadog

DevOps

Onsite

About HappyRobot AI Inc.

HappyRobot AI builds voice AI agents that automate repetitive phone calls for logistics, freight, and supply-chain companies. Its cloud platform lets shippers, brokers, and carriers offload tasks like appointment scheduling, check calls, and rate negotiation to conversational bots that integrate with TMS, ELD, and CRM systems via API. The company targets mid-market and enterprise freight operations seeking to cut labor costs and accelerate data entry without sacrificing accuracy or carrier relationships.
