Site Reliability Engineer - AI Agents

Job Overview

Location: United Kingdom

Job Type: Full-time

Category: Software Engineering

Date Posted: June 13, 2026

Full Job Description

📋 Description

  • Design, build, and operate the infrastructure layer supporting AI agent workflows in production for both internal tools and external-facing products
  • Ensure reliability, scalability, and observability of agentic systems across Kraken’s crypto trading and financial infrastructure
  • Design and develop platform services, APIs, and SDKs that enable engineering, AI, and data teams to consume AI infrastructure as a self-service platform
  • Manage and maintain compute, orchestration, and model-serving infrastructure powering LLM-based agent execution and inference
  • Implement robust monitoring, alerting, and incident response procedures specifically tailored to AI/ML workloads and agent-based systems
  • Use Infrastructure as Code (IaC) tools, primarily Terraform, to provision and manage AWS cloud infrastructure components
  • Build and maintain CI/CD pipelines for rapid, reliable deployment of AI services and agent workflows
  • Define and implement guardrails, failure handling, and recovery patterns for agentic and LLM-powered systems
  • Collaborate with AI and Data Engineering teams to transition experimental agent prototypes into hardened, production-grade systems
  • Manage containerized workloads using Kubernetes to ensure efficient deployment, scaling, and orchestration of AI services
  • Implement access controls and security best practices across all AI infrastructure environments
  • Document architecture, runbooks, and operational best practices to support knowledge sharing and team scalability
  • Operate as part of a platform engineering team focused on developer experience, platform adoption, and long-term scalability of AI infrastructure
  • Work closely with Data Engineering, ML, and product-facing teams to harden agent infrastructure to meet institutional-grade reliability standards
  • Participate in on-call rotations to respond to production incidents affecting AI agent systems

🎯 Requirements

  • 5+ years of experience as a Site Reliability Engineer, Infrastructure Engineer, Platform Engineer, or similar role in a production environment
  • Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production
  • Experience building developer platforms, internal tooling, APIs, or SDKs consumed by engineering teams at scale
  • Strong understanding of platform engineering principles, including developer experience, self-service infrastructure, and API-driven platform design
  • Proficiency with Infrastructure as Code tools, particularly Terraform
  • Experience with containerization and orchestration, particularly Kubernetes and Docker

🏖️ Benefits

  • Opportunity to work at the intersection of data infrastructure and applied AI in a fast-moving, high-stakes production environment
  • Collaborative culture with cross-functional teams across Data Engineering, ML, and product engineering
  • Exposure to cutting-edge AI agent systems and LLM-powered infrastructure at scale
  • Employment with a globally trusted crypto platform serving over 10 million users
  • Consideration of qualified applicants with criminal histories consistent with the San Francisco Fair Chance Ordinance
  • Equal opportunity employer that values diversity in background, perspective, and experience

Skills & Technologies

Python
AWS
Docker
Kubernetes
Terraform
Onsite

About Kraken

Kraken is a global cryptocurrency exchange established in 2011, offering spot and futures trading for Bitcoin, Ethereum and 200+ other digital assets. Headquartered in San Francisco with entities worldwide, it serves retail and institutional clients, providing custody, staking, an NFT marketplace and an OTC desk. The platform emphasizes security, regulatory compliance and educational resources.
