Site Reliability Engineer - AI Agents

Job Overview

Location: United States

Job Type: Full-time

Category: Software Engineering

Date Posted: June 13, 2026

Full Job Description

📋 Description

  • Design, build, and operate the infrastructure layer supporting AI agent workflows in production across internal tools and external-facing products
  • Ensure reliability, scalability, and observability of agentic systems, including model inference and agent execution pipelines
  • Design and develop platform services, APIs, and SDKs that enable engineering teams to consume AI infrastructure as a self-service platform
  • Manage and maintain compute, orchestration, and serving infrastructure powering AI agents using Kubernetes and Docker
  • Implement Infrastructure as Code (IaC) using Terraform to provision and manage AWS cloud infrastructure components
  • Build and maintain CI/CD pipelines tailored for rapid, reliable deployment of AI services and agent workflows
  • Define and implement guardrails, failure handling, and recovery patterns specific to agentic and LLM-powered systems
  • Establish robust monitoring, alerting, and incident response procedures optimized for ML and AI workloads
  • Collaborate with AI and Data Engineering teams to transition experimental agent prototypes into hardened, production-ready systems
  • Implement access controls and security best practices across AI infrastructure environments to protect sensitive model and data assets
  • Document architecture, runbooks, and operational best practices to enable knowledge sharing and reduce tribal knowledge across teams
  • Participate in on-call rotations to respond to incidents affecting AI agent infrastructure with a focus on rapid resolution and post-mortem analysis
  • Partner with Data Engineering, ML, and product teams to align platform capabilities with evolving product and research needs
  • Prioritize developer experience in platform design, ensuring internal tools and APIs are intuitive, well-documented, and adopted at scale
  • Operate in a fast-moving environment where platform engineering decisions directly impact the reliability of AI products used by millions of users

🎯 Requirements

  • 5+ years of experience as a Site Reliability Engineer, Infrastructure Engineer, Platform Engineer, or similar role in a production environment
  • Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production
  • Proficiency with Infrastructure as Code tools, particularly Terraform
  • Experience with containerization and orchestration, particularly Kubernetes and Docker
  • Solid understanding of cloud infrastructure, preferably AWS
  • Strong scripting skills (bash/shell) and proficiency in at least one programming language (Python preferred)

🏖️ Benefits

  • Opportunity to work on cutting-edge AI agent infrastructure at a leading crypto platform trusted by over 10 million users
  • Collaborative environment working across AI, Data Engineering, and product teams to shape the future of open finance
  • Culture that values diverse perspectives and encourages applications even if all requirements are not fully met
  • Equal opportunity employer with no tolerance for discrimination or harassment based on protected characteristics
  • Consideration of qualified applicants with criminal histories consistent with the San Francisco Fair Chance Ordinance
  • Ability to redact personal information such as age, date of birth, or graduation dates from resumes during application

Skills & Technologies

Python
AWS
Docker
Kubernetes
Terraform
Onsite

About Kraken

Kraken is a global cryptocurrency exchange established in 2011, offering spot and futures trading for Bitcoin, Ethereum and 200+ digital assets. Headquartered in San Francisco with entities worldwide, it serves retail and institutional clients, providing custody, staking, an NFT marketplace and OTC desk. The platform emphasizes security, regulatory compliance and educational resources.
