This job has expired

This position was posted on March 10, 2026 and is likely no longer accepting applications.

Senior Site Reliability Engineer

Havoc AI Inc

Job Overview

Location

Remote

Job Type

Full-time

Full Job Description

Havoc AI Inc. is at the forefront of collaborative autonomy, pioneering solutions for complex human problems through self-tasking machine teams. We are recognized leaders in autonomous surface vessels, serving critical defense and commercial maritime missions. Our rapid growth is fueled by a passion for tackling challenging problems, pushing technological boundaries, and ultimately, preventing conflict and saving lives. We are actively seeking ambitious individuals who are eager to contribute to our mission.

As a Senior Site Reliability Engineer (SRE) at Havoc AI, you will play a pivotal role within our Cloud Platform team. This position demands a seasoned professional with over 7 years of experience in designing, operating, and scaling highly reliable distributed systems. Your primary responsibility will be to ensure the unwavering availability, optimal performance, and robust resilience of our mission-critical services. These services are the backbone of our autonomy, simulation, and data-intensive workloads, making your contribution indispensable.

You will collaborate closely with cross-functional teams, including Cloud Platform, DevOps, Data Engineering, and Autonomy. Your expertise will be crucial in establishing stringent reliability standards, elevating our operational maturity, and architecting systems capable of scaling safely and effectively under demanding real-world conditions. The ideal candidate possesses a deep technical acumen, maintains composure under pressure, and has a proven track record of owning reliability outcomes from inception to completion.

**Reliability Engineering & Architecture:**
• Design, implement, and continuously evolve the reliability architecture for our complex distributed and cloud-hosted systems.
• Define, champion, and embed SRE best practices across the organization, including the meticulous definition and tracking of Service Level Indicators (SLIs), Service Level Objectives (SLOs), error budgets, and comprehensive capacity planning.
• Act as a trusted partner to platform and application teams, guiding them in designing systems with inherent reliability, scalability, and operability at their core.
• Proactively identify and systematically mitigate systemic reliability risks that span across our infrastructure and diverse service landscape.

**Operations & Incident Management:**
• Take a leading role in our incident response processes, including managing on-call rotations, orchestrating escalations, and conducting thorough post-incident reviews to extract actionable insights.
• Perform in-depth root cause analysis for complex production incidents, driving the implementation of long-term, sustainable improvements to prevent recurrence.
• Enhance operational readiness through the development and maintenance of comprehensive runbooks, robust automation, and rigorous resilience testing.
• Systematically reduce operational toil by identifying opportunities for tooling, automation, and process optimization.

**Observability & Performance:**
• Design, implement, and maintain sophisticated observability systems encompassing metrics, logging, tracing, and alerting to provide deep insights into system behavior.
• Ensure that all services and data pipelines are not only observable and debuggable but also performant in production environments.
• Lead performance analysis initiatives and drive tuning efforts across all layers of our infrastructure and service stack.

**Automation & Platform Collaboration:**
• Develop and deploy automation solutions to enhance system reliability, ensure deployment safety, and streamline recovery processes.
• Collaborate closely with DevOps and Cloud Platform teams to bolster CI/CD reliability, refine rollout strategies, and implement safe deployment patterns.
• Provide expert support and drive improvements for our Kubernetes-based environments and containerized workloads.

**Security & Resilience:**
• Work hand-in-hand with security teams to embed secure and resilient design principles into all aspects of our systems.
• Actively participate in the planning and execution of disaster recovery strategies and testing.
• Uphold and advance strong operational practices related to access control, secrets management, and change management to ensure system integrity and security.

Skills & Technologies

Python

AWS

Kubernetes

Linux

Senior

Remote

About Havoc AI Inc

Havoc AI Inc provides an autonomous drone-swarm platform for defense and security applications. The system integrates computer vision, real-time coordination, and modular payload capabilities to enable surveillance, reconnaissance, and kinetic effects at scale. Designed for contested environments, the software stack supports rapid deployment, adaptive mission planning, and human-on-the-loop oversight. Headquartered in Seattle, the company serves U.S. and allied government customers, focusing on asymmetric advantages through low-cost, high-volume unmanned systems.
