Senior Site Reliability Engineer

Job Overview

Location

Nice, Indiana, USA

Job Type

Full-time

Category

Software Engineering

Date Posted

March 10, 2026

Full Job Description

Description

Havoc AI Inc. is at the forefront of collaborative autonomy, pioneering solutions for complex human problems through self-tasking machine teams. We are recognized leaders in autonomous surface vessels, serving critical defense and commercial maritime missions. Our rapid growth is fueled by a passion for tackling challenging problems, pushing technological boundaries, and ultimately preventing conflict and saving lives. We are actively seeking ambitious individuals who are eager to contribute to our mission.

As a Senior Site Reliability Engineer (SRE) at Havoc AI, you will play a pivotal role within our Cloud Platform team. This position demands a seasoned professional with over 7 years of experience in designing, operating, and scaling highly reliable distributed systems. Your primary responsibility will be to ensure the availability, performance, and resilience of our mission-critical services. These services are the backbone of our autonomy, simulation, and data-intensive workloads.

You will collaborate closely with cross-functional teams, including Cloud Platform, DevOps, Data Engineering, and Autonomy. Your expertise will be crucial in establishing stringent reliability standards, elevating our operational maturity, and architecting systems capable of scaling safely and effectively under demanding real-world conditions. The ideal candidate possesses deep technical acumen, maintains composure under pressure, and has a proven track record of owning reliability outcomes from inception to completion.

**Reliability Engineering & Architecture:**
  • Design, implement, and continuously evolve the reliability architecture for our complex distributed and cloud-hosted systems.
  • Define, champion, and embed SRE best practices across the organization, including defining and tracking Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets, and carrying out capacity planning.
  • Act as a trusted partner to platform and application teams, guiding them in designing systems with reliability, scalability, and operability at their core.
  • Proactively identify and mitigate systemic reliability risks that span our infrastructure and services.

**Operations & Incident Management:**
  • Take a leading role in our incident response processes, including managing on-call rotations, orchestrating escalations, and conducting thorough post-incident reviews to extract actionable insights.
  • Perform in-depth root cause analysis for complex production incidents, driving long-term improvements that prevent recurrence.
  • Enhance operational readiness by developing and maintaining runbooks, automation, and resilience testing.
  • Systematically reduce operational toil by identifying opportunities for tooling, automation, and process optimization.

**Observability & Performance:**
  • Design, implement, and maintain observability systems encompassing metrics, logging, tracing, and alerting to provide deep insight into system behavior.
  • Ensure that all services and data pipelines are observable, debuggable, and performant in production environments.
  • Lead performance analysis initiatives and drive tuning efforts across all layers of our infrastructure and service stack.

**Automation & Platform Collaboration:**
  • Develop and deploy automation to enhance system reliability, ensure deployment safety, and streamline recovery processes.
  • Collaborate closely with DevOps and Cloud Platform teams to bolster CI/CD reliability, refine rollout strategies, and implement safe deployment patterns.
  • Provide expert support and drive improvements for our Kubernetes-based environments and containerized workloads.

**Security & Resilience:**
  • Work hand-in-hand with security teams to embed secure and resilient design principles into all aspects of our systems.
  • Actively participate in the planning and execution of disaster recovery strategies and testing.
  • Uphold and advance strong operational practices related to access control, secrets management, and change management to ensure system integrity and security.

Skills & Technologies

Python
AWS
Kubernetes
Linux
Senior
Remote

About Havoc AI Inc

Havoc AI Inc provides an autonomous drone-swarm platform for defense and security applications. The system integrates computer vision, real-time coordination, and modular payload capabilities to enable surveillance, reconnaissance, and kinetic effects at scale. Designed for contested environments, the software stack supports rapid deployment, adaptive mission planning, and human-on-the-loop oversight. Headquartered in Seattle, the company serves U.S. and allied government customers, focusing on asymmetric advantages through low-cost, high-volume unmanned systems.
