Onebrief Inc.

Senior Site Reliability Engineer (Colorado Springs, CO)

Job Overview

Location

Colorado Springs, CO

Job Type

Full-time

Category

Software Engineering

Date Posted

May 21, 2026

Full Job Description

📋 Description

  • Own the reliability, scalability, and security of Onebrief’s production applications and platforms across both on-premise DoD environments and AWS/AWS GovCloud cloud infrastructure.
  • Design, implement, and manage a world-class observability platform using Prometheus, Loki, Alloy, and Grafana to create actionable insights and automated alerting that prevent user-impacting issues before they occur.
  • Define, measure, and own Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to establish measurable reliability benchmarks, increasing internal and external trust in system performance.
  • Lead incident response as primary responder or incident commander during critical production outages, directing real-time troubleshooting and conducting blameless post-mortems or After Action Reviews (AARs) to drive systemic, automated fixes.
  • Partner with platform and application teams to design, build, and maintain secure, resilient Kubernetes clusters using Infrastructure-as-Code (Terraform, Ansible), embedding RMF, STIGs, and other DoD compliance controls directly into automation pipelines.
  • Eliminate operational toil by identifying repetitive tasks and automating them through scripting and tooling, improving team efficiency and system stability in air-gapped and high-security environments.
  • Collaborate with Security, Customer Success, and DevOps teams to ensure deployment processes are repeatable, auditable, and aligned with mission-critical operational standards.
  • Serve as the subject matter expert on system reliability, translating failure modes and operational constraints into scalable, automated guardrails that reduce human error and increase system resilience.
  • Develop and maintain comprehensive runbooks, monitoring dashboards, and alerting policies that enable fast recovery and proactive issue detection across distributed, hybrid cloud/on-prem systems.
  • Contribute directly to improving the end-to-end experience of deploying and managing Onebrief in customer environments, particularly at DoD commands in Colorado Springs, CO, where onsite work is required.
  • Mentor team members and foster a culture of blameless learning, continuous improvement, and shared ownership of system reliability across engineering and operational teams.
  • Ensure all infrastructure and operational practices comply with DoD security frameworks including RMF, STIGs, and ICD 503, and maintain strict adherence to secure configuration standards in all environments.
  • Support the readiness of customer and internal teams for production deployments by sharing best practices for managing applications in restricted, air-gapped, and classified network environments.
  • Drive improvements in CI/CD pipeline reliability and security using GitLab CI/CD, Jenkins, or GitHub Actions to ensure rapid, safe, and auditable releases across hybrid environments.
  • Maintain deep familiarity with core networking protocols and secure network configurations to troubleshoot connectivity, latency, and isolation issues in classified and non-classified DoD networks.
  • Act as the primary point of contact for production incidents and operational escalations, ensuring timely communication and resolution while upholding SLA commitments to military customers.
  • Regularly work on-site at customer locations in Colorado Springs, CO, with a requirement to be physically present at military command sites to support deployments and incident response.

Skills & Technologies

Python
AWS
Kubernetes
Terraform
Jenkins

About Onebrief Inc.

Onebrief develops AI-driven software that creates, updates, and synchronizes military campaign plans across classified and coalition networks. Its platform ingests doctrine, intelligence, and logistics data to generate living briefings, timelines, and risk assessments for joint and allied forces. Designed for secure environments, the system replaces static slide decks with interactive, version-controlled plans that adapt to real-time information, enabling faster decision cycles and unified command intent during multi-domain operations.