
Job Overview
Location
Colorado Springs, CO
Job Type
Full-time
Category
Software Engineering
Date Posted
May 21, 2026
Full Job Description
Description
- Own the reliability, scalability, and security of Onebrief’s production applications and platforms across both on-premise DoD environments and AWS and AWS GovCloud infrastructure.
- Design, implement, and manage a world-class observability platform using Prometheus, Loki, Alloy, and Grafana to produce actionable insights and automated alerting that catch issues before they affect users.
- Define, measure, and own Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to establish measurable reliability benchmarks, increasing internal and external trust in system performance.
- Lead incident response as primary responder or incident commander during critical production outages, directing real-time troubleshooting and conducting blameless post-mortems or After Action Reviews (AARs) to drive systemic, automated fixes.
- Partner with platform and application teams to design, build, and maintain secure, resilient Kubernetes clusters using Infrastructure-as-Code (Terraform, Ansible), embedding RMF, STIGs, and other DoD compliance controls directly into automation pipelines.
- Eliminate operational toil by identifying repetitive tasks and automating them through scripting and tooling, improving team efficiency and system stability in air-gapped and high-security environments.
- Collaborate with Security, Customer Success, and DevOps teams to ensure deployment processes are repeatable, auditable, and aligned with mission-critical operational standards.
- Serve as the subject matter expert on system reliability, translating failure modes and operational constraints into scalable, automated guardrails that reduce human error and increase system resilience.
- Develop and maintain comprehensive runbooks, monitoring dashboards, and alerting policies that enable fast recovery and proactive issue detection across distributed, hybrid cloud/on-prem systems.
- Contribute directly to improving the end-to-end experience of deploying and managing Onebrief in customer environments, particularly in DoD commands across Colorado Springs, CO, where on-site work is required.
- Mentor team members and foster a culture of blameless learning, continuous improvement, and shared ownership of system reliability across engineering and operational teams.
- Ensure all infrastructure and operational practices comply with DoD security frameworks including RMF, STIGs, and ICD 503, and maintain strict adherence to secure configuration standards in all environments.
- Support the readiness of customer and internal teams for production deployments by sharing best practices for managing applications in restricted, air-gapped, and classified network environments.
- Drive improvements in CI/CD pipeline reliability and security using GitLab CI/CD, Jenkins, or GitHub Actions to ensure rapid, safe, and auditable releases across hybrid environments.
- Maintain deep familiarity with core networking protocols and secure network configurations to troubleshoot connectivity, latency, and isolation issues in classified and non-classified DoD networks.
- Act as the primary point of contact for production incidents and operational escalations, ensuring timely communication and resolution while upholding SLA commitments to military customers.
- Regularly work on-site at customer locations in Colorado Springs, CO, with a requirement to be physically present at military command sites to support deployments and incident response.
About Onebrief Inc.
Onebrief develops AI-driven software that creates, updates, and synchronizes military campaign plans across classified and coalition networks. Its platform ingests doctrine, intelligence, and logistics data to generate living briefings, timelines, and risk assessments for joint and allied forces. Designed for secure environments, the system replaces static slide decks with interactive, version-controlled plans that adapt to real-time information, enabling faster decision cycles and unified command intent during multi-domain operations.