
Job Overview
Location: Remote
Job Type: Full-time
Category: Software Engineering
Date Posted: June 26, 2026
Full Job Description
📋 Description
- Set the reliability strategy for Wisdom’s platform, defining SLOs, error budgets, and operating standards for systems that handle real dental billing transactions with zero tolerance for downtime.
- Own end-to-end observability using Datadog: implement tracing, metrics, logging, and alerting that surface issues before users are impacted, so that any engineer can lead incidents without relying on the original code authors.
- Define operational patterns for LLM-driven agentic workflows (using Anthropic, Mastra), including retries, backpressure, idempotency, graceful degradation, and capacity controls to prevent batch blowups, stream drops, runaway costs, or model misbehavior in production.
- Harden integrations with dental insurance carriers and practice management systems (Dentrix, Eaglesoft) that are poorly documented, inconsistent, and prone to failure under load.
- Own the deploy and release engineering pipeline: implement fast, safe, reversible deploys using infrastructure as code (Terraform), so the team can ship multiple times daily without compromising stability.
- Build and institutionalize the incident response practice: establish on-call rotations, detailed runbooks, a blameless post-incident culture, and follow-up discipline to turn outages into permanent, team-owned fixes.
- Raise the reliability bar across the engineering team through code reviews, architectural guidance, and actionable documentation that is consistently referenced and adopted.
- Drive resolution of ambiguous, company-level reliability problems without waiting for formal briefs or permission, taking ownership of undefined challenges that impact production stability.
- Operate and debug distributed services on AWS with first-principles reasoning, ensuring systems remain resilient under pressure and scale with business growth.
- Implement and maintain infrastructure as code (Terraform), container orchestration (ECS/Kubernetes), and CI/CD pipelines to make deployment processes predictable, repeatable, and low-risk.
- Apply deep production experience with at least one major LLM API (OpenAI, Anthropic, or Google Vertex AI), managing operational realities such as rate limits, latency, cost control, and failure modes in live systems.
- Write and debug application-level code in TypeScript/JavaScript, not just infrastructure scripts, ensuring full-stack understanding of the systems that power billing workflows.
- Manage relational database performance (Postgres) with expertise in connection pooling, query optimization, and data integrity under high load.
- Lead by example: default to ownership, respond to pagers proactively, surface bad news early, change position based on evidence, and write postmortems that improve team-wide practices.
- Mentor engineers and establish technical standards that outlive individual contributions, enabling the entire team to operate at a higher reliability level without constant oversight.
- Ensure all infrastructure and processes comply with HIPAA requirements for handling protected health information in a healthcare technology environment.
🎯 Requirements
- 8+ years running production systems with staff/principal-level ownership of reliability in high-stakes environments
- Deep AWS experience deploying, operating, and debugging distributed services in production
- Hands-on production experience operating LLM APIs (Anthropic, OpenAI, or Google Vertex AI) with a focus on rate limits, cost, latency, and failure modes
- Strong command of TypeScript/JavaScript and relational databases (Postgres)
- Proven expertise in infrastructure as code (Terraform), containers (ECS/Kubernetes), and CI/CD pipelines
- Experience building incident response practices, on-call rotations, and blameless postmortem cultures
🏖️ Benefits
- Fully remote role with no geographic restrictions within the US
- Reporting directly to the Head of Engineering
- Opportunity to build reliability practices from scratch at a Series A startup
- Work with cutting-edge LLM-driven agentic systems in a regulated healthcare environment
- Join a high-trust, small engineering team shaping the future of dental billing technology
- Equal opportunity employer with inclusive policies covering all protected statuses
About Wisdom Health Inc.
Wisdom Health offers at-home DNA testing for dogs and cats, enabling pet owners and veterinarians to identify breed ancestry, genetic health risks, and traits. The company processes samples via cheek swabs and delivers online reports with actionable insights for personalized care. Its products include Wisdom Panel and Optimal Selection, supported by a CLIA-certified laboratory, an extensive breed database, and ongoing research collaborations with academic and veterinary institutions.