
Job Overview
Location
US Remote
Job Type
Full-time
Category
Data Science
Date Posted
June 13, 2026
Full Job Description
📋 Description
- Design, build, and operate the production infrastructure powering AI/ML systems at Hims & Hers, including Kubernetes clusters (EKS), autoscaling (Karpenter), storage (EBS CSI), and workload isolation across staging and production environments.
- Own and evolve GitOps-based deployment pipelines using Helm and Kustomize overlays to enable safe, repeatable shipping of AI services across environments.
- Design ephemeral and preview environments, feature-branched deployments, and nightly release pipelines to validate AI changes in production-like conditions before deployment.
- Build and scale inference infrastructure for LLM-powered workflows, including multi-provider AI gateways (e.g., Bedrock, Vertex AI), managing credentials, rate limits, failover, routing, grounding, tool execution, and context assembly at the platform level.
- Create reusable infrastructure abstractions and contracts to standardize how AI services are deployed, configured, and consumed across engineering teams.
- Own the LLM observability and tracing stack, including provisioning and scaling Langfuse, Datadog (dd-trace), OpenTelemetry (OTLP), and underlying datastores like ClickHouse to ensure AI behavior is auditable and debuggable in production.
- Build analytics and monitoring pipelines that surface latency, error, quality, and regression signals to engineering and clinical stakeholders.
- Define SLOs, alerting rules, on-call runbooks, and incident response protocols for AI infrastructure; lead troubleshooting and continuously improve platform reliability.
- Own and improve the monorepo build system and CI/CD pipelines for AI workloads, including Docker image builds, automated PR checks, convention enforcement, and cross-platform test execution.
- Develop and maintain shared infrastructure tooling, CLIs, and IaC modules (Terraform, Scalr) used daily by AI and product engineers.
- Identify and eliminate platform bottlenecks to reduce CI/CD cycle times, build latency, and deployment friction, improving developer velocity across the Applied AI organization.
- Implement IAM, OIDC, and secrets management as first-class infrastructure components with least-privilege roles, write-only secret rotation, and cross-account access audits.
- Encode security-by-default, scope boundaries, and access controls into the platform to ensure HIPAA compliance and privacy-first AI systems.
- Partner with clinical, legal, security, and data platform teams (including Databricks/Unity Catalog) to enforce compliant, auditable data access and governance.
- Drive multi-quarter infrastructure initiatives including cluster architecture, deployment strategy, GPU compute planning, and observability evolution.
- Write and lead technical design documents (TDDs/RFCs), define infrastructure standards and development workflow conventions, and contribute to technical governance across AI engineering.
- Mentor engineers on reliability engineering, infrastructure-as-code, and MLOps best practices, bridging the gap between prototypes and production-grade AI systems.
🎯 Requirements
- 8+ years of professional experience in infrastructure, platform, DevOps, or SRE engineering, with at least 3 years focused on ML/AI systems in production.
- Deep, hands-on experience with Kubernetes (ideally EKS) and the cloud-native ecosystem: autoscaling, GitOps, Helm/Kustomize, operating clusters at scale, and general process/job orchestration.
- Strong infrastructure-as-code skills (Terraform) and experience designing secure cloud architectures: IAM, OIDC, secrets management, and least-privilege access.
- Strong proficiency in Python, with experience building production infrastructure tooling, CLIs, and data/observability pipelines.
- 2+ years of experience operating LLM-based systems in production (LLMOps): inference routing, serving, tracing, and the reliability patterns needed to run them at scale.
- Hands-on experience with observability/tracing stacks (Datadog, OpenTelemetry, Langfuse, or equivalent) and metrics/log/trace pipelines.
🏖️ Benefits
- Competitive salary & equity compensation for full-time roles
- Unlimited PTO, company holidays, and quarterly mental health days
- Comprehensive health benefits including medical, dental & vision, and parental leave
- Employee Stock Purchase Program (ESPP)
- 401k benefits with employer matching contribution
- Offsite team retreats
About Hims & Hers Health, Inc.
Hims & Hers Health is a telehealth platform providing online consultations, prescription medications and over-the-counter wellness products for conditions such as hair loss, erectile dysfunction, anxiety, depression, skin care, and sexual health. Operating in the United States and select international markets, the company connects patients with licensed physicians and pharmacies, delivering treatments through subscription plans and direct-to-consumer shipping, while emphasizing privacy, affordability and accessibility.