This job has expired

This position was posted on March 27, 2026 and is likely no longer accepting applications. It is kept here for historical reference.

Site Reliability Engineer II

Veritone, Inc.

Job Overview

Location

United States of America - Remote

Job Type

Full-time

Full Job Description

📋 Description

• As a Site Reliability Engineer II at Veritone, Inc., you will play a critical role in ensuring the reliability, scalability, and performance of AI-driven SaaS platforms that power innovative machine learning workloads across cloud environments. Your expertise will directly impact system uptime, security, and the ability to rapidly deliver features while maintaining strict SLAs in a fast-paced, remote-first organization.
• You will design, deploy, and maintain resilient infrastructure optimized for AI/ML applications, including GPU provisioning and MLOps integrations, while leading automation efforts to achieve self-healing systems through advanced monitoring, alerting, and incident response mechanisms. Your work will bridge software engineering and operations to enable rapid, safe releases and proactive system improvements.
• You will join a dynamic, AI-first engineering team at Veritone that values collaboration, innovation, and technical excellence, working closely with software development, release, and build teams to solve complex challenges in large-scale distributed systems supporting enterprise AI applications.
• You will deepen your expertise in cloud-native technologies, infrastructure-as-code, observability, and AI infrastructure scaling, while developing leadership skills in guiding reliability best practices, conducting blameless post-mortems, and shaping long-term architectural strategy for mission-critical platforms.
• Deploy and maintain a resilient, secure, and efficient SaaS application platform to meet established SLAs.
• Build and maintain robust CI/CD pipelines and developer platforms to empower engineering teams to release features quickly and safely.
• Design and deploy scalable infrastructure specifically optimized for AI/ML workloads, including managing GPU resources and integrating MLOps tools.
• Automate monitoring, management, and incident response, working toward an auto-remediation system.
• Participate in on-call rotation to ensure stability and uptime for our platforms.
• Scale infrastructure to meet rapidly increasing demand.
• Independently design and develop tools to aid in operations and automation for AI, and work jointly with other team members to deliver innovative solutions to complex business and technical challenges.
• Provide deployment and operations support for multi-tiered distributed software applications.
• Estimate engineering effort, plan implementation, and roll out system changes that meet requirements for functionality, performance, scalability, reliability, and adherence to development goals and principles.
• Collaborate in a fast-paced environment with multiple teams (software development, release management, build and release, etc.).
• Define how the behavior of large-scale systems can be achieved through engineering and operations automation.
• Measure and achieve reliability through engineering and operations automation.
• Develop monitoring and alerting systems, documentation, and management processes, with the goal of creating an auto-remediation system that improves platform stability.
• Adapt security controls to products not typically native to GA releases.
• Develop automation methods to extend standard deployment pipelines for bespoke implementations.
• Handle patching, configuration management, policy enforcement, and audit of production systems.
• Drive the Disaster Recovery process.

🎯 Requirements

• 5+ years of professional Linux and Windows systems and software management experience.
• Expertise with Infrastructure-as-Code tools such as Terraform and CloudFormation.
• Proficiency in programming languages including Python, Go, and Node.js.
• Hands-on experience managing infrastructure across AWS, Azure, and GCP.
• Expertise in Kubernetes management, including upgrades and operations in production environments.
• Strong scripting skills for systems and data-driven solutions using Bash, Python, or similar.
• Proven experience with GitOps and CI/CD pipelines using tools like Jenkins, ArgoCD, Helm, and GitHub Actions.
• Demonstrated ability to lead root-cause analysis (RCA) and blameless post-mortems, driving strategic architectural changes to prevent incident recurrence.
• Experience acting as an infrastructure consultant to software engineering teams, guiding reliability best practices during design phases.
• Track record of identifying systemic weaknesses and advocating for reliability roadmap improvements.
• Deep background in monitoring and alerting systems (Prometheus, Grafana, Thanos, CloudWatch) and building auto-remediation capabilities.
• Familiarity with deploying, scaling, and observing AI/ML models, vector databases, or LLMs in production.
• Proven success in standardizing security controls and configuration management across large-scale, multi-environment infrastructure.
• Comfort working within project/task management platforms (e.g., Jira, Asana, Trello).

🏖️ Benefits

• Competitive base salary range of $130,000 to $140,000 annually (for Colorado and California residents; actual pay based on skills and experience).
• Eligibility for additional compensation including incentive bonuses, health benefits, retirement plans, life insurance, and paid time off.
• Access to parental leave and benefits, supporting work-life balance and family well-being.
• Opportunity to work remotely from anywhere in the United States with a flexible, distributed team.
• Exposure to cutting-edge AI/ML infrastructure challenges, including GPU provisioning and MLOps at scale.
• Professional growth through involvement in complex, large-scale systems affecting enterprise AI applications.

Skills & Technologies

Python

JavaScript

Java

Node.js

PostgreSQL

Remote

Degree Required

About Veritone, Inc.

Veritone is a leader in artificial intelligence (AI) solutions, providing a powerful AI operating system that enables organizations to harness the full potential of their data. Their platform integrates and analyzes vast amounts of structured and unstructured data, uncovering insights and automating processes across various industries. Veritone's solutions cater to sectors such as media and entertainment, government, legal, and energy, offering capabilities like content analysis, forensic analytics, and intelligent automation. By leveraging advanced AI models, Veritone empowers businesses to make better decisions, enhance operational efficiency, and unlock new revenue streams through data-driven innovation. Their commitment is to democratize AI and make it accessible for widespread adoption.