This job has expired

This position was posted on January 9, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

Engineering Manager (Infra) - AI Reliability (ANZ Remote)

Canva Pty Ltd

Job Overview

Location

Sydney

Job Type

Full-time

Full Job Description

• Lead the infrastructure that powers Canva’s next-generation creative AI. As Engineering Manager (Infra) – AI Reliability you will own the reliability, scale, and velocity of the systems that enable 100+ researchers in CORE (Canva Original Research & Exploration) to train, evaluate, and ship state-of-the-art models to 170 million monthly users. Every GPU cycle you optimise and every pipeline you harden directly translates into faster breakthroughs and richer creative experiences for our global community.
• Architect and scale multi-cloud, GPU-dense training and inference platforms spanning AWS, GCP, Cloudflare, and GCore. You will design fault-tolerant clusters that can burst from hundreds to thousands of A100/H100 GPUs in minutes, while keeping cost-per-experiment predictable and transparent. Your decisions will determine how quickly researchers can iterate on diffusion, transformer, and multimodal models that redefine design.
• Champion Infrastructure-as-Code excellence. Using Terraform, Helm, and custom tooling you will codify every network, storage, and compute layer so that environments are reproducible, auditable, and disposable. You will institute golden paths that let any researcher spin up a secure, compliant, high-performance workspace with a single CLI command.
• Elevate CI/CD for AI workflows. You will extend our GitHub Actions–based pipelines to support containerised training jobs, model registry promotions, canary releases, and automatic rollback on performance regression. Expect to integrate experiment-tracking (MLflow), data-versioning (DVC), and artefact caching so that the journey from Jupyter notebook to production endpoint is measured in hours, not weeks.
• Build world-class observability for AI workloads. You will define SLIs/SLOs for GPU utilisation, training throughput, inference latency, and cost-per-token. Using Prometheus, Grafana, Loki, and OpenTelemetry you will create dashboards that surface anomalies before researchers notice them and alerts that wake you (not the on-call) only when it matters.
• Foster a culture of DevOps best practices across CORE and the wider engineering org. You will coach a team of senior Site Reliability, Platform, and ML engineers, run blameless post-mortems, and institutionalise chaos-engineering drills that prove our systems are as resilient as our ambitions.
• Drive strategic alignment with Canva’s CORE leadership and cross-functional product teams. You will translate research roadmaps into infrastructure epics, negotiate cloud budgets, and present reliability wins to the CTO and executive staff. Your roadmap will balance bleeding-edge experimentation with rock-solid production stability.
• Stay hands-on. Whether it’s debugging a CUDA memory leak, tuning NCCL collectives, or reviewing Terraform modules, you will lead by example and keep your technical edge sharp. You will also represent Canva at meetups and conferences, sharing how we scale AI infrastructure for creativity at planetary scale.

Skills & Technologies

AWS

GCP

Kubernetes

Terraform

Onsite

Remote

About Canva Pty Ltd

Canva is an Australian graphic-design platform providing a web-based and mobile application for creating social media graphics, presentations, posters, documents and other visual content. It offers templates, stock photography, illustrations, fonts and drag-and-drop functionality through a freemium subscription model serving consumers, small businesses, educators and large enterprises. Founded in 2012 and headquartered in Sydney, the company operates globally with offices in Manila, Beijing and Austin, and supports over 100 languages.
