Engineering Manager (Infra) - AI Reliability (ANZ Remote)

Job Overview

Location: Sydney
Job Type: Full-time
Category: Engineering Manager
Date Posted: January 9, 2026

Full Job Description

📋 Description

  • Lead the infrastructure that powers Canva’s next-generation creative AI. As Engineering Manager (Infra) – AI Reliability, you will own the reliability, scale, and velocity of the systems that enable 100+ researchers in CORE (Canva Original Research & Exploration) to train, evaluate, and ship state-of-the-art models to 170 million monthly users. Every GPU cycle you optimise and every pipeline you harden translates directly into faster breakthroughs and richer creative experiences for our global community.
  • Architect and scale multi-cloud, GPU-dense training and inference platforms spanning AWS, GCP, Cloudflare, and GCore. You will design fault-tolerant clusters that can burst from hundreds to thousands of A100/H100 GPUs in minutes while keeping cost-per-experiment predictable and transparent. Your decisions will determine how quickly researchers can iterate on the diffusion, transformer, and multimodal models that redefine design.
  • Champion Infrastructure-as-Code excellence. Using Terraform, Helm, and custom tooling, you will codify every network, storage, and compute layer so that environments are reproducible, auditable, and disposable. You will institute golden paths that let any researcher spin up a secure, compliant, high-performance workspace with a single CLI command.
  • Elevate CI/CD for AI workflows. You will extend our GitHub Actions–based pipelines to support containerised training jobs, model registry promotions, canary releases, and automatic rollback on performance regression. Expect to integrate experiment tracking (MLflow), data versioning (DVC), and artefact caching so that the journey from Jupyter notebook to production endpoint is measured in hours, not weeks.
  • Build world-class observability for AI workloads. You will define SLIs/SLOs for GPU utilisation, training throughput, inference latency, and cost-per-token. Using Prometheus, Grafana, Loki, and OpenTelemetry, you will create dashboards that surface anomalies before researchers notice them and alerts that wake you (not the on-call) only when it matters.
  • Foster a culture of DevOps best practices across CORE and the wider engineering org. You will coach a team of senior Site Reliability, Platform, and ML engineers, run blameless post-mortems, and institutionalise chaos-engineering drills that prove our systems are as resilient as our ambitions.
  • Drive strategic alignment with Canva’s CORE leadership and cross-functional product teams. You will translate research roadmaps into infrastructure epics, negotiate cloud budgets, and present reliability wins to the CTO and executive staff. Your roadmap will balance bleeding-edge experimentation with rock-solid production stability.
  • Stay hands-on. Whether it’s debugging a CUDA memory leak, tuning NCCL collectives, or reviewing Terraform modules, you will lead by example and keep your technical edge sharp. You will also represent Canva at meetups and conferences, sharing how we scale AI infrastructure for creativity at planetary scale.

Skills & Technologies

R
AWS
GCP
Kubernetes
Terraform
Onsite

About Canva Pty Ltd

Canva is an Australian graphic-design platform providing a web-based and mobile application for creating social media graphics, presentations, posters, documents and other visual content. It offers templates, stock photography, illustrations, fonts and drag-and-drop functionality through a freemium subscription model serving consumers, small businesses, educators and large enterprises. Founded in 2012 and headquartered in Sydney, the company operates globally with offices in Manila, Beijing and Austin, and supports over 100 languages.
