
Job Overview
Location
Sydney
Job Type
Full-time
Category
Engineering Manager
Date Posted
January 9, 2026
Full Job Description
📋 Description
- Lead the infrastructure that powers Canva’s next-generation creative AI. As Engineering Manager (Infra) – AI Reliability you will own the reliability, scale, and velocity of the systems that enable 100+ researchers in CORE (Canva Original Research & Exploration) to train, evaluate, and ship state-of-the-art models to 170 million monthly users. Every GPU cycle you optimise and every pipeline you harden directly translates into faster breakthroughs and richer creative experiences for our global community.
- Architect and scale multi-cloud, GPU-dense training and inference platforms spanning AWS, GCP, Cloudflare, and GCore. You will design fault-tolerant clusters that can burst from hundreds to thousands of A100/H100 GPUs in minutes, while keeping cost-per-experiment predictable and transparent. Your decisions will determine how quickly researchers can iterate on diffusion, transformer, and multimodal models that redefine design.
- Champion Infrastructure-as-Code excellence. Using Terraform, Helm, and custom tooling you will codify every network, storage, and compute layer so that environments are reproducible, auditable, and disposable. You will institute golden paths that let any researcher spin up a secure, compliant, high-performance workspace with a single CLI command.
- Elevate CI/CD for AI workflows. You will extend our GitHub Actions–based pipelines to support containerised training jobs, model registry promotions, canary releases, and automatic rollback on performance regression. Expect to integrate experiment-tracking (MLflow), data-versioning (DVC), and artefact caching so that the journey from Jupyter notebook to production endpoint is measured in hours, not weeks.
- Build world-class observability for AI workloads. You will define SLIs/SLOs for GPU utilisation, training throughput, inference latency, and cost-per-token. Using Prometheus, Grafana, Loki, and OpenTelemetry you will create dashboards that surface anomalies before researchers notice them and alerts that wake you (not the on-call) only when it matters.
- Foster a culture of DevOps best practices across CORE and the wider engineering org. You will coach a team of senior Site Reliability, Platform, and ML engineers, run blameless post-mortems, and institutionalise chaos-engineering drills that prove our systems are as resilient as our ambitions.
- Drive strategic alignment with Canva’s CORE leadership and cross-functional product teams. You will translate research roadmaps into infrastructure epics, negotiate cloud budgets, and present reliability wins to the CTO and executive staff. Your roadmap will balance bleeding-edge experimentation with rock-solid production stability.
- Stay hands-on. Whether it’s debugging a CUDA memory leak, tuning NCCL collectives, or reviewing Terraform modules, you will lead by example and keep your technical edge sharp. You will also represent Canva at meetups and conferences, sharing how we scale AI infrastructure for creativity at planetary scale.
Skills & Technologies
R
AWS
GCP
Kubernetes
Terraform
Onsite
About Canva Pty Ltd
Canva is an Australian graphic-design platform providing a web-based and mobile application for creating social media graphics, presentations, posters, documents and other visual content. It offers templates, stock photography, illustrations, fonts and drag-and-drop functionality through a freemium subscription model serving consumers, small businesses, educators and large enterprises. Founded in 2012 and headquartered in Sydney, the company operates globally with offices in Manila, Beijing and Austin, and supports over 100 languages.

