
Job Overview
Location: San Francisco, CA - US
Job Type: Full-time
Category: Software Engineering
Date Posted: June 4, 2026
Full Job Description
📋 Description
- Own production reliability for Crusoe’s global network infrastructure, including edge, backbone, data center fabric, and GPU cluster interconnects, directly impacting the availability of AI workloads running on hundreds of thousands of GPUs.
- Lead end-to-end incident response for high-severity network events, including rapid mitigation, stakeholder communication, and postmortem documentation to prevent recurrence.
- Drive root cause analysis for production incidents, identify systemic issues, author remediation plans, and track them to closure across multi-region, hyperscale environments.
- Define and own network service level indicators (SLIs) and service level objectives (SLOs) in partnership with Architecture and Site Reliability teams, backed by real-time dashboards and alerting.
- Author, maintain, and enforce runbooks, escalation playbooks, and standard operating procedures (SOPs) used by the broader operations team for network reliability.
- Improve network observability by enhancing monitoring stacks using streaming telemetry, SNMP, NetFlow, sFlow, and tools such as Kentik, Grafana, Prometheus, and ThousandEyes.
- Build and maintain Python-based automation tooling for auto-remediation of known failure modes to reduce toil and accelerate mean time to resolution.
- Serve as the senior escalation point during critical network outages, providing technical leadership and decision-making under pressure in 24/7 operational environments.
- Mentor Staff and Senior Network Engineers through technical guidance, post-incident reviews, and knowledge sharing to build a culture of operational excellence.
- Operate and troubleshoot large-scale network hardware fleets exceeding 10,000 devices across multi-vendor environments including Arista (EOS), Juniper (Junos), and NVIDIA/Mellanox platforms.
- Configure and optimize lossless RDMA/RoCE (v1 and v2) fabrics for GPU and HPC workloads, including tuning of PFC, ECN, and DCQCN.
- Maintain expert-level operational fluency in BGP, EVPN-VXLAN, IS-IS, OSPF, MPLS, QoS, and TCP/IP within production data center Clos architectures.
- Collaborate with engineering and product leadership to align network reliability metrics with business-critical AI workload requirements.
- Ensure continuous improvement of network monitoring, alerting, and diagnostic capabilities to support hyperscale AI infrastructure at global scale.
🎯 Requirements
- 12+ years of production network engineering experience with a demonstrated focus on large-scale operations, incident response, and reliability in hyperscale or internet-scale environments.
- Hands-on experience operating RDMA/RoCE (v1 and v2) lossless fabrics for GPU and HPC workloads, including PFC, ECN, and DCQCN tuning.
- Expert knowledge of Arista (EOS), Juniper (Junos), and NVIDIA/Mellanox platforms in leaf-spine Clos architectures across multi-vendor environments.
- Proficiency in Python for developing auto-remediation scripts, diagnostic tooling, and operational workflows to reduce toil and accelerate incident resolution.
- Demonstrated experience defining and owning network SLIs and SLOs in partnership with engineering and product leadership.
- Comfort operating 10K+ device fleets and leading incident response across multi-region environments with 24/7 on-call responsibility.
🏖️ Benefits
- Competitive compensation in the range of $225,000 - $275,000 + bonus
- Restricted Stock Units included in all offers
- Comprehensive health, dental & vision insurance
- Employer contributions to HSA
- Paid parental leave
- 401(k) retirement plan with company match up to 4% of salary
About Crusoe Energy Systems LLC
Crusoe Energy Systems is building Crusoe Cloud, an AI cloud platform that provides managed AI services and AI data center infrastructure. It caters to businesses seeking to accelerate AI solution development with optimized models and high-performance computing. The company uses environmentally aligned power sources, including wind, solar, and natural gas, to power its data centers. With features like managed Kubernetes and Slurm, Crusoe simplifies operations and ensures reliability with 24/7 support. Crusoe is expanding its reach, including a strategic European expansion with its first data center in Norway. Crusoe recently raised $1.375 billion at a valuation above $10 billion.