This job has expired

This position was posted on March 7, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

Software Engineer, Infrastructure Platform

FluidStack Inc.

Job Overview

Location

San Francisco, CA

Job Type

Full-time

Full Job Description

📋 Description

• Fluidstack is at the forefront of building the infrastructure necessary for the advancement of artificial intelligence, partnering with leading AI labs, governments, and enterprises to provide high-speed compute power. We are driven by a mission to accelerate the realization of AGI, fostering a team that is deeply committed to delivering world-class infrastructure solutions. Our ethos centers on treating customer outcomes as our own, building trust through the excellence of our systems and the dedication of our team. If you are purpose-driven, relentlessly focused on excellence, and eager to contribute to a high-impact, fast-paced environment, join us in shaping the future of intelligence.
• As a Software Engineer, Infrastructure Platform at Fluidstack, you will play a pivotal role in constructing the core platforms that underpin our global infrastructure and data center operations. This is a unique opportunity to shape the foundational systems that enable our rapid growth and operational efficiency. You will be instrumental in developing comprehensive internal tooling across critical domains, including Configuration Management Database (CMDB), asset management, Data Center Infrastructure Management (DCIM), monitoring and observability, security, and operational automation. These tools will be designed to streamline the deployment, management, and operation of our infrastructure at an unprecedented scale.
• Your responsibilities will span the entire lifecycle of infrastructure management. You will design and build our next-generation CMDB system, establishing it as the definitive source of truth for all infrastructure assets, network topology, and configuration data. This will involve creating robust data models and efficient querying mechanisms to ensure data integrity and accessibility.
• You will also develop sophisticated DCIM platforms to manage critical data center operations, including rack operations, server and GPU deployment, operating system installation, quality assurance processes, and white-screen operations. This includes automating the provisioning and configuration of hardware to ensure rapid and reliable deployment.
• Furthermore, you will create end-to-end asset lifecycle management systems. These systems will meticulously track assets from receiving and racking through inventory management, break-fix workflows, and eventual decommissioning, ensuring optimal utilization and compliance.
• A significant focus will be on building advanced monitoring and observability platforms. This involves integrating telemetry data from various sources, including Building Management Systems (BMS), Electrical Power Monitoring Systems (EPMS), and IT devices. You will implement intelligent alarming and incident management capabilities to proactively identify and resolve issues before they impact services.
• To empower our teams and enhance operational efficiency, you will develop self-service portals and automation tools. These will facilitate new region bootstrapping, streamline day-2 operations, and enable fleet-scale management of our distributed infrastructure.
• A key objective is to eliminate manual toil through the development of workflow automation and self-service tooling. This will empower our operations and engineering teams, allowing them to focus on higher-value tasks. You will build robust workflow orchestration systems designed to manage complex, multi-step processes, including incident, problem, and change management.
• You will also be responsible for creating digital twin visualizations and operational dashboards that provide actionable insights into our infrastructure's performance and health. Close collaboration with our data teams will be essential for developing advanced analytics capabilities.
• You will develop integration layers that connect our internal platforms with external vendors and third-party systems, ensuring a cohesive and efficient operational ecosystem.
• This role demands strong cross-functional collaboration. You will work closely with data center operations, system engineering, network engineering, and security teams to deeply understand their requirements and deliver high-impact solutions. Partnering with product and business stakeholders will be crucial for prioritizing features, defining roadmaps, and effectively balancing competing needs. Close alignment with support and operations teams will ensure our platforms scale seamlessly with organizational growth.
• You will contribute to technical leadership by evaluating build vs. buy decisions for platform components, considering factors like scalability, cost, and flexibility. Championing modern development practices such as CI/CD, infrastructure-as-code, automated testing, and observability-first design will be paramount. Active participation in architecture reviews and design discussions will shape our technical direction and standards. Fostering technical excellence through rigorous code reviews, comprehensive documentation, and knowledge sharing will be a core aspect of the role.
• You will design high-performance, fault-tolerant systems that can handle thousands of queries per second (QPS) as our infrastructure footprint expands. You will build comprehensive monitoring, logging, and debugging capabilities with robust error handling. Implementing data migration strategies and carefully managing upstream/downstream dependencies during platform evolution will ensure system stability. You will own projects end-to-end, from concept through deployment, ensuring production readiness and operational excellence.

Skills & Technologies

Python

PostgreSQL

Redis

Docker

Terraform

DevOps

Onsite

$200k-250k

Degree Required

About FluidStack Inc.

FluidStack Inc. operates a distributed cloud platform that aggregates under-utilized GPUs in data centers and individual machines worldwide, renting them on-demand to AI researchers, startups, and enterprises for training and inference workloads. The company automates deployment, security, and billing, offering prices up to 80% below traditional hyperscalers while providing instant access to high-end NVIDIA A100, H100, and consumer GPUs through a single API and web console. Headquartered in London, FluidStack targets machine-learning engineers who need scalable, low-cost compute without long-term commitments, claiming thousands of active nodes and customers including Fortune 500 enterprises and leading research labs.