Site Reliability Engineer

Job Overview

Location: London
Job Type: Full-time
Category: Software Engineering
Date Posted: March 7, 2026

Full Job Description

📋 Description

  • Join Fluidstack Inc., a pioneering company at the forefront of building the infrastructure for abundant intelligence, and play a pivotal role in accelerating the future of AI.
  • As a Site Reliability Engineer (SRE), you will be instrumental in ensuring the utmost reliability, performance, and scalability of our global GPU cloud, which powers cutting-edge AI research and enterprise solutions.
  • You will operate at the intersection of software, hardware, and operations, collaborating closely with cross-functional teams including networking, platform engineering, and data center operations to architect and maintain systems capable of handling the immense demands of AI workloads.
  • This is a hands-on role requiring deep systems knowledge, exceptional problem-solving skills, and strong communication abilities to tackle complex production issues, deploy resilient infrastructure, and continuously enhance the stability and observability of our platform.
  • A typical day will involve deploying and managing large-scale GPU clusters, potentially numbering over 1,000 GPUs, utilizing and refining custom-written playbooks to meet specific customer requirements.
  • You will be responsible for rigorously validating the correctness and performance of the underlying compute, storage, and networking infrastructure, working collaboratively with providers to optimize these critical subsystems.
  • Contribute to significant data migration projects, moving petabytes of data from public cloud platforms to our local storage solutions with maximum speed and cost-effectiveness.
  • Engage in deep-dive debugging across the entire technology stack, addressing issues ranging from physical hardware anomalies to complex software optimizations, such as improving S3 dataloader performance across different regions.
  • Develop and implement internal tooling to significantly reduce deployment times and bolster cluster reliability, prioritizing automation where the customer benefits clearly justify the implementation effort.
  • Participate in an on-call rotation, providing critical support for up to one week per month to ensure continuous operation of our global infrastructure.
  • Embrace a customer-centric attitude, demonstrating an unwavering accountability mindset and a proactive bias to action in all your endeavors.
  • Showcase a proven track record of shipping clean, well-documented code within complex and demanding production environments.
  • Cultivate structure from chaos, adeptly navigate ambiguity, and remain adaptable to the ever-evolving and dynamic nature of the AI ecosystem.
  • Leverage strong technical and interpersonal communication skills, maintain a low ego, and foster a positive mental attitude to contribute to a collaborative and high-performing team environment.
  • This role offers a unique opportunity to work with leading AI labs and enterprises, contributing directly to the development of next-generation AI infrastructure and making a tangible impact on the future of intelligence.
  • You will be empowered to make significant technical decisions and drive improvements that directly affect the performance and availability of our services, ensuring our customers can rely on Fluidstack for their most demanding AI computations.
  • The role demands a proactive approach to identifying potential issues before they impact production, implementing preventative measures, and developing robust incident response plans.
  • You will gain exposure to a wide array of technologies and challenges, from bare-metal hardware management to sophisticated distributed systems, providing continuous learning and professional growth opportunities.
  • Contribute to the architectural design and implementation of new features and services, ensuring they meet our stringent reliability and performance standards.
  • Collaborate with software engineers to integrate new features and applications into our production environment, ensuring seamless deployment and operation.
  • Monitor system performance, identify bottlenecks, and implement optimizations to ensure efficient resource utilization and cost-effectiveness.
  • Develop and maintain comprehensive documentation for systems, processes, and procedures, ensuring knowledge transfer and operational consistency.
  • Participate in post-incident reviews to identify root causes, implement corrective actions, and share learnings across the team.
  • Drive initiatives to improve the security posture of our infrastructure, working closely with security teams to implement best practices and mitigate risks.
  • Contribute to the development and refinement of our CI/CD pipelines, ensuring efficient and reliable software delivery.
  • You will be a key player in ensuring the stability and scalability of a platform that is critical to the advancement of artificial intelligence, working with a team that is passionate about pushing the boundaries of what's possible.

Skills & Technologies

Python
Kubernetes
Terraform
Onsite
$175k-320k

About FluidStack Inc.

FluidStack Inc. operates a distributed cloud platform that aggregates under-utilized GPUs in data centers and individual machines worldwide, renting them on demand to AI researchers, startups, and enterprises for training and inference workloads. The company automates deployment, security, and billing, offering prices up to 80% below traditional hyperscalers while providing instant access to high-end NVIDIA A100, H100, and consumer GPUs through a single API and web console. Headquartered in London, FluidStack targets machine-learning engineers who need scalable, low-cost compute without long-term commitments, claiming thousands of active nodes and customers including Fortune 500 enterprises and leading research labs.
