
Job Overview
Location: London
Job Type: Full-time
Category: Software Engineering
Date Posted: March 7, 2026
Full Job Description
📋 Description
- Join Fluidstack Inc., a pioneering company at the forefront of building the infrastructure for abundant intelligence, and play a pivotal role in accelerating the future of AI.
- As a Site Reliability Engineer (SRE), you will be instrumental in ensuring the utmost reliability, performance, and scalability of our global GPU cloud, which powers cutting-edge AI research and enterprise solutions.
- You will operate at the intersection of software, hardware, and operations, collaborating closely with cross-functional teams including networking, platform engineering, and data center operations to architect and maintain systems capable of handling the immense demands of AI workloads.
- This is a hands-on role requiring deep systems knowledge, exceptional problem-solving skills, and strong communication abilities to tackle complex production issues, deploy resilient infrastructure, and continuously enhance the stability and observability of our platform.
- A typical day will involve deploying and managing large-scale GPU clusters, potentially numbering over 1,000 GPUs, utilizing and refining custom-written playbooks to meet specific customer requirements.
- You will be responsible for rigorously validating the correctness and performance of the underlying compute, storage, and networking infrastructure, working collaboratively with providers to optimize these critical subsystems.
- Contribute to significant data migration projects, moving petabytes of data from public cloud platforms to our local storage solutions with maximum speed and cost-effectiveness.
- Engage in deep-dive debugging across the entire technology stack, addressing issues ranging from physical hardware anomalies to complex software optimizations, such as improving S3 dataloader performance across different regions.
- Develop and implement internal tooling to significantly reduce deployment times and bolster cluster reliability, prioritizing automation where the customer benefits clearly justify the implementation effort.
- Participate in an on-call rotation, providing critical support for up to one week per month to ensure continuous operation of our global infrastructure.
- Embrace a customer-centric attitude, demonstrating an unwavering accountability mindset and a proactive bias to action in all your endeavors.
- Showcase a proven track record of shipping clean, well-documented code within complex and demanding production environments.
- Cultivate structure from chaos, adeptly navigate ambiguity, and remain adaptable to the ever-evolving and dynamic nature of the AI ecosystem.
- Leverage strong technical and interpersonal communication skills, maintain a low ego, and foster a positive mental attitude to contribute to a collaborative and high-performing team environment.
- This role offers a unique opportunity to work with leading AI labs and enterprises, contributing directly to the development of next-generation AI infrastructure and making a tangible impact on the future of intelligence.
- You will be empowered to make significant technical decisions and drive improvements that directly affect the performance and availability of our services, ensuring our customers can rely on Fluidstack for their most demanding AI computations.
- The role demands a proactive approach to identifying potential issues before they impact production, implementing preventative measures, and developing robust incident response plans.
- You will gain exposure to a wide array of technologies and challenges, from bare-metal hardware management to sophisticated distributed systems, providing continuous learning and professional growth opportunities.
- Contribute to the architectural design and implementation of new features and services, ensuring they meet our stringent reliability and performance standards.
- Collaborate with software engineers to integrate new features and applications into our production environment, ensuring seamless deployment and operation.
- Monitor system performance, identify bottlenecks, and implement optimizations to ensure efficient resource utilization and cost-effectiveness.
- Develop and maintain comprehensive documentation for systems, processes, and procedures, ensuring knowledge transfer and operational consistency.
- Participate in post-incident reviews to identify root causes, implement corrective actions, and share learnings across the team.
- Drive initiatives to improve the security posture of our infrastructure, working closely with security teams to implement best practices and mitigate risks.
- Contribute to the development and refinement of our CI/CD pipelines, ensuring efficient and reliable software delivery.
- You will be a key player in ensuring the stability and scalability of a platform that is critical to the advancement of artificial intelligence, working with a team that is passionate about pushing the boundaries of what's possible.
About FluidStack Inc.
FluidStack Inc. operates a distributed cloud platform that aggregates under-utilized GPUs in data centers and individual machines worldwide and rents them on demand to AI researchers, startups, and enterprises for training and inference workloads. The company automates deployment, security, and billing, offering prices up to 80% below those of traditional hyperscalers while providing instant access to high-end NVIDIA A100 and H100 GPUs, as well as consumer GPUs, through a single API and web console. Headquartered in London, FluidStack targets machine-learning engineers who need scalable, low-cost compute without long-term commitments. It claims thousands of active nodes and a customer base that includes Fortune 500 enterprises and leading research labs.


