This job has expired

This position was posted on February 27, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

Senior Data Center Operations System Engineer - Los Angeles, CA

Lambda Inc.

Job Overview

Location

Vernon, CA - Data Center

Job Type

Full-time

Full Job Description

📋 Description

• As a Senior Data Center Operations System Engineer at Lambda, you will be at the forefront of building and maintaining the world's most advanced AI cloud infrastructure. This critical role involves ensuring the seamless operation, deployment, and troubleshooting of cutting-edge server, storage, and network hardware within our state-of-the-art data centers. You will be instrumental in scaling our operations to meet the ever-increasing demands of AI research and enterprise hyperscalers, directly contributing to Lambda's mission of making compute as ubiquitous as electricity.
• Your primary responsibilities will include the meticulous installation and configuration of new infrastructure. This encompasses everything from the physical racking, precise labeling, and organized cabling of servers, storage arrays, and network devices, to the initial software configuration required to bring these systems online. You will ensure that every component is deployed according to stringent standards, laying the groundwork for reliable and efficient operations.
• A significant part of your role will involve deep-dive troubleshooting of complex hardware and software issues. You will tackle challenges within some of the most advanced GPU and networking systems available, diagnosing problems, implementing solutions, and keeping downtime to a minimum. This requires a proactive approach to identifying potential issues before they impact service.
• Maintaining accurate and up-to-date documentation is paramount. You will be responsible for meticulously documenting and updating the data center layout and network topology using our Data Center Infrastructure Management (DCIM) software. This ensures that all teams have a clear, real-time understanding of our physical and logical infrastructure, facilitating efficient management and rapid response.
• Collaboration with our supply chain and manufacturing teams is essential for the timely deployment of systems. You will work closely with these departments to align project plans for large-scale deployments, ensuring that hardware arrives on schedule and is ready for integration into our operational environment.
• You will manage a critical parts depot inventory, meticulously tracking equipment through its entire lifecycle – from delivery and storage to staging, deployment, and final handoff in each of our data centers. This ensures that spare parts are available when needed and that inventory levels are optimized.
• Partnering with Hardware Support teams, you will act as a technical escalation point for hardware incidents that present higher-level troubleshooting challenges. Your expertise will be crucial in resolving these complex issues, ensuring they are thoroughly reported, and disseminating effective solutions across the broader operations organization to prevent recurrence.
• You will work closely with the Return Material Authorization (RMA) team to ensure that faulty parts are promptly returned and that necessary replacements are ordered and tracked, maintaining the integrity and availability of our hardware.
• Adhering to and improving installation standards, Methods of Procedure (MOPs), and runbooks is key to driving consistency and discoverability across all Lambda data centers. You will not only follow existing best practices but also actively contribute to refining and enhancing them, ensuring our operational playbooks are robust and efficient.
• As a technical escalation point for data center infrastructure issues, you will provide expert guidance and resolution for critical problems. This includes participating in an on-call rotation as a primary escalation point for data center incidents, so that our critical infrastructure has 24/7 support.
• You will also collaborate with product management, support, and other cross-functional teams to ensure our operational capabilities align with company goals. This involves translating high-level business priorities into concrete technical and operational requirements, supporting projects where infrastructure plays a pivotal role.
• This role requires a proactive, action-oriented individual who is willing to mentor and train junior staff on best practices, fostering a culture of continuous learning and operational excellence. You will also be expected to travel out of state up to 30% of the time to support the bring-up of new data center locations and assist with critical deployments.

🎯 Requirements

• Proven experience in installing, configuring, and troubleshooting server, storage, and network hardware within a data center environment.
• Strong understanding of critical infrastructure systems including power distribution, airflow management, environmental monitoring, and DCIM software.
• Familiarity with network fundamentals, including structured cabling, fiber optics, and basic network testing procedures.
• Experience with hardware lifecycle management, including inventory tracking and RMA processes.

🏖️ Benefits

• Competitive salary and equity compensation.
• Comprehensive health, dental, and vision insurance for you and your dependents.
• Generous and flexible paid time off plan.
• 401(k) plan with a company match (USA employees).
• Wellness and commuter stipends for select roles.

Skills & Technologies

Fiber

Linux

SSL

Senior

Onsite

About Lambda Inc.

Lambda Inc. provides cloud-based GPU clusters and workstations for artificial-intelligence research and development. The company designs and operates high-performance hardware infrastructure optimized for machine-learning workloads, offering on-demand access to NVIDIA GPUs, pre-configured deep-learning software stacks, and scalable storage. Customers include AI labs, universities, and enterprises training large language and computer-vision models. Founded in 2012, Lambda is headquartered in San Francisco and maintains data centers across North America and Europe.
