Role Overview

We are seeking a Senior Hardware Support Engineer to lead hardware reliability efforts in high-scale, production-critical environments. This individual will serve as a technical authority for diagnosing and resolving complex hardware and firmware issues, ensuring sustained system availability and performance. The role bridges engineering, operations, and vendor teams, focusing on root cause identification, corrective action, and long-term platform improvements.

Key Responsibilities

Lead in-depth investigations into hardware and firmware failures affecting production systems
Identify recurring failure patterns and drive systemic fixes to improve fleet reliability
Serve as the primary escalation point for critical hardware incidents impacting service
Collaborate with hardware vendors to expedite diagnostics, replacements, and firmware updates
Work closely with internal engineering to validate solutions and prevent future issues
Conduct pre-deployment testing of hardware and firmware for large-scale rollouts
Apply structured problem-solving frameworks to diagnose and document technical issues
Support on-site teams during urgent hardware events with technical guidance
Enhance tools and processes for monitoring, tracking, and reporting hardware failures
Contribute to strategic initiatives that improve long-term hardware resilience

Required Qualifications

Proven experience supporting server hardware in large-scale data center environments
Track record of diagnosing hardware and firmware root causes in production systems
Strong knowledge of server subsystems including CPU, memory, storage, power, and BMC
Experience managing technical escalations with hardware vendors and engineering teams
Proficiency in structured troubleshooting using incident or problem management practices
Ability to analyze logs, telemetry, and error data to identify failure trends
Experience coordinating with field operations during critical hardware events
Capacity to manage multiple high-impact investigations simultaneously
Clear and effective communication skills in cross-functional settings

Preferred Qualifications

Background in GPU-intensive, AI, or high-performance computing infrastructure
Experience managing firmware updates and validation at scale
Familiarity with Linux systems and infrastructure automation tools
History of improving hardware reliability metrics across large fleets

Benefits

Comprehensive medical, dental, and vision insurance
401(k) plan with employer contribution
Flexible paid time off policy
Paid parental leave
Support for professional growth and development

Work Environment

This role supports remote work within the United States. Occasional travel may be necessary for on-site coordination or urgent hardware events. The position includes participation in incident response rotations for production-impacting issues.

Compensation

Base salary range: $125,000 – $180,000 per year, with an annual performance-based bonus. Equity may be offered based on experience and role scope.

Company Culture

The team values initiative, technical depth, and innovation. We operate in a collaborative, fast-moving environment focused on advancing AI cloud infrastructure. The organization supports global expansion and encourages contributions to long-term technological advancement.

Nebius is hiring a Senior Hardware Support Engineer

Key Responsibilities

Required Qualifications

Preferred Qualifications

Benefits

Work Environment

Compensation

Company Culture

Similar Jobs

Network Engineer

Senior Site Reliability Engineer

Technical Support Engineer

Openstack Technical Support Engineer

Site Reliability Engineers

Senior Engineer - Site Reliability Engineering