Role Overview
We are seeking a Senior Hardware Support Engineer to lead hardware reliability efforts in high-scale, production-critical environments. This individual will serve as a technical authority for diagnosing and resolving complex hardware and firmware issues, ensuring sustained system availability and performance. The role bridges engineering, operations, and vendor teams, focusing on root cause identification, corrective action, and long-term platform improvements.
Key Responsibilities
- Lead in-depth investigations into hardware and firmware failures affecting production systems
- Identify recurring failure patterns and drive systemic fixes to improve fleet reliability
- Serve as the primary escalation point for critical hardware incidents impacting service
- Collaborate with hardware vendors to expedite diagnostics, replacements, and firmware updates
- Work closely with internal engineering to validate solutions and prevent future issues
- Conduct pre-deployment testing of hardware and firmware for large-scale rollouts
- Apply structured problem-solving frameworks to diagnose and document technical issues
- Support on-site teams during urgent hardware events with technical guidance
- Enhance tools and processes for monitoring, tracking, and reporting hardware failures
- Contribute to strategic initiatives that improve long-term hardware resilience
Required Qualifications
- Proven experience supporting server hardware in large-scale data center environments
- Track record of diagnosing hardware and firmware root causes in production systems
- Strong knowledge of server subsystems including CPU, memory, storage, power, and BMC
- Experience managing technical escalations with hardware vendors and engineering teams
- Proficiency in structured troubleshooting using incident or problem management practices
- Ability to analyze logs, telemetry, and error data to identify failure trends
- Experience coordinating with field operations during critical hardware events
- Capacity to manage multiple high-impact investigations simultaneously
- Clear and effective communication skills in cross-functional settings
Preferred Qualifications
- Background in GPU-intensive, AI, or high-performance computing infrastructure
- Experience managing firmware updates and validation at scale
- Familiarity with Linux systems and infrastructure automation tools
- History of improving hardware reliability metrics across large fleets
Benefits
- Comprehensive medical, dental, and vision insurance
- 401(k) plan with employer contribution
- Flexible paid time off policy
- Paid parental leave
- Support for professional growth and development
Work Environment
This role supports remote work within the United States. Occasional travel may be necessary for on-site coordination or urgent hardware events. The position includes participation in incident response rotations for production-impacting issues.
Compensation
Base salary range: $125,000 – $180,000 per year, with an annual performance-based bonus. Equity may be offered based on experience and role scope.
Company Culture
The team values initiative, technical depth, and innovation. We operate in a collaborative, fast-moving environment focused on advancing AI cloud infrastructure. The organization supports global expansion and encourages contributions to long-term technological advancement.