Role Overview
We're looking for a Senior Hardware Support Engineer to safeguard the reliability of our production hardware infrastructure at scale. In this role, you'll serve as a technical leader for resolving complex hardware and firmware issues, bridging operations, engineering, and vendor teams to maintain fleet stability and system uptime.
Key Responsibilities
- Lead in-depth investigations into hardware and firmware failures, tracing issues from symptom to root cause
- Identify systemic reliability trends by analyzing recurring errors and field data
- Serve as the primary technical contact during critical hardware incidents affecting performance or availability
- Collaborate with hardware vendors to expedite diagnostics, replacements, and corrective actions
- Work closely with internal engineering groups to test and validate fixes before broad deployment
- Conduct pre-deployment validation of server hardware and firmware updates
- Apply structured problem-solving frameworks to manage and document incident resolution
- Provide technical guidance to on-site teams during urgent hardware interventions
- Enhance tools and processes for monitoring, tracking, and reporting hardware failures
- Help shape long-term strategies to improve fleet-wide hardware resilience
Required Qualifications
- Proven experience supporting server hardware in large-scale or data center environments
- Track record of diagnosing and resolving complex hardware and firmware issues
- Solid knowledge of server subsystems including CPU, memory, storage, power, networking, and BMC
- Experience engaging with hardware vendors to resolve production escalations
- Proficiency in structured troubleshooting using formal incident or problem management practices
- Strong analytical skills with the ability to interpret system logs, telemetry, and error patterns
- Experience coordinating with field operations teams during hardware events
- Ability to manage multiple high-priority investigations simultaneously
- Clear and effective communication skills in cross-team settings
Preferred Qualifications
- Background in GPU-intensive, AI, or high-performance computing environments
- Experience managing firmware lifecycle and validating large-scale rollouts
- Familiarity with Linux-based infrastructure and operational tooling
- History of improving hardware reliability metrics across large server fleets
Benefits
- Comprehensive medical, dental, and vision insurance
- 401(k) plan with company match
- Flexible paid time off policy
- Paid parental leave
- Support for professional development and learning
Compensation
Annual salary range: $125,000 – $180,000. Additional compensation includes an annual performance-based bonus.
Work Mode
This position is remote within the United States. Occasional travel may be required for on-site coordination or critical hardware events. The role includes participation in incident escalation coverage for production-impacting issues.
Company Culture
Our environment encourages initiative and innovation, with a strong emphasis on collaboration and technical excellence. We support professional growth and work at the forefront of AI cloud infrastructure, building systems that power next-generation applications.