Role Overview
Hands-on technical support role focused on maintaining and troubleshooting high-performance computing (HPC) and GPU-based AI infrastructure in a lab/data center environment. This position supports Linux-based systems, hardware infrastructure, and compute clusters.
Roles and Responsibilities
- Rack, stack, cable, and maintain servers and AI lab/data center hardware.
- Troubleshoot and repair GPU/CPU servers, compute clusters, and networking (including InfiniBand).
- Perform hardware diagnostics and replace/install server components in racks.
- Monitor and troubleshoot Linux-based systems, storage, and networking.
- Use Linux command-line tools for system validation, debugging, and performance checks.
- Collaborate with engineers to ensure stable operation of AI infrastructure.
- Manage lab space and support hardware inventory...