We are seeking an Infra Support Engineer to join the Global Infrastructure team. This role focuses on GPU system delivery, incident detection, triage, basic remediation, runbook execution, monitoring, and clear escalation to the SRE (Site Reliability Engineering) team, while helping improve operational runbooks and observability.
Responsibilities
- Provide first- and second-line technical support to customers for the AI Infrastructure (GPU/CPU nodes, networking, storage, orchestration, platform services) via ticketing systems, email, Slack, or other messaging systems.
- Monitor system health and service-level indicators (alerts, dashboards); respond to alerts 24x7 as scheduled.
- Triage incidents, gather context, verify scope and impact, and follow standard operating procedures and runbooks to perform immediate mitigations.
- Escalate to the global SRE team with clear, concise incident notes and relevant logs/traces.
- Maint...