🇺🇸 USAJobs.work

America's Job Portal

Senior System Architect, Infrastructure Reliability

Company

NVIDIA

Location

Santa Clara, CA

Posted

June 15, 2026

Position Overview

NVIDIA is seeking a Senior System Architect, Heterogeneous EDA Systems, to solve a complex challenge in accelerated computing: failure attribution at scale. As EDA workloads scale across thousands of heterogeneous nodes, a single failure can waste a massive amount of compute resources. We need an engineer to design and build an automated framework that ingests telemetry from CPU and GPU clusters to identify the root cause of job failures in real time. The framework will distinguish between hardware faults, infrastructure instability, and software defects.

What you'll be doing:
+ Architect Failure Attribution Frameworks: Build a scalable flight recorder for EDA jobs that captures high-fidelity state across the CPU, GPU, and Fabric at the moment of failure.
+ Build automated diagnostics that correlate GPU XID errors, PCIe bus failures, and CUDA memory exceptions. Connect these errors with system-level events such as OOM kills or NUMA-related hangs.
+ Distr...
