πŸ‡ΊπŸ‡Έ USAJobs.work

America's Job Portal

← Back to USA Jobs

Principal Software Engineer, At-Scale Reliability and Fleet Intelligence β€” CSP Engagements

Company

NVIDIA

Location

Santa Clara, CA

Posted

July 04, 2026

Position Overview

We're looking for a Principal Software Engineer to join our CSP Engagements team as the technical focal point for fleet-scale reliability, working directly with engineering teams of key CSP / hyperscale customers to ensure NVIDIA platforms achieve target MTBI (Mean Time Between Interruptions) in production. In this role, you will augment NVIDIA's internal software/firmware and quality teams with a dedicated CSP-facing focus. You will drive work streams with CSP engineering teams to build shared understanding of reliability software/firmware architecture, methodology, incorporate their fleet telemetry and failure data into NVIDIA's improvement priorities, and validate that reliability improvements measured in the lab translate to real customer environments. Your cross-CSP visibility enables you to distinguish systemic architectural gaps from environmental or configuration-specific issues that no single customer engagement could identify alone.


What you'll be doing:
+ D...

Ready to Apply?

Join thousands of Americans building their careers

Apply Now