Senior System Architect, Infrastructure Reliability
NVIDIA
Santa Clara, United States
Location
Santa Clara
Posted
June 06, 2026
Job Description
NVIDIA is seeking a Senior System Architect, Heterogeneous EDA Systems, to solve a complex challenge in accelerated computing: Failure Attribution at Scale. As EDA workloads scale across thousands of heterogeneous nodes, a single failure can waste massive amounts of compute resources. We need an engineer to design and build an automated framework that ingests telemetry from CPU and GPU clusters, identifies the root cause of job failures in real time, and distinguishes between hardware faults, infrastructure instability, and software defects.
What you'll be doing:
+ Architect Failure Attribution Frameworks: Build a scalable flight recorder for EDA jobs that captures high-fidelity state across the CPU, GPU, and Fabric at the moment of failure.
+ Build automated diagnostics that correlate GPU XID errors, PCIe bus failures, and CUDA memory exceptions. Connect these errors with system-level events such as OOM kills or NUMA-related hangs.
+ Distr...