
HiringNearMe.work

Local Jobs, Zero Commute

Senior System Architect, Infrastructure Reliability

🏒
NVIDIA
πŸ“ Santa Clara, United States
πŸ“
Location Santa Clara
πŸ“…
Posted June 06, 2026
πŸš—
Commute Local Area
🎯
Local Opportunity Near You!

This job is in your area. Enjoy a short commute and work close to home.

Job Description

NVIDIA is seeking a Senior System Architect: Heterogeneous EDA Systems to solve a complex challenge in accelerated computing: Failure Attribution at Scale. As EDA workloads scale across thousands of heterogeneous nodes, a single failure can cause massive resource waste. We need an engineer to develop and build an automated framework that ingests telemetry from CPU and GPU clusters to identify the root cause of job failures in real time and distinguishes between hardware faults, infrastructure instability, and software defects.

What you'll be doing:
+ Architect Failure Attribution Frameworks: Build a scalable flight recorder for EDA jobs that captures high-fidelity state across the CPU, GPU, and Fabric at the moment of failure.
+ Build automated diagnostics that correlate GPU XID errors, PCIe bus failures, and CUDA memory exceptions. Connect these errors with system-level events such as OOM kills or NUMA-related hangs.
+ Distr...
