Location
Redmond
Posted
June 22, 2026
Job Description
We are now looking for a Senior Software Engineer for AI Resiliency!
At NVIDIA, we are pushing the boundaries of what's possible in AI. We are currently seeking a Senior Software Engineer to lead the development of AI software resiliency for the most powerful AI supercomputers in the world. As a member of our AI Software Resiliency team, you will play a pivotal role in defining and implementing critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs. Your expertise will be crucial in driving down cluster downtime towards zero, ensuring that our AI systems remain robust and reliable at all times.
What You'll Be Doing:
+ Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at a massive scale, such as fast checkpoint-recovery, error detection, error isolation, and straggler/hang detection.
+ Hands-On Coding & Optimization: Contribute to large-scale distributed syst...