Senior Software Engineer, DGX Cloud Production Engineering
NVIDIA
Santa Clara, United States
Location
Santa Clara
Posted
May 31, 2026
Job Description
NVIDIA DGX Cloud is building and operating large-scale GPU infrastructure for AI research and production workloads. We are looking for Senior Software Engineers to help build the automation, tooling, and operational systems that make GPU clusters reliable, scalable, and safe to run. This role is part of a production engineering team focused on Kubernetes-based infrastructure, GPU cluster operations, reliability, automation, GitOps, and Day 2 operability across DGX Cloud environments.
What you'll be doing:
+ Build and operate automation for large-scale GPU clusters across NVIDIA Cloud Partners (NCP) and on-prem environments.
+ Develop tools and services for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle operations.
+ Improve Day 0 / Day 1 / Day 2 workflows for cluster bringup, handoff, and production operations.
+ Reduce manual production touches through APIs, GitOps, automation, and agent-assisted workflows.
+ Participate i...