Principal Software Engineer, DGX Cloud Production Engineering
NVIDIA
Santa Clara, United States
Location
Santa Clara
Posted
June 24, 2026
Job Description
NVIDIA DGX Cloud is scaling GPU infrastructure across internal, partner, and cloud environments. We are looking for Principal Software Engineers to help shape the technical direction for production engineering, Kubernetes-based operations, automation, and reliability across large-scale GPU clusters.
This role is for senior technical leaders who can define architecture, lead through influence, build critical systems, and turn ambiguous infrastructure problems into durable software and operating models.
What you'll be doing:
+ Define and execute the technical strategy for DGX Cloud cluster operations, building the automation, GitOps, and Day 2 reliability needed to operate large-scale GPU clusters across NVIDIA Cloud Partners (NCPs) and on-prem environments.
+ Lead design and implementation of systems for cluster lifecycle, validation, repair, upgrades, observability, and readiness.
+ Establish patterns for Kubernetes-based GPU cluster operations acro...