AI Inference Performance Engineer

Company: NVIDIA

Location: Santa Clara, United States

Posted: June 03, 2026


📋 Job Description

We optimize and benchmark GenAI inference on NVIDIA's latest accelerators, defining the industry’s performance standards across language models, video generation, and speech workloads. We work directly within TensorRT-LLM, SGLang, and vLLM, building the tools that evaluate serving performance at scale. This team sits at the intersection of GPU performance engineering and public accountability.

What You Will Be Doing:
+ Drive industry benchmark results: own the end-to-end optimization pipeline, implementing and integrating optimizations in quantization, scheduling, memory management, and distributed inference across TensorRT-LLM, SGLang, and vLLM.
+ Define and optimize cutting-edge workloads: identify and shape next-generation inference benchmarks, including multi-turn coding, agentic workflows, and other emerging AI use cases. Collaborate with framework and kernel teams to push performance to the limit on large-scale LLM-MoE models, vision-language models, video diffusion mode...


