Principal TPM - AI Infrastructure
$90.1k - $199.5k
Oracle
Job Description
The AI Infrastructure GPU Operations Team drives deployment planning, execution governance, operational readiness, reliability, and business rhythm for OCI's rapidly expanding GPU infrastructure portfolio. As Principal Technical Program Manager, you will lead cross-functional programs that connect engineering, platform, operations, business, finance, observability, SRE, network, and leadership teams across complex GPU operations initiatives.
You will own operating mechanisms for regional deployment readiness, GPU fleet health, milestone tracking, executive reporting, incident and change governance, risk management, and operational handoff across multiple concurrent GPU operations programs. This role requires strong program discipline, business analytics capability, and the ability to turn ambiguous technical and operational inputs into clear priorities, metrics, decisions, and action plans.
You will also improve the way the organization scales by strengthening dashboards, telemetry, documentation, onboarding, playbooks, repeatable processes, and the practical use of AI to improve operations productivity. The ideal candidate brings crisp communication, strong ownership, and pragmatic simplification to high-visibility GPU operations programs where disciplined execution, customer impact, and measurable reliability outcomes matter.
You are a structured, data-driven program leader who values simplicity, scalability, reliability, and clear operational mechanisms. You thrive in collaborative environments, communicate crisply with senior stakeholders, and drive consistent execution through ownership, metrics, and disciplined follow-through. You combine strategic clarity with enough technical and operational depth to help teams deliver reliable OCI AI Infrastructure GPU Operations while continuously improving the processes, telemetry, and automation that support it.
Travel: as needed for cross-site coordination, stakeholder alignment, and partner engagements.
Responsibilities
Key Responsibilities
GPU Fleet Operations & Reliability
- Drive availability and reliability of large-scale GPU fleets, identifying systemic issues and leading cross-functional recovery efforts.
- Support operational readiness and performance of distributed AI training and inference workloads across multi-region GPU clusters.
- Lead GPU fleet health reviews across current and next-generation hardware, including NVIDIA H200, B200, GB200/GB300 platforms and AMD Instinct MI300X, MI325X, MI350X, MI355X, and related platforms.
Program Leadership & Execution
- Own end-to-end execution of critical AI Infrastructure GPU Operations programs, ensuring alignment with business priorities, customer needs, and operational risk signals.
- Set and run weekly operating cadences and governance forums across multiple concurrent initiatives, ensuring clear ownership, timelines, dependencies, decision points, and committed actions.
- Coordinate cross-functional delivery across engineering, platform, operations, business operations, finance, observability, SRE, network, and senior leadership stakeholders.
Incident, Change & Deployment Governance
- Manage deployment governance, change review, readiness tracking, stakeholder handoff, and operational execution processes.
- Establish and scale structured incident management mechanisms, improving root cause analysis, corrective and preventive actions, and follow-through on durable fixes.
- Serve as a primary escalation point between engineering and operations teams, resolving priority conflicts and accelerating issue resolution.
- Lead Change Review Board processes for high-volume change activity, minimizing change-related incidents and protecting service quality.
Business Planning, Metrics & Executive Reporting
- Build, model, and maintain business planning inputs, financial forecasts, analytical views, and operating reports for AI Infrastructure GPU Operations programs.
- Own executive-level reporting, including monthly business reviews, weekly operational KPIs, critical project updates, risks, dependencies, decisions, and mitigation plans.
- Provide data-driven insights into infrastructure performance, operational risk, customer impact, and measurable program outcomes for senior leadership.
Cross-Functional & Stakeholder Engagement
- Strengthen partnerships with hardware vendors, cloud platform teams, SRE, cloud engineering, network teams, and other internal stakeholders to improve issue resolution and operational efficiency.
- Translate complex technical, operational, and business situations into accurate narratives, recommendations, and action plans for senior stakeholders.
- Drive structured escalation and bug reporting mechanisms that reduce time-to-resolution for critical issues.
Operational Excellence, Optimization & AI Productivity
- Create and maintain documentation, playbooks, onboarding materials, runbooks, and repeatable processes that reduce ambiguity and improve execution quality.
- Drive practical use of AI and automation to improve operations productivity, reduce manual toil, accelerate triage, improve ticket prioritization, and strengthen repeatability across GPU operations workflows.
- Partner with observability and telemetry teams to improve infrastructure visibility, including RDMA telemetry, network fabric health, service health metrics, and operational dashboarding.
- Lead continuous improvement efforts such as validation frameworks, version set validation, link flap analysis, and long-tail performance optimization.
- Monitor and improve operational health across technologies such as RoCE, InfiniBand, and large-scale data center networks.
Qualifications / Experience
- 5+ years of experience in technical program management, program operations, business operations, data analysis, infrastructure operations, or a related discipline.
- Demonstrated ability to lead complex, cross-functional initiatives with measurable outcomes across technical, operations, business, and customer-facing stakeholders.
- Strong operational background with experience building cadences, governance mechanisms, KPI reporting, incident/change processes, risk management processes, or readiness programs.
- Strong written and verbal communication skills; comfortable synthesizing complex technical and operational information into executive updates, recommendations, and decisions.
- A high degree of organization and ability to manage multiple competing priorities independently through ambiguity.
- Experience identifying, measuring, and adjusting execution plans against key business, operational, reliability, or delivery metrics.
- Advanced Excel skills, including pivots, lookups, conditional logic, data modeling, and financial or operational analysis.
- Experience developing dashboards, automated reporting, or analytical tools that provide reliable business and operational visibility.
- Working knowledge of PowerPoint, Jira, Confluence, and related collaboration or delivery management tools.
Preferred / Nice to Have
- Experience with cloud infrastructure, AI/ML infrastructure, GPU operations, data center deployment, capacity planning, or large-scale platform operations.
- Experience supporting large GPU fleets, distributed AI training or inference workloads, or performance-sensitive infrastructure environments.
- Experience with incident management, root cause analysis, corrective and preventive action tracking, Change Review Board processes, or high-volume change governance.
- Familiarity with observability, telemetry, RDMA, RoCE, InfiniBand, network fabric health, service health metrics, ticket/incident analytics, or operational dashboarding.
- Finance, business planning, workforce planning, or operational readiness experience in a technology organization.
- Track record of influencing senior business and technology leaders without relying on direct authority.
Disclaimer:
Certain U.S. based or U.S. customer or client-facing roles may be required to comply with applicable requirements, such as immunization/occupational health mandates, and/or drug testing requirements.
Range and benefit information provided in this posting is specific to the stated locations only.
US: Hiring Range in USD from: $90,100 to $199,500 per annum. May be eligible for bonus and equity.
Oracle maintains broad salary ranges for its roles in order to account for variations in knowledge, skills, experience, market conditions and locations, as well as reflect Oracle's differing products, industries and lines of business.
Candidates are typically placed into the range based on the preceding factors as well as internal peer equity.
Oracle US offers a comprehensive benefits package which includes the following:
Medical, dental, and vision insurance, including expert medical opinion
Short term disability and long term disability
Life insurance and AD&D
Supplemental life insurance (Employee/Spouse/Child)
Health care and dependent care Flexible Spending Accounts
Pre-tax commuter and parking benefits
401(k) Savings and Investment Plan with company match
Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
11 paid holidays
Paid sick leave: 72 hours of paid sick leave upon date of hire. Refreshes each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours.
Paid parental leave
Adoption assistance
Employee Stock Purchase Plan
Financial planning and group legal
Voluntary benefits including auto, homeowner and pet insurance
The role will generally accept applications for at least three calendar days from the posting date or as long as the job remains posted.
Career Level - IC4
About Us
Only Oracle brings together the data, infrastructure, applications, and expertise to power everything from industry innovations to life-saving care. And with AI embedded across our products and services, we help customers turn that promise into a better future for all. Discover your potential at a company leading the way in AI and cloud solutions that impact billions of lives.
True innovation starts when everyone is empowered to contribute. That's why we're committed to growing a workforce that promotes opportunities for all with competitive benefits that support our people with flexible medical, life insurance, and retirement options. We also encourage employees to give back to their communities through our volunteer programs.
We're committed to including people with disabilities at all stages of the employment process. If you require accessibility assistance or accommodation for a disability at any point, let us know by emailing [email address not shown] or by calling [phone number not shown] in the United States.
Oracle is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans' status, or any other characteristic protected by law. Oracle will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.


