Infrastructure Ops Engineer

Baseten

Baseten powers mission-critical inference for the world's most dynamic AI companies. By uniting applied AI research, flexible infrastructure, and seamless developer tooling, we enable companies operating at the frontier of AI to bring cutting-edge models into production. Join us and help build the platform that engineers turn to when shipping AI products.

As an Infrastructure Ops Engineer at Baseten, you are the operational engine of our global infrastructure. You will sit at the intersection of technical customer success and infrastructure engineering, partnering closely with our SRE and FDE teams to execute the complex hardware lifecycles that power our fleet.

This role is designed for high-energy, detail-oriented individuals who thrive on technical coordination and operational excellence. You aren't just managing clusters; you are acting as the technical glue between our customers' needs and our hardware reality, ensuring that high-level capacity strategies are translated into boots-on-the-ground results.

While you will be hands-on with Kubernetes and cloud-native tools, your success is measured by your ability to project-manage the resolution of capacity puzzles, ensuring our platform remains reliable, observable, and ready for the next massive AI deployment.

Example Initiatives
  • The "Lost Node" Investigation: Debugging cluster-level blockers to determine why pods aren't scheduling despite available capacity
  • Regional Compliance Guard: Auditing and correcting scheduling policies to ensure customer data stays within specified geographical constraints (e.g., EU-only vs US-only)
  • High-Stakes Maintenance Orchestration: Coordinating critical maintenance cycles both externally (with vendors) and internally (with Baseten SREs) to evacuate workloads from unhealthy nodes and integrate replacement hardware with zero customer disruption
Responsibilities
  • Fleet Maintenance: Manage daily node operations including tainting/untainting, node draining, and PVC repairs to ensure GPU fleet health and operational cost control
  • GTM & Capacity Fulfillment: Partner with Sales and account teams to scope and fulfill customer capacity requests, translating complex timelines into concrete infrastructure actions and clear ETAs
  • Process & Observability Engineering: Identify recurring gaps in the capacity lifecycle (intake, triage, comms) and drive fixes by defining lightweight processes and improving system observability
  • Technical Orchestration: Act as the operational bridge between SRE and Infra teams, executing discrete changes and verifying system status during high-stakes maintenance windows
  • Technical Documentation: Contribute to the internal knowledge base for GPU-specific issues (H100/A100/B200) to accelerate future incident resolution
  • Automation & Tooling: Identify repetitive workflows and partner with engineering to build scripts, dashboards, and internal tools that reduce manual intervention and shorten time-to-mitigation
  • Knowledge Excellence: Maintain a living database of GPU-specific intelligence (H100/B200) and market moves to accelerate incident resolution and support strategic briefings for leadership
Requirements
  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field
  • 2+ years of professional work experience, ideally in a customer-facing technical role or as a junior SRE/Cloud Engineer
  • Strong familiarity with Kubernetes and the lifecycle of cloud-based container orchestration
  • Strong ownership mindset and attention to detail, demonstrated through fast detection, clear communication, and reliable follow-through
  • Demonstrated ability to communicate complex technical blockers clearly to both internal engineering teams and external vendors
  • Preference for SF or NYC-based candidates to foster a close-knit "family" atmosphere in the office
Benefits
  • Competitive compensation, including meaningful equity
  • 100% coverage of medical, dental, and vision insurance for employee and dependents
  • Flexible PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
  • Paid parental leave
  • Fertility and family-building stipend through Carrot
  • Company-facilitated 401(k)
  • Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities

Apply now to embark on a rewarding journey in shaping the future of AI! If you are a motivated individual with a passion for machine learning and a desire to be part of a collaborative and forward-thinking team, we would love to hear from you.

At Baseten, we are committed to fostering a diverse and inclusive workplace. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity or expression, national origin, age, genetic information, disability, or veteran status.

We are an Equal Opportunity Employer and will consider qualified applicants with criminal histories in a manner consistent with applicable law (for example, the requirements of the San Francisco Fair Chance Ordinance, where applicable).

Vacancy posted 13 hours ago