ML Ops Engineer

International Recruiting LLC

About the Role

Our client's Vision AI platform runs where the data is generated — on-premises, inside government facilities, and at the network edge — not in a hyperscaler cloud. That means the infrastructure has to be bulletproof: GPU clusters provisioned correctly, Kubernetes workloads scheduled efficiently across heterogeneous compute, storage performing at the throughput AI training and inference demands, and the network capable of handling high-bandwidth, low-latency sensor data at scale.

As an MLOps / AI Infrastructure Engineer, you will own all of it. You will rack, configure, and operate the on-premises compute and GPU infrastructure that powers the platform, build and maintain the Kubernetes clusters that orchestrate AI workloads, design the networking fabric that ties edge nodes to core compute, and implement the MLOps pipelines that take models from development to production. You will work directly with our AI/ML engineers, the Lead Architect, and on-site client technical teams to ensure the platform runs reliably in environments that are often air-gapped, physically secured, and subject to strict government compliance requirements.

Key Responsibilities

GPU Compute & Hardware Infrastructure

• Deploy, configure, and maintain on-premises GPU servers — primarily NVIDIA H200 and A100 nodes — including driver management, CUDA toolkit versioning, NVLink/NVSwitch topology, and firmware updates.

• Implement and tune NVIDIA-specific tooling: DCGM (Data Center GPU Manager) for health monitoring and telemetry, MIG (Multi-Instance GPU) partitioning for multi-tenant workloads, and NVIDIA Container Toolkit for GPU-aware containerization.

• Manage bare-metal provisioning workflows (iPXE, PXE, or tools such as MAAS/Foreman) to enable repeatable, auditable server builds at client sites.

• Monitor hardware health, capacity utilization, and thermal/power envelopes; define alerting thresholds and respond to hardware failures with minimal service disruption.

Kubernetes & Container Orchestration

• Build, upgrade, and maintain production-grade Kubernetes clusters (kubeadm or Rancher RKE2) on bare-metal infrastructure, with GPU node pools configured via the NVIDIA GPU Operator.

• Design and operate cluster networking using CNI plugins appropriate for high-throughput AI workloads — Calico, Cilium, or SR-IOV for RDMA-capable networking where required.

• Configure and manage MetalLB or equivalent bare-metal load balancing, ingress controllers, and service mesh components (Istio or Linkerd) for secure intra-cluster communication.

• Implement resource quotas, LimitRanges, PriorityClasses, and node affinity/taints to ensure AI training jobs, inference services, and platform workloads coexist without resource contention.

• Maintain cluster security posture: RBAC policies, Pod Security Admission, network policies, secrets management (HashiCorp Vault or Sealed Secrets), and CIS Kubernetes Benchmark compliance.

MLOps Pipelines & AI Workload Management

• Deploy and operate MLOps platforms (MLflow, Kubeflow, or equivalent) for experiment tracking, model versioning, and pipeline orchestration across training and inference workloads.

• Configure and manage NVIDIA Triton Inference Server for multi-model serving, dynamic batching, and model ensemble execution on GPU nodes.

• Build CI/CD pipelines for model deployment (GitOps with ArgoCD or Flux), including automated model validation, canary rollouts, and rollback mechanisms.

• Optimize GPU utilization for both batch training jobs (Volcano or Kueue scheduler) and latency-sensitive inference services, tracking efficiency metrics via DCGM and Prometheus.

• Manage model artifact storage and versioning using software-defined storage backends (Ceph RBD/CephFS or MinIO) integrated with the MLOps toolchain.

Networking & Storage Architecture

• Design and implement the high-bandwidth network fabric required for GPU cluster interconnects — InfiniBand, RoCE v2, or high-speed Ethernet — and ensure RDMA is correctly configured for distributed training workloads.

• Deploy and operate software-defined storage solutions (Ceph or equivalent) providing block, object, and file storage tiers for training datasets, model checkpoints, and platform telemetry.

• Configure network segmentation, VLANs, and firewall policies to meet NIST SP 800-171 requirements in on-premises and air-gapped environments; document network topology for client system security plans.

• Establish and maintain VPN or secure tunneling solutions for hybrid connectivity between edge nodes, on-premises clusters, and any permitted cloud services.

Security, Compliance & Documentation

• Implement infrastructure controls mapped to NIST SP 800-171 and CMMC requirements: access control, audit logging, configuration management, incident response readiness, and media protection.

• Maintain hardened OS baselines (RHEL/Rocky STIG or Ubuntu CIS benchmarks) across all infrastructure nodes; automate compliance scanning with OpenSCAP or equivalent.

• Produce and maintain infrastructure documentation required for government procurement: network diagrams, hardware inventories, system security plan (SSP) contributions, and disaster recovery runbooks.

• Support penetration testing engagements by providing accurate infrastructure context and remediating findings within agreed timelines.

Required Qualifications

• 6+ years of infrastructure engineering experience, with at least 3 years managing GPU compute clusters or HPC environments in production.

• Deep hands-on expertise with NVIDIA GPU infrastructure: driver lifecycle management, CUDA, DCGM, MIG, NVLink topologies, and the NVIDIA GPU Operator for Kubernetes.

• Production-level Kubernetes administration experience on bare-metal: cluster provisioning, upgrades, CNI/CSI configuration, RBAC, and day-2 operations.

• Strong networking fundamentals: BGP, VLAN segmentation, RDMA/RoCE or InfiniBand configuration, load balancing, and firewall policy management.

• Hands-on experience with software-defined storage (Ceph, Rook-Ceph, or MinIO) in AI/HPC workload contexts — performance tuning, capacity planning, and failure recovery.

• Practical MLOps experience: model serving infrastructure (Triton or equivalent), experiment tracking (MLflow or Kubeflow), and GitOps-based model deployment pipelines.

• Working knowledge of NIST SP 800-171 controls and the ability to translate them into concrete infrastructure configurations and audit evidence.

• Proficiency with infrastructure-as-code tooling: Terraform or Ansible for reproducible, auditable infrastructure builds.

• Strong Linux systems administration skills (RHEL/Rocky Linux or Ubuntu) including kernel tuning, storage I/O optimization, and systemd service management.

• Excellent written communication for producing infrastructure runbooks, network diagrams, and compliance documentation in a remote-first environment.

Nice to Have

• Experience with air-gapped or classified network environments and the operational discipline they require (offline package mirrors, USB-controlled media transfers, etc.).

• Familiarity with CMMC Level 2/3 assessment processes and evidence collection.

• Experience with NVIDIA DGX systems, BasePOD reference architectures, or the NVIDIA AI Enterprise software stack.

• Knowledge of distributed training frameworks (PyTorch DDP, DeepSpeed, Megatron-LM) and their infrastructure requirements — useful for supporting AI/ML engineering teammates.

• Experience deploying Kubernetes at the edge: K3s, MicroK8s, or NVIDIA Jetson-based edge clusters.

• Familiarity with observability stacks: Prometheus, Grafana, Loki, OpenTelemetry, and DCGM Exporter for GPU telemetry dashboards.

• US Person status or active security clearance — advantageous for certain client site engagements.

• Background in SCADA, ICS, or OT network environments relevant to critical infrastructure clients.

What the Client Offers

• Hands-on ownership of some of the most demanding AI infrastructure in the public sector — H200 GPU clusters, high-bandwidth interconnects, and purpose-built on-premises deployments.

• A technically rigorous environment where your infrastructure decisions directly affect the reliability of mission-critical government operations.

• Competitive, globally benchmarked compensation including base salary, equity, and performance bonus.

• Fully remote with an async-first culture; periodic travel to client facilities and team on-sites for cluster deployments and planning.

• Access to cutting-edge NVIDIA hardware, early access to new GPU generations, and budget for relevant certifications (NVIDIA, CKA/CKS, RHCSA, etc.).

• Collaboration with a Lead Architect and engineering team who understand infrastructure as a product — not just a cost center.

Vacancy posted a month ago
