ML Systems Engineer, Infrastructure & Cloud
Basis Research Institute
About Basis

Basis is a nonprofit applied AI research organization with two mutually reinforcing goals.

The first is to understand and build intelligence. This means to establish the mathematical principles of what it means to reason, to learn, to make decisions, to understand, and to explain; and to construct software that implements these principles.

The second is to advance society’s ability to solve intractable problems. This means expanding the scale, complexity, and breadth of problems that we can solve today, and even more importantly, accelerating our ability to solve problems in the future.

To achieve these goals, we’re building both a new technological foundation that draws inspiration from how humans reason, and a new kind of collaborative organization that puts human values first.

About the Role

ML Systems Engineers at Basis ensure training and evaluation infrastructure is fast, reliable, and scalable. You will own the full stack from distributed training frameworks through cloud administration, making it possible for researchers to iterate quickly on complex models while managing computational resources efficiently.

We are looking for engineers who combine deep understanding of ML systems with operational excellence. The ideal ML Systems Engineer has experience with distributed training at scale, understands the intricacies of debugging numerical instabilities, and can manage cloud infrastructure that scales from experiments to production. You will be the guardian of training stability, the optimizer of compute costs, and the enabler of reproducible research.

This role spans traditional ML engineering and cloud/DevOps responsibilities. You will manage GPU clusters, optimize cloud spending, ensure security and compliance, and build the infrastructure that lets researchers focus on algorithms rather than operations.
We seek individuals who aspire to build robust ML infrastructure, maintain “logbook culture” for documenting issues and solutions, and treat operational excellence as a first-class concern.

We expect you to:

- Have demonstrated expertise in ML systems engineering. Examples include:
  - Managing distributed training jobs across hundreds of GPUs
  - Debugging and fixing numerical instabilities in large-scale training
  - Building infrastructure for reproducible ML experiments
  - Optimizing training throughput and resource utilization
- Possess deep knowledge of distributed training frameworks, including PyTorch/JAX distributed strategies (DDP, FSDP, ZeRO), gradient accumulation, mixed precision training, and checkpoint/recovery systems.
- Have strong cloud administration skills, including AWS/GCP/Azure services, infrastructure as code (Terraform), Kubernetes orchestration, cost optimization, security best practices, and compliance requirements.
- Understand the full ML stack from hardware (GPUs, interconnects, storage) through frameworks (PyTorch, JAX) to high-level training loops and evaluation pipelines.
- Be skilled at debugging complex failures across the stack: GPU/NCCL issues, data loading bottlenecks, memory leaks, gradient explosions, and convergence problems.
- Value documentation and knowledge sharing. You maintain comprehensive logs of issues encountered, solutions found, and lessons learned, building institutional knowledge.
- Progress with autonomy while coordinating closely with researchers. You can anticipate infrastructure needs, prevent problems before they occur, and respond quickly when issues arise.

In addition, the following would be an advantage:

- Experience at organizations training large models (OpenAI, Anthropic, Google, Meta).
- Background in both ML research and production systems.
- Contributions to ML frameworks or distributed training libraries.
- Experience with on‑premise GPU cluster management.
- Knowledge of optimization theory and numerical methods.
- Understanding of robotics‑specific infrastructure requirements.

Responsibilities:

- Own distributed training infrastructure, including job launchers, checkpointing systems, recovery mechanisms, and monitoring that ensures experiments run reliably at scale.
- Debug and resolve training failures by diagnosing issues across GPUs, networking, numerics, and data pipelines, maintaining detailed logs of problems and solutions.
- Profile and optimize training performance by identifying bottlenecks in data loading, gradient computation, and communication overhead, and implementing solutions that improve step time.
- Manage cloud infrastructure and costs, including capacity planning, spot instance strategies, storage optimization, and building tools that give researchers visibility into resource usage.
- Implement security and compliance measures, including access controls, data encryption, audit logging, and ensuring infrastructure meets requirements for handling sensitive data.
- Build evaluation and benchmarking infrastructure that enables consistent, reproducible measurement of model performance across different conditions and datasets.
- Develop monitoring and alerting systems that detect anomalies in training metrics, resource utilization, or system health, enabling rapid response to issues.
- Maintain development environments, including containerization, dependency management, and tools that ensure researchers can reproduce results across different systems.
- Document and share knowledge through runbooks, post‑mortems, and training materials that help the team understand and operate ML infrastructure effectively.
- Collaborate with researchers to understand requirements, suggest infrastructure solutions, and ensure systems support rather than constrain research goals.

Role Details

Exceptional candidates who may not meet all of the following criteria are still encouraged to apply.

- FT/PT: Full‑time.
- In‑person Policy: We are in the office four days a week. Be prepared to attend multi‑day Basis‑wide in‑person events.
- Location: New York City or Cambridge, MA.
- Salary range: Competitive salary.

Privacy Notice

By submitting your application, you grant Basis permission to use your materials for both hiring evaluation and recruitment‑related research and development purposes. Your information may be processed in different countries, including the US. You retain copyright while providing Basis a license to use these materials for the stated purposes. Read our full Global Data Privacy Notice here.

