ML Systems Engineer, Infrastructure & Cloud

Basis Research Institute

About Basis

Basis is a nonprofit applied AI research organization with two mutually reinforcing goals.

The first is to understand and build intelligence. This means to establish the mathematical principles of what it means to reason, to learn, to make decisions, to understand, and to explain; and to construct software that implements these principles.

The second is to advance society’s ability to solve intractable problems. This means expanding the scale, complexity, and breadth of problems that we can solve today, and even more importantly, accelerating our ability to solve problems in the future.

To achieve these goals, we’re building both a new technological foundation that draws inspiration from how humans reason, and a new kind of collaborative organization that puts human values first.

About the Role

ML Systems Engineers at Basis ensure training and evaluation infrastructure is fast, reliable, and scalable. You will own the full stack from distributed training frameworks through cloud administration, making it possible for researchers to iterate quickly on complex models while managing computational resources efficiently.

We are looking for engineers who combine deep understanding of ML systems with operational excellence. The ideal ML Systems Engineer has experience with distributed training at scale, understands the intricacies of debugging numerical instabilities, and can manage cloud infrastructure that scales from experiments to production. You will be the guardian of training stability, the optimizer of compute costs, and the enabler of reproducible research.

This role spans traditional ML engineering and cloud/DevOps responsibilities. You will manage GPU clusters, optimize cloud spending, ensure security and compliance, and build the infrastructure that lets researchers focus on algorithms rather than operations.

We seek individuals who aspire to build robust ML infrastructure, maintain “logbook culture” for documenting issues and solutions, and treat operational excellence as a first-class concern.

We expect you to:
  • Have demonstrated expertise in ML systems engineering. Examples include:
      • Managing distributed training jobs across hundreds of GPUs
      • Debugging and fixing numerical instabilities in large-scale training
      • Building infrastructure for reproducible ML experiments
      • Optimizing training throughput and resource utilization
  • Possess deep knowledge of distributed training frameworks, including PyTorch/JAX distributed strategies (DDP, FSDP, ZeRO), gradient accumulation, mixed precision training, and checkpoint/recovery systems.
  • Have strong cloud administration skills, including AWS/GCP/Azure services, infrastructure as code (Terraform), Kubernetes orchestration, cost optimization, security best practices, and compliance requirements.
  • Understand the full ML stack from hardware (GPUs, interconnects, storage) through frameworks (PyTorch, JAX) to high-level training loops and evaluation pipelines.
  • Be skilled at debugging complex failures across the stack: GPU/NCCL issues, data loading bottlenecks, memory leaks, gradient explosions, and convergence problems.
  • Value documentation and knowledge sharing. You maintain comprehensive logs of issues encountered, solutions found, and lessons learned, building institutional knowledge.
  • Progress with autonomy while coordinating closely with researchers. You can anticipate infrastructure needs, prevent problems before they occur, and respond quickly when issues arise.

In addition, the following would be an advantage:
  • Experience at organizations training large models (OpenAI, Anthropic, Google, Meta).
  • Background in both ML research and production systems.
  • Contributions to ML frameworks or distributed training libraries.
  • Experience with on-premise GPU cluster management.
  • Knowledge of optimization theory and numerical methods.
  • Understanding of robotics-specific infrastructure requirements.

Responsibilities:
  • Own distributed training infrastructure, including job launchers, checkpointing systems, recovery mechanisms, and monitoring that ensures experiments run reliably at scale.
  • Debug and resolve training failures by diagnosing issues across GPUs, networking, numerics, and data pipelines, maintaining detailed logs of problems and solutions.
  • Profile and optimize training performance by identifying bottlenecks in data loading, gradient computation, and communication overhead, and by implementing solutions that improve step time.
  • Manage cloud infrastructure and costs, including capacity planning, spot instance strategies, storage optimization, and building tools that give researchers visibility into resource usage.
  • Implement security and compliance measures, including access controls, data encryption, and audit logging, and ensure infrastructure meets requirements for handling sensitive data.
  • Build evaluation and benchmarking infrastructure that enables consistent, reproducible measurement of model performance across different conditions and datasets.
  • Develop monitoring and alerting systems that detect anomalies in training metrics, resource utilization, or system health, enabling rapid response to issues.
  • Maintain development environments, including containerization, dependency management, and tools that ensure researchers can reproduce results across different systems.
  • Document and share knowledge through runbooks, post-mortems, and training materials that help the team understand and operate ML infrastructure effectively.
  • Collaborate with researchers to understand requirements, suggest infrastructure solutions, and ensure systems support rather than constrain research goals.

Role Details

Exceptional candidates who may not meet all of the following criteria are still encouraged to apply.
  • FT/PT: Full-time.
  • In-person Policy: We are in the office four days a week. Be prepared to attend multi-day Basis-wide in-person events.
  • Location: New York City or Cambridge, MA.
  • Salary range: Competitive salary.

Privacy Notice

By submitting your application, you grant Basis permission to use your materials for both hiring evaluation and recruitment-related research and development purposes. Your information may be processed in different countries, including the US. You retain copyright while providing Basis a license to use these materials for the stated purposes. Read our full Global Data Privacy Notice here.

Vacancy posted 3 days ago