(Storm3) Research Scientist, Agentic Data & Benchmarking

$150k
Full-time

Institute of Foundation Models
Sunnyvale, CA

About the Institute of Foundation Models

The Institute of Foundation Models (IFM) is a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy. As part of our team, you'll work at the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You'll help build groundbreaking AI systems with the potential to reshape entire industries, and contribute to establishing MBZUAI as a global hub for high-performance computing and deep learning.

About the role

The Agents team trains advanced agentic language models that use reasoning and tool use to complete real tasks on a computer. This is a specialist role at the center of the loop that drives those models: the data we train on and the benchmarks we measure against. You'll own the agentic data pipeline end-to-end (sourcing and generating high-quality trajectories, tool-use data, and RL environments) and the evaluation suite that tells us, rigorously and reproducibly, what our agents can actually do. These two halves are inseparable: benchmarks expose where models fail, and targeted data closes the gap. The agents are only as good as the data they learn from and the evals that keep us honest, and this role owns both.

This is a research scientist position for someone who wants depth in data and measurement rather than breadth across the whole stack. You should be the kind of person who reads through datasets line by line, distrusts a metric until it's been validated, and gets satisfaction from making an eval suite that nobody questions.

Key responsibilities

Benchmarking & evaluation
  • Design and run evaluations of agentic capabilities (multi-step reasoning, tool use, long-horizon planning, computer use, and safety properties), turning ambiguous notions of "intelligence" into defensible, reproducible metrics.
  • Build and harden evaluation harnesses so benchmarks run reliably at scale against training checkpoints, with clear signal on regressions and model health.
  • Run experiments characterizing how prompting, sampling, scaffolding, and environment design affect agentic performance on internal and public benchmarks.
  • Diagnose anomalous eval results mid-training-run: determine whether the cause is the model, the data, the harness, or the infrastructure, and communicate the answer clearly.

Agentic data
  • Source, generate, and curate high-quality agentic training data: trajectories, tool-use traces, and task datasets for new capabilities.
  • Design and scale RL environments and reward signals, and measure their impact on model performance.
  • Manage technical relationships with external data vendors and domain experts, evaluating data quality and iterating quickly on feedback.
  • Develop QA frameworks that catch reward hacking, label noise, and contamination, keeping data and benchmark quality high.

Across both
  • Contribute to technical reports, research publications, and open-source benchmarks and tooling.
  • Partner with research and product teams to translate capability goals into measurable data and evaluation artifacts.

Qualifications

Academic qualifications
  • BS, MS, or PhD (or equivalent experience) in Computer Science, Machine Learning, or a related field.

Minimum qualifications
  • 2+ years of experience with a clear emphasis on evaluations and/or training-data curation for ML systems (related areas: LLM training/fine-tuning, RL, or distributed ML systems).
  • Strong Python and PyTorch development experience.
  • Demonstrated experience designing and deep-diving into evaluations, or curating and generating training datasets, ideally both.
  • Hands-on experience using LLM agents in your personal or professional work.
  • A habit of reading through raw data and trajectories to understand them and spot issues, and an instinct to distrust a metric until it's validated.

Preferred qualifications
  • Experience with reinforcement learning, reward design, or RL environment construction for LLMs.
  • Background in statistics and experimental design: a feel for signal-to-noise, statistical power, and contamination in evaluations.
  • Experience with large-scale dataset sourcing, curation, and processing, including working with external vendors or domain experts.
  • Strong knowledge of the literature on agent evaluation, RL, LLM reasoning, and tool use.
  • Experience building or operating data pipelines and evaluation infrastructure reliable at scale (e.g., PyTorch, Ray).
  • Experience evaluating or generating data for software-engineering or computer-use agents.
  • Contributions to published research, public benchmarks, and/or open-source ML software.

Representative projects
  • Stand up a new agentic benchmark from scratch: define the task, build the dataset and scoring, validate against known signals, and ship a view that makes the result legible to researchers and leadership.
  • Build an RL environment for a new high-value capability: design the reward, generate and QA the trajectory data, and measure the lift on model performance.
  • Diagnose a mid-training regression: an eval suite returns anomalous numbers and you determine whether it's the model, the harness, the data, or the infrastructure.
  • Partner with an external data vendor or domain expert to source high-quality trajectories, then build the QA framework that keeps reward hacking and contamination out.
  • Take a flaky distributed eval pipeline and make it reliable: better retries, better observability, faster feedback to researchers.

Salary Range

$150,000 - $450,000 a year

The posted salary range represents the company’s good faith estimate of the compensation for this position upon hire. The actual compensation offered may vary within this range depending on individual qualifications, including but not limited to relevant skills, experience, education, certifications, geographic location, and specific business needs.

We encourage you to apply even if you don't meet every qualification listed. Strong candidates rarely match every line, and we'd rather hear from you than have you rule yourself out.

Vacancy posted 1 day ago