Research Scientist, Agentic Data & Benchmarking

Institute of Foundation Models

Job Description

About the Institute of Foundation Models

The Institute of Foundation Models (IFM) is a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you'll work at the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You'll help build groundbreaking AI systems with the potential to reshape entire industries, and contribute to establishing MBZUAI as a global hub for high-performance computing and deep learning.

About the role

The Agents team trains advanced agentic language models that use reasoning and tool use to complete real tasks on a computer. This is a specialist role at the center of the loop that drives those models: the data we train on and the benchmarks we measure against.

You'll own the agentic data pipeline end-to-end — sourcing and generating high-quality trajectories, tool-use data, and RL environments — and the evaluation suite that tells us, rigorously and reproducibly, what our agents can actually do. These two halves are inseparable: benchmarks expose where models fail, and targeted data closes the gap. The agents are only as good as the data they learn from and the evals that keep us honest, and this role owns both.

This is a research scientist position for someone who wants depth in data and measurement rather than breadth across the whole stack. You should be the kind of person who reads through datasets line by line, distrusts a metric until it's been validated, and gets satisfaction from making an eval suite that nobody questions.

Key responsibilities

Benchmarking & evaluation

Design and run evaluations of agentic capabilities — multi-step reasoning, tool use, long-horizon planning, computer use, and safety properties — turning ambiguous notions of "intelligence" into defensible, reproducible metrics.

Build and harden evaluation harnesses so benchmarks run reliably at scale against training checkpoints, with clear signal on regressions and model health.

Run experiments characterizing how prompting, sampling, scaffolding, and environment design affect agentic performance on internal and public benchmarks.

Diagnose anomalous eval results mid-training-run — determine whether the cause is the model, the data, the harness, or the infrastructure — and communicate the answer clearly.

Agentic data

Source, generate, and curate high-quality agentic training data: trajectories, tool-use traces, and task datasets for new capabilities.

Design and scale RL environments and reward signals, and measure their impact on model performance.

Manage technical relationships with external data vendors and domain experts, evaluating data quality and iterating quickly on feedback.

Develop QA frameworks that catch reward hacking, label noise, and contamination, keeping data and benchmark quality high.

Across both

Contribute to technical reports, research publications, and open-source benchmarks and tooling.

Partner with research and product teams to translate capability goals into measurable data and evaluation artifacts.

Qualifications

Academic qualifications

BS, MS, or PhD (or equivalent experience) in Computer Science, Machine Learning, or a related field.

Minimum qualifications

2+ years of experience with a clear emphasis on evaluations and/or training-data curation for ML systems (related areas: LLM training/fine-tuning, RL, or distributed ML systems).

Strong Python and PyTorch development experience.

Demonstrated experience designing and deep-diving into evaluations, or curating and generating training datasets — ideally both.

Hands-on experience using LLM agents in your personal or professional work.

A habit of reading through raw data and trajectories to understand them and spot issues, and an instinct to distrust a metric until it's validated.

Preferred qualifications

Experience with reinforcement learning, reward design, or RL environment construction for LLMs.

Background in statistics and experimental design — a feel for signal-to-noise, statistical power, and contamination in evaluations.

Experience with large-scale dataset sourcing, curation, and processing, including working with external vendors or domain experts.

Strong knowledge of the literature on agent evaluation, RL, LLM reasoning, and tool use.

Experience building or operating data pipelines and evaluation infrastructure that run reliably at scale (e.g., with PyTorch or Ray).

Experience evaluating or generating data for software-engineering or computer-use agents.

Contributions to published research, public benchmarks, and/or open-source ML software.

Representative projects

Stand up a new agentic benchmark from scratch — define the task, build the dataset and scoring, validate against known signals, and ship a view that makes the result legible to researchers and leadership.

Build an RL environment for a new high-value capability: design the reward, generate and QA the trajectory data, and measure the lift on model performance.

Diagnose a mid-training regression: an eval suite returns anomalous numbers and you determine whether it's the model, the harness, the data, or the infrastructure.

Partner with an external data vendor or domain expert to source high-quality trajectories, then build the QA framework that keeps reward hacking and contamination out.

Take a flaky distributed eval pipeline and make it reliable — better retries, better observability, faster feedback to researchers.

Salary Range

The posted salary range represents the company’s good-faith estimate of the compensation for this position upon hire. The actual compensation offered may vary within this range depending on individual qualifications, including but not limited to relevant skills, experience, education, certifications, geographic location, and specific business needs.

We encourage you to apply even if you don't meet every qualification listed. Strong candidates rarely match every line, and we'd rather hear from you than have you rule yourself out.


