Staff ML Research Scientist, LLM Evaluations & Benchmarks
Scale AI, Inc.
A leading AI evaluation company is looking for a Staff Machine Learning Research Scientist to advance LLM evaluation methodologies. This role involves designing benchmarks, collaborating with teams, and mentoring others. Ideal candidates have significant experience in NLP and have published research in major conferences. Competitive compensation includes salary, equity, and comprehensive benefits. Preferred location is San Francisco.
$264.8k - $331k
...leading data and evaluation partner for... ...evaluation and benchmarking of large language... ...industry-leading LLM evals, setting... .... Our Research teams work with... ...research. As a Staff Machine Learning Research Scientist on the LLM Evals... ...pipelines using modern ML frameworks....
Full time
- A leading AI evaluation firm based in San Francisco seeks a Machine Learning Scientist to foster understanding of AI model performance. You'll engage in designing and analyzing comprehensive experiments while collaborating across teams. Applicants should possess a PhD in...
$197.4k - $246.75k
Scale Labs, Research Scientist - Frontier Risk Evaluations As the leading data and evaluation partner... ...and instrumenting ML pipelines, writing evaluation... ...crafting evaluations and benchmarks, or a background in data science roles related to LLM technologies. Experience...
Full time
$302.4k - $378k
...building upon our prior model evaluation work with enterprise... ...team, part of Scale's Research organization, brings... ...and RL reward signals, benchmarking autonomous agent performance... ...not only expertise in LLM agents and planning... ...published research in top ML venues (e.g., ACL,...
Full time
$250k - $350k
...of what's possible in LLM post-training. If you love... ..., and turning research insights into products... ...distillation, training, evaluation, and planet-scale hosting... ...settings Create robust benchmarks and evaluation frameworks... ...performance Stay current with ML research and identify...
Work at office
$252k - $315k
...Machine Learning Research Scientist, Reasoning San Francisco, CA; Seattle... ...Building on our history of model evaluation with enterprise and... ...types critical for advancing LLM-based agents, including browser... ...of published research in top ML and NLP venues (e.g., ACL, EMNLP...
Full time
- ...Staff / Principal ML Training Systems Engineer We are building... ...heavily in research, infrastructure, and... ...Develop automated benchmarking and regression detection... ...directly with research scientists and ML engineers in... ...variable-length data Evaluation cadence and rollout...
$250k - $325k
...RAG [2023] Large-scale LLM‑based legal fact... ...What, and Who Why: AI Researchers are the engine of innovation... ...legal‑specific tasks. Evaluate emerging work in agentic... ...and maintain datasets, benchmarks, and evals for training... ...Publications at top venues, e.g., ML/AI conferences (NeurIPS...
Contract work, Immediate start
- Fleet AI, Inc. is seeking a Research Scientist to join their core research team in San Francisco. This role focuses on investigating... ...with leading labs. Key responsibilities include generating benchmarks to evaluate frontier models, automating environment construction for...
- ...agents, and we are hiring a Research Scientist to advance the neuro-symbolic... .... Publish at top AI and ML conferences (NeurIPS, ICML,... ...with full observability and benchmarking. Engage with the Bay Area academic... ..., etc.). Familiarity with LLM reasoning, retrieval-augmented...
Work at office, Relocation
$197.4k - $246.75k
Scale Labs, Research Scientist - AI Controls and Monitoring As the leading data and evaluation partner for frontier AI companies, Scale plays... ...to establish standards and benchmarks for AI monitoring and... ...experience addressing sophisticated ML problems, whether in a...
$315k - $340k
[Expression of Interest] Research Scientist/Engineer, Honesty About Anthropic... ...comprehensive honesty benchmarks and evaluation frameworks Implement techniques... .../PhD in Computer Science, ML, or related field Possess... ...: Currently, we expect all staff to be in one of our offices...
Full time, Work at office, Visa sponsorship, Flexible hours
$350k
...growing group of committed researchers, engineers, policy... ...full stack of audio ML, developing audio codecs... ..., from performance benchmarking to kernel optimization... ...audio Creating robust evaluation methodologies for hard... ...Currently, we expect all staff to be in one of our...
Full time, Work at office, Visa sponsorship, Flexible hours, Shift work
$280k
...group of committed researchers, engineers, policy... ...yourself as both a scientist and an engineer. As... ...tooling to efficiently evaluate the effectiveness of novel LLM-generated... ...significant software, ML, or research engineering... ..., we expect all staff to be in one of our...
Contract work, For contractors, For subcontractor, Work at office, Relocation, Visa sponsorship, Work visa, Flexible hours
$357k
...Responsibilities Workato's AI Research Lab is seeking an... ...Lead AI Research Scientist to join our growing team... ...workflow graphs, and agent evaluation frameworks, while... ...PyTorch or JAX and modern LLM frameworks. ~ Proven... ...publication record in top-tier ML venues (NeurIPS, ICML,...
Work at office, Remote work, Flexible hours
- ...Cohere is a team of researchers, engineers, designers,... ...not a typical "Applied Scientist" or "ML Engineer" role. As a Member of Technical Staff, Applied ML, you will:... ...domains, design custom LLM solutions, and deliver... ...agent integrations, model evaluations, and SOTA modeling...
Full time, Work at office, Remote work, Flexible hours
- ...Principal AI Security & Risk Researcher to join our founding research... ...methodologies Stay current with LLM vulnerabilities, adversarial techniques... ...agentic systems Develop risk evaluation methodologies that adapt as... ..., with 2+ years focused on AI/ML security, red teaming, or...
Part time, Remote work, Flexible hours
$240k - $300k
...Data teams to create thought‑leading research, craft benchmark reports, and publish insights that make... ...subject‑matter experts, data scientists, and customers to surface exclusive insights... ...Experience analyzing markets and evaluating data from multiple sources to produce...
Work at office, Remote work, 3 days per week
$240k - $300k
Staff Economist San Francisco, California, United States Staff... ...to create thought-leading research, craft benchmark reports, and publish... ...‑matter experts, data scientists, and customers to surface exclusive... ...analyzing markets and evaluating data from multiple sources...
Work at office, Remote work, Work from home, 3 days per week
$200k - $280k
...performance computing for ML. Are comfortable... .... Have a solid research foundation in your area... ...rollout collection and evaluation cheaper. Use these... ...Establish metrics, benchmarks, and experimentation frameworks... ...technical leadership (Staff level) Set...
Full time
$150k - $250k
...goods, and global social organizations. We research and deploy technologies that power AI-... ...to drive incremental improvements on benchmarks or optimize an existing process but... ...progress is measured. Researchers design evaluation frameworks that capture reasoning depth...
Work at office, 3 days per week
- ...care. Founded by Stanford AI scientists with deep clinical... ...Overview We are seeking an ML Scientist (Research) to advance Knowtex's voice... ...focuses on developing and evaluating novel machine learning approaches... ...Transformer-based LLM architectures Triton Inference...
$141.1k - $262.1k
...development. Roche's Research and Early... ...(AI) to assist our scientists in both pRED and gRED... ...machine learning (ML) techniques. We are... ...training signals, and evaluation criteria. Evaluation & Benchmarks: Design and implement... ...work experience. LLM Expertise: Experience...
Work experience placement, Local area, Worldwide, Relocation package
$176k - $304k
...ML Scientist I / II, Foundation Models for Life Sciences San Francisco... ...the Foundation Models team researches and develops large-scale... ...formulation, model design, training, evaluation, and integration into Lila's... ...ML tools, frameworks, or benchmark datasets for scientific...
Full time, Work at office, Local area, Flexible hours
- ...Senior / Principal ML Scientist Merge Labs is a frontier research lab with the mission of bridging biological and artificial intelligence to maximize... ...acquisition-strategy using internal and public datasets; benchmark and validate model performance. Integrate ML...
$150k - $300k
...across time, causality, and context. As a Research Scientist, you will tackle fundamental problems in... .... Key Responsibilities Build LLM-powered information extraction pipelines... ...architectural exploration. ~ Strong ML and NLP foundation, particularly in information...
Relocation package, Flexible hours
- Member of Technical Staff - Research Scientist Patronus AI is a frontier lab developing... ...influential research in AI evaluation like FinanceBench, Lynx,... ...outcomes such as papers, benchmarks, datasets, and platform... ...code in Python and modern ML frameworks. Ability to execute...
$160k - $250k
Machine Learning Researcher, Audio Location: San Francisco, CA or Remote... ...metrics and perceptual evaluations. Validate ideas quickly through... ...Comfortable working with both offline benchmarks and live production metrics.... ...or telephony. PhD in ML, AI, or a related field, or...
Work at office, Remote work
$180k - $260k
BLAND is seeking a Machine Learning Researcher focused on multimodal LLM technology. The role involves developing conversational AI models that integrate speech, text, and real-time reasoning. Candidates should possess a strong background in machine learning, LLMs, and...
$259.2k - $324k
Staff Machine Learning Research Scientist/Engineer, Agents Join the team shaping the future of... ...upon our prior model evaluation work with enterprise customers... ...published research in top ML venues (e.g., ACL, EMNLP,... ...experience with open source LLM fine-tuning or involvement...
Full time



