Evaluations Engineer

PetsApp

About the Role

We are looking for strong engineers to join our team and own the leaderboards that appear on Vals AI. You will be responsible for testing and benchmarking new models as they are released on tasks in law, tax, coding, finance, and more. You will analyze error modes of models, evaluate their strengths and weaknesses, and work with our communications team to release results.

Our results are used by startups, enterprises, and research labs alike. We work with all the major foundation model labs and with some of the largest financial institutions and hospital systems in the world. Our work has been featured by the Wall Street Journal, Washington Post, and Bloomberg.

We are building the standard for evaluating the ability of LLMs to perform real‑world tasks. You will contribute directly to the leaderboards that make this possible.

What You’ll Do

Evaluate new LLM releases across the Vals AI suite of benchmarks.
Work directly with both open‑source and closed‑source foundation model labs in evaluating model performance.
Use tools like Docent to analyze common failure modes and patterns in model performance.
Work directly with our social media team to post interesting findings and results.
Add new models and maintain integrations in our model library.
Help improve and maintain the infrastructure we use to run benchmarks (agentic and non‑agentic).
Collaborate closely with our research team on the creation of new benchmarks.

This role follows the rhythm of model releases. Expect intense sprints in the days following a major launch, and calmer stretches in between releases.

Requirements

Familiarity with LLMs: You should already be familiar with the space: the current leading models, their relative performance, and how to use large language models in practice.
Strong engineering fundamentals: You can build and ship quickly with high quality. You should have a track record of building things of significant scope (at jobs, side projects, open source, etc.).
Python expertise: Significant experience in Python, especially in a professional setting.
Team collaboration: Experience working in development sprints, Git workflows, and pull request reviews.
Location: We are an in‑person team based in San Francisco. We will support your relocation or transportation as needed.

Nice‑to‑Haves

Previous experience with benchmarking large language models, or creating benchmarks
Previous experience working at a startup or starting your own company
Technical writing experience and ability
Machine learning research experience

What We Offer

Highly competitive salary and meaningful ownership. Excellence is well rewarded.
Relocation and transportation support
Health/dental insurance coverage
Lunch and dinner provided, free snacks/coffee/drinks
401K plan
Unlimited PTO

About Us

Founding team: The core methodology behind this platform comes from NLP evaluation research we did at Stanford. We raised a $5M seed from some of the top institutional and angel investors in the valley. Our team has prior work experience at NVIDIA, Meta, Microsoft, Palantir and HRT. Collectively, we have over 300 citations in our published work. Our early team includes Stanford PhDs, ex‑Jane Street quants, and the first designer at Snorkel.
Tech stack: We use Python for most things at Vals. Our platform is built on Django, with a React frontend. All of the infra is on AWS using CDK for IaC.

What We're Looking For

Learning velocity: The role encompasses a wide variety of tasks. Rather than expecting you to be an expert on Day 1, we are looking for someone who can learn new skills and technologies extremely quickly.
Ownership: Working in a small, talent‑dense team, we expect everyone to show initiative to build where it's needed, not where it's asked. We strive for autonomy over consensus. This is especially true for this role.
Intensity: The LLM landscape is constantly changing. Foundation model labs are continuously pushing the frontier. The unicorn companies that will emerge from this technology shift are being built now. Those that win will have an incredibly high speed of execution.
Solution‑oriented mindset: We're looking for people who see opportunities to craft solutions at each juncture, not those who pass hard problems to others or admit defeat.

Further Reading

Hugging Face blog on evaluation
Anthropic’s blog on challenges in evaluation
New York Times article on issues in benchmarking
Stanford HAI report showing hallucinations in legal tech tools

Referral Bonus

Know someone who would be a good fit? Connect them with us by email (the address is shown on click.appcast.io). If we hire them and they stay on for 90 days, you’ll get a $10,000 referral bonus and Vals AI merch! Please mention the bonus in your email.


Vacancy posted 3 days ago

