Evaluations Engineer
PetsApp
About the Role

We are looking for strong engineers to join our team and own the leaderboards that appear on Vals AI. You will be responsible for testing and benchmarking new models as they are released on tasks in law, tax, coding, finance, and more. You will analyze error modes of models, evaluate their strengths and weaknesses, and work with our communications team to release results.

Our results are used by startups, enterprises, and research labs alike. We work with all the major foundation model labs and with some of the largest financial institutions and hospital systems in the world. Our work has been featured by the Wall Street Journal, Washington Post, and Bloomberg.

We are building the standard for evaluating the ability of LLMs to perform real‑world tasks. You will contribute directly to the leaderboards that make this possible.

What You’ll Do

- Evaluate new LLM releases across the Vals AI suite of benchmarks
- Work directly with both open‑source and closed‑source foundation model labs in evaluating model performance
- Use tools like Docent to analyze common failure modes and patterns in model performance
- Work directly with our social media team to post interesting findings and results
- Add new models and maintain integrations in our model library
- Help improve and maintain the infrastructure we use to run benchmarks (agentic and non‑agentic)
- Collaborate closely with our research team on the creation of new benchmarks

This role follows the rhythm of model releases. Expect intense sprints in the days following a major launch, and calmer stretches in between releases.

Requirements

- Familiarity with LLMs: You should already be familiar with the space: the current leading models, their relative performance, and how to use large language models in practice.
- Strong engineering fundamentals: You can build and ship quickly with high quality. You should have a track record of building things of significant scope (at jobs, side projects, open source, etc.).
- Python expertise: Significant experience in Python, especially in a professional setting.
- Team collaboration: Experience working in development sprints, Git workflows, and pull request reviews.
- Location: We are an in‑person team based in San Francisco. We will support your relocation or transportation as needed.

Nice‑to‑Haves

- Previous experience with benchmarking large language models, or creating benchmarks
- Previous experience working at a startup or starting your own company
- Technical writing experience and ability
- Machine learning research experience

What We Offer

- Highly competitive salary and meaningful ownership. Excellence is well rewarded.
- Relocation and transportation support
- Health/dental insurance coverage
- Lunch and dinner provided, free snacks/coffee/drinks
- 401K plan
- Unlimited PTO

About Us

Founding team: The core methodology behind this platform comes from NLP evaluation research we had done at Stanford. We raised a $5M seed from some of the top institutional and angel investors in the valley. Our team has prior work experience at NVIDIA, Meta, Microsoft, Palantir and HRT. Collectively, we have over 300 citations in our published work. Our early team includes Stanford PhDs, ex‑Jane Street quants, and the first designer at Snorkel.

Tech stack: We use Python for most things at Vals. Our platform is built on Django, with a React frontend. All of the infra is on AWS using CDK for IaC.

What We're Looking For

- Learning velocity: The role encompasses a wide variety of tasks. Rather than expecting you to be an expert on Day 1, we are looking for someone who can learn new skills and technologies extremely quickly.
- Ownership: Working in a small, talent‑dense team, we expect everyone to show initiative to build where it's needed, not where it's asked. We strive for autonomy over consensus. This is especially true for this role.
- Intensity: The LLM landscape is constantly changing. Foundation model labs are continuously pushing the frontier. The unicorn companies that will emerge from this technology shift are being built now. Those that win will have an incredibly high speed of execution.
- Solution‑oriented mindset: We're looking for people who see opportunities to craft solutions at each juncture, not those who pass hard problems to others or admit defeat.

Further Reading

- Hugging Face blog on evaluation
- Anthropic’s blog on challenges in evaluation
- New York Times article on issues in benchmarking
- Stanford HAI report showing hallucinations in legal tech tools

Referral Bonus

Know someone who would be a good fit? Connect them with us. If we hire them and they stay on for 90 days, you’ll get a $10,000 referral bonus and Vals AI merch! Please mention the bonus in your email.




