Evaluations Engineer

PetsApp

About the Role

We are looking for strong engineers to join our team and own the leaderboards that appear on Vals AI. You will be responsible for testing and benchmarking new models as they are released on tasks in law, tax, coding, finance, and more. You will analyze error modes of models, evaluate their strengths and weaknesses, and work with our communications team to release results.

Our results are used by startups, enterprises, and research labs alike. We work with all the major foundation model labs, some of the largest financial institutions, and hospital systems in the world. Our work has been featured by the Wall Street Journal, Washington Post, and Bloomberg.

We are building the standard for evaluating the ability of LLMs to perform real-world tasks. You will contribute directly to the leaderboards that make this possible.

What You'll Do

  • Evaluate new LLM releases across the Vals AI suite of benchmarks
  • Work directly with both open-source and closed-source foundation model labs in evaluating model performance
  • Use tools like Docent to analyze common failure modes and patterns in model performance
  • Work directly with our social media team to post interesting findings and results
  • Add new models and maintain integrations in our model library
  • Help improve and maintain the infrastructure we use to run benchmarks (agentic and non-agentic)
  • Collaborate closely with our research team on the creation of new benchmarks

This role follows the rhythm of model releases. Expect intense sprints in the days following a major launch, and calmer stretches in between releases.

Requirements

  • Familiarity with LLMs: You should already be familiar with the space: the current leading models, their relative performance, and how to use large language models in practice.
  • Strong engineering fundamentals: You can build and ship quickly with high quality. You should have a track record of building things of significant scope (at jobs, side projects, open source, etc.).
  • Python expertise: Significant experience in Python, especially in a professional setting.
  • Team collaboration: Experience working in development sprints, Git workflows, and pull request reviews.
  • Location: We are an in-person team based in San Francisco. We will support your relocation or transportation as needed.

Nice-to-Haves

  • Previous experience with benchmarking large language models, or creating benchmarks
  • Previous experience working at a startup or starting your own company
  • Technical writing experience and ability
  • Machine learning research experience

What We Offer

  • Highly competitive salary and meaningful ownership. Excellence is well rewarded.
  • Relocation and transportation support
  • Health/dental insurance coverage
  • Lunch and dinner provided, free snacks/coffee/drinks
  • 401K plan
  • Unlimited PTO

About Us

  • Founding team: The core methodology behind this platform comes from NLP evaluation research we did at Stanford. We raised a $5M seed from some of the top institutional and angel investors in the valley. Our team has prior work experience at NVIDIA, Meta, Microsoft, Palantir and HRT. Collectively, we have over 300 citations in our published work. Our early team includes Stanford PhDs, ex-Jane Street quants, and the first designer at Snorkel.
  • Tech stack: We use Python for most things at Vals. Our platform is built on Django, with a React frontend. All of the infra is on AWS using CDK for IaC.

What We're Looking For

  • Learning velocity: The role encompasses a wide variety of tasks. Rather than expecting you to be an expert on Day 1, we are looking for someone who can learn new skills and technologies extremely quickly.
  • Ownership: Working in a small, talent-dense team, we expect everyone to show initiative to build where it's needed, not where it's asked. We strive for autonomy over consensus. This is especially true for this role.
  • Intensity: The LLM landscape is constantly changing. Foundation model labs are continuously pushing the frontier. The unicorn companies that will emerge from this technology shift are being built now. Those that win will have an incredibly high speed of execution.
  • Solution-oriented mindset: We're looking for people who see opportunities to craft solutions at each juncture, not those who pass hard problems to others or admit defeat.

Further Reading

  • Hugging Face blog on evaluation
  • Anthropic's blog on challenges in evaluation
  • New York Times article on issues in benchmarking
  • Stanford HAI report showing hallucinations in legal tech tools

Referral Bonus

Know someone who would be a good fit? Connect them with us by email (address available on click.appcast.io). If we hire them and they stay on for 90 days, you'll get a $10,000 referral bonus and Vals AI merch! Please mention the bonus in your email.

Vacancy posted 3 days ago