
Software Engineer (Model Evaluation & Benchmarking)

SpreeAI

About the Role

We are hiring Engineers focused on AI Model Evaluation to build the systems that ensure multimodal AI behaves reliably, consistently, and predictably as it moves from research into production. This position focuses on evaluating generative and vision-based models through automated benchmarking, dataset-driven testing, and performance validation pipelines. You will work at the intersection of applied science, infrastructure, and product, helping define how we measure realism, consistency, and quality across image, video, and multimodal AI systems.

Why This Role Exists

Modern AI evaluation extends beyond pass/fail testing. Multimodal generative systems require:
  • benchmarking across visual realism, pose consistency, and identity preservation
  • scalable evaluation pipelines integrated into continuous deployment workflows

We are building evaluation systems where research velocity and product reliability must coexist. This role is for engineers interested in defining how quality is measured in generative AI systems.

What you'll do
  • Build automated evaluation pipelines for multimodal AI models.
  • Benchmark diffusion models, vision systems, and generative workflows.
  • Validate model checkpoints and detect regressions across versions.
  • Develop evaluation metrics for realism, consistency, and performance.
  • Integrate evaluation tooling into CI/CD workflows.
  • Collaborate with ML researchers and infrastructure teams to ensure production readiness.
  • Analyze failure modes and propose evaluation strategies.

Core Areas & Tooling

Candidates should be familiar with or interested in:
  • LLM, VLM, or Stable Diffusion model evals
  • Image/Video benchmarking techniques
  • research experiment validation pipelines

Qualifications
  • Degree in Computer Science, AI, Engineering, or comparable combination of education and practical experience.
  • Strong programming skills in Python.
  • Familiarity with object-oriented programming (C++, Java, Python, or similar).
  • Strong data structures and algorithms fundamentals.
  • Understanding of machine learning experimentation workflows.

Preferred Qualifications
  • Experience evaluating vision or generative models.
  • Familiarity with HuggingFace ecosystem or open-source ML toolkits.
  • Experience building automated test frameworks or benchmarking tools.
  • Knowledge of diffusion models or multimodal architectures.
  • Experience with data analysis tools (NumPy, Pandas, visualization libraries).

SPREEAI is a fast-growing, innovative AI company at the forefront of fashion and e-commerce, revolutionizing how consumers engage with fashion through lifelike photorealistic try-on technology and hyper-personalized shopping experiences. Our mission is to redefine the retail landscape with cutting-edge AI solutions that blend high fashion and technology. We thrive in a dynamic, fast-paced environment where creativity meets technology to drive real impact. If you are passionate about innovation and shaping the future of fashion, SPREEAI offers a platform to make your mark.

Vacancy posted 1 day ago
Similar jobs that could be interesting for you
Based on the Software Engineer (Model Evaluation & Benchmarking) in San Francisco, CA vacancy
  • A fast-growing AI company seeks a Software Engineer to focus on Model Evaluation & Benchmarking. This role involves building evaluation systems for multimodal AI, ensuring reliable performance. The ideal candidate will possess strong Python programming skills, familiarity... 
    Suggested

    SpreeAI

    San Francisco, CA
    1 day ago
  • Refresh AI is seeking a Research Engineer in San Francisco to push the boundaries of benchmarking technology. You will build benchmarks that labs use for evaluating coding abilities and computer-use capability. Your role will require expertise in reinforcement learning... 
    Suggested
    Full time

    Refresh AI

    San Francisco, CA
    4 days ago
  • A leading AI solutions company in San Francisco is seeking an ML Eval Engineer to design evaluation benchmarks and improve model performance. This role involves working with unstructured enterprise data and collaborating closely with the ML and engineering teams. You will... 
    Suggested

    Reducto

    San Francisco, CA
    4 days ago
  • $220k - $320k

     ...specialized language models for companies that need...  ...distillation, training, evaluation, and planet-scale...  ...funded ten-person team of engineers who work in-person in...  ...founded and run their own software companies. We are high...  ...Build tooling and benchmarks to measure and track inference... 
    Suggested
    Work at office

    Inference

    San Francisco, CA
    2 days ago
  •  ...company located in San Francisco is seeking an innovative Quality Engineer for their AI products. This role blends ops, strategy, and...  ...leading labs, and ensure user satisfaction through effective evaluation baselines. Competitive salary and benefits offered, with a focus... 
    Suggested

    Notion

    San Francisco, CA
    3 days ago
  • $208k - $300k

    Machine Learning Engineer - Model Evaluations, Public Sector San Francisco, CA; St. Louis, MO; New York, NY; Washington, DC Ready to Apply? Join...  ...LLM‑judge‑based evaluations. Design test datasets and benchmarks to measure generalization, bias, explainability, and failure... 
    Full time

    Scale AI, Inc.

    San Francisco, CA
    3 days ago
  • $180k - $270k

     ...intelligence through a hardware‑software combination. With SOC...  ...strong software engineering skills (especially in...  ..., data pipelines, or evaluation harnesses that can run...  ...at scale against live model checkpoints. Can...  ...steerability) into measurable benchmarks. Are comfortable... 
    Full time
    Work at office
    Worldwide

    Plaud

    San Francisco, CA
    3 days ago
  • $200k

    Magic, located in San Francisco, is seeking a Member of Technical Staff to build the internal evaluations platform that supports critical company decisions. You will design, implement, and validate evaluation tasks for large-scale systems, ensuring correctness and reproducibility... 

    Magic

    San Francisco, CA
    2 days ago
  • $405k

     ...group of committed researchers, engineers, policy experts, and...  ...We're looking for a Staff Software Engineer to set technical direction...  ...architect the systems, tooling, and evaluation infrastructure that...  ...eval frameworks that measure model capabilities across diverse... 
    Work at office
    Visa sponsorship
    Flexible hours

    Anthropic

    San Francisco, CA
    2 days ago
  • $172.43k - $230.95k

     ...Senior Software Engineer for the AI Model Lifecycle Team Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the...  ...model, and experiment management: versioning, lineage, evaluation, and reproducible fine-tuning at scale. What You'll... 
    Temporary work

    Crusoe

    San Francisco, CA
    4 days ago
  •  ...the frontier of AI to bring cutting-edge models into production. With our recent $150M...  ...performance serving Build comprehensive benchmarking frameworks that measure real-world performance...  ...contributions to open-source inference engines (vLLM, TensorRT-LLM, SGLang, TGI)... 
    Flexible hours

    Baseten

    San Francisco, CA
    3 days ago
  • $320k

    Anthropic in New York City is seeking a Research Engineer to develop evaluations for Claude’s capabilities. The ideal candidate should have strong Python...  ...during training runs. The role offers a hybrid work model and competitive compensation ranging from $320,000 to $485... 
    Remote job

    Menlo Ventures

    San Francisco, CA
    1 day ago
  • $50 - $150 per hour

    A leading AI company is seeking a software engineer to review and evaluate model-generated code. This contract role requires several years of software engineering experience, particularly as a full-stack engineer at notable tech firms. You will assess code quality and... 
    Hourly pay
    Contract work
    Flexible hours

    Turing

    San Francisco, CA
    3 days ago
  • Role Overview We’re hiring a Model Performance Engineer to own the speed, cost,...  ...stable. Responsibilities Benchmark FP8 quantization across...  ...1% quality degradation. Evaluate serving frameworks (vLLM vs...  ...to shape the foundational software services of a growing company... 

    Fathom

    San Francisco, CA
    3 days ago
  • $204k - $259k

     ...Senior Software Engineer, Quantitative Evaluations Waymo is an autonomous driving technology company with the mission to be the world's most trusted...  ...systems Exposure to ad-hoc data analysis tools for rapid modeling and prototyping Experience working in the AV... 
    Full time
    Remote work

    Waymo

    San Francisco, CA
    22 hours ago
  • $175k - $215k

     ...Waymo Driver. The Simulator Evaluation team faces the ultimate data...  ...We are looking for a Software Engineer to build the metrics and pipelines...  ...will report to Senior Staff Software Engineering Manager and serve...  ...one day and a generative model the next. We prefer... 
    Full time
    Remote work

    Waymo

    San Francisco, CA
    1 day ago
  • $204k - $259k

     ...ground for the Waymo Driver. The Simulator Evaluation team faces the ultimate data challenge:...  ..."real"? We are looking for a Senior Software Engineer to build the metrics and systems that...  ...driven by explicit rules or foundation models-provide a trustworthy representation of... 
    Full time
    Remote work

    Waymo

    San Francisco, CA
    1 day ago
  • $204k - $259k

     ...Senior Software Engineer, Statistical Evaluation and Sampling Waymo is an autonomous driving technology company with the mission to be the world's most...  ...designing, training, evaluating, and applying ML models ~ Experience working in the AV industry ~ PhD in a... 
    Full time
    Remote work

    Waymo

    San Francisco, CA
    13 days ago
  • YO IT Consulting is seeking a Senior Propulsion Engineer to evaluate AI-generated content related to propulsion engineering. This remote role...  ...would be advantageous. Join a team challenging AI language models to improve their technical reasoning.
    Remote job

    YO IT Consulting

    San Francisco, CA
    1 day ago
  • Anthropic is seeking a Research Lead for the Training Insights team to shape the evaluation of model capabilities. This hands-on leadership role involves developing innovative evaluation methodologies and mentoring a team of researchers. You will play a crucial role in... 
    Remote work

    Anthropic

    San Francisco, CA
    3 days ago
  • $50 - $75 per hour

    A leading tech company based in Australia is seeking an AI Model Evaluator on a contract basis. The role involves evaluating AI-generated responses, writing prompts, and providing justifications based on specific criteria. Ideal candidates will hold a Master's degree in... 
    Hourly pay
    Contract work

    Mercor

    San Francisco, CA
    1 day ago
  • $15 - $20 per hour

    Mercor is seeking a Generalist with proficiency in English and Kannada to conduct fact-checking and generate evaluation data. This role involves assessing model response quality and ensuring alignment with conversational guidelines. The ideal candidate will possess a Bachelor... 
    Remote job
    Hourly pay

    Mercor

    San Francisco, CA
    3 days ago
  • Twelve-Labs in San Francisco is seeking a dedicated member for our ML Data Team to lead video data preparation and evaluation. This role includes defining dataset needs, automating processes, and enhancing data quality through collaboration. Ideal candidates should have... 
    Flexible hours

    Twelve-Labs

    San Francisco, CA
    3 days ago
  •  ...Software Engineer, Agent Evaluation and Quality Engineering · Full-time · San Francisco; New York Our mission is to automate coding. The first...  ...the shared harness—and across high-stakes decisions around model choice, quality, and cost. What You'll Work On Designing... 
    Full time
    Work at office

    Anysphere

    San Francisco, CA
    3 days ago
  •  ...development of cutting-edge multimodal foundation models that have the ability to comprehend...  ...-language data preparation and model evaluation. This role comes with high ownership and...  ...Internal Collaboration : Partner with Engineering and AI Model teams to align on top... 
    Work at office
    Worldwide
    Flexible hours

    Twelve Labs, Inc

    San Francisco, CA
    1 day ago
  •  ...Compensation: Not listed. Posted: April 25, 2026. Required Skills: AI evaluation, data pipelines, agent instrumentation. Requirements: Mid/Senior. Visa Sponsorship: Not mentioned. Relocation: Not mentioned. About the Role... 
    Relocation
    Visa sponsorship

    Repovive, Inc.

    San Francisco, CA
    4 days ago
  • $230k - $385k

    About the Team We're hiring software engineers to make OpenAI's Model Performance teams more productive. These teams work on the systems, tooling, and infrastructure that help improve model performance across OpenAI's training and inference workloads at frontier scale.... 

    OpenAI

    San Francisco, CA
    4 days ago
  • $295k

     ...use and access our state-of-the-art AI models, allowing them to do things that they've...  ...About the Role We are looking for an engineer who wants to take the world's largest...  ...Have at least 5 years of professional software engineering experience. Have or can... 

    OpenAI

    San Francisco, CA
    22 hours ago
  • $25 per hour

    Prolific is seeking AI Training Experts to assist in training and evaluating cutting-edge AI models. The role involves completing tasks such as analyzing and writing annotations, and judging AI performance. Candidates should have professional experience as an AI Trainer... 
    Remote job
    Hourly pay
    Work from home
    Flexible hours

    Prolific

    San Francisco, CA
    22 hours ago
  •  ...development. Our vast talent network trains frontier AI models in the same way teachers teach students: by sharing...  ...or London offices. About the Role As a Senior Software Engineer (AI Data & Evaluation) at Mercor, you will be at the core of building the data... 
    Work at office
    Relocation package

    Mercor Alabaster

    San Francisco, CA
    4 days ago
