Research Engineer - Evals

$160k - $240k

AI Chopping Block, Inc.

Location: San Francisco, CA (Hybrid) or Remote (Americas, UTC-3 to UTC-10)
Employment Type: Full time
Department: Engineering Team
Compensation: $160K – $240K • 0.01% – 0.10%

Overview

You'll build the evaluation systems that tell us whether Firecrawl actually works. That sounds simple. Our core promise (convert any URL into clean, structured, LLM-ready data reliably) is hard to measure rigorously across millions of different websites, formats, and edge cases. As we layer in models and agent workflows, the question "did that work?" gets harder, not easier.

This isn't an eval role where you inherit a framework and run benchmarks. You'll design the metrics, build the pipelines, generate the datasets, and own the feedback loop from output quality back to model and product decisions. If you care about what "good" actually means and have the engineering depth to measure it, this is the role.

Salary Range: $160,000 to $240,000/year (Range shown is for U.S.-based employees in San Francisco, CA. Compensation outside the U.S. is adjusted fairly based on your country's cost of living.)
Equity Range: Up to 0.10%
Location: San Francisco, CA or Remote (Americas, UTC-3 to UTC-10)
Job Type: Full-Time
Experience: 3+ years in ML engineering, applied AI, or data quality, with production systems
Visa: US Citizenship/Visa required for SF; N/A for Remote

What You'll Do

Build the eval stack from scratch. Design and own the systems that measure whether Firecrawl's outputs are actually good across scrape, crawl, extract, and map. That means defining metrics, building pipelines, curating datasets, and integrating evals into CI/CD so regressions get caught before they ship. You build the infra yourself because you're the one who needs it to work.

Design benchmarks that reflect reality. Our outputs need to hold up across millions of websites: SPAs, paywalled content, dynamic rendering, structured and unstructured formats. You'll build benchmark datasets that cover the real distribution of what our customers send us, including the edge cases that break naive approaches. Ground truth doesn't come for free; you'll design the collection and labeling systems too.

Own LLM-as-judge pipelines. You'll design and validate automated judges that score extraction quality at scale, know the failure modes of LLM-based evaluation, and build the human review tooling needed when automation isn't enough. You understand the difference between an eval that measures something real and one that just flatters the system.

Close the loop with models and RL. Evals here aren't a reporting layer; they're a training signal. You'll work closely with the RL and Search/IR research engineers to turn quality measurements into reward signals and feedback loops that make models meaningfully better. Your benchmarks directly influence what gets trained next.

Run fast experiments and communicate clearly. You design experiments that test meaningful hypotheses, run them quickly, and make decisions based on results. When you have findings, anyone on the team can understand what they mean, no decoder ring required.

What We're Looking For

Builds their own eval infrastructure. You don't wait for tooling to appear. You write the pipelines, curate the datasets, design the rubrics, and validate the judges yourself, because you understand that infra choices directly affect what you're actually measuring. You've run evals at scale and debugged the places where they lie.

Knows what "good" means for unstructured web data. You've worked with messy, real-world data before. You understand why markdown quality is hard to define, why structured extraction fidelity varies by schema, and why naive string-match metrics miss the point. You have strong opinions about what a useful benchmark actually looks like, and the rigor to validate them.

Fluent in LLM evaluation methodology. You understand LLM-as-judge systems, their correlation with human judgment, and where they break down. You've designed rubrics that hold up under adversarial inputs, built human review pipelines that scale, and know how to measure inter-rater agreement. You're not fooled by evals that only look good in aggregate.

Production-minded. You care about whether your evals reflect real production behavior, not just offline benchmarks. You've worked on systems serving real traffic and made hard tradeoffs between evaluation depth, coverage, and cost. A benchmark that doesn't represent what customers actually send isn't a benchmark worth maintaining.

Fast and clear. You'd rather run three rough experiments this week than one polished one next month. When you have results, anyone on the team can understand what they mean and what to do next.

Backgrounds that tend to do well:
ML engineers who've built eval or data quality systems at AI labs or applied teams.
Engineers who've worked on LLM fine-tuning or RLHF pipelines and understand how feedback quality drives model improvement.
People who've worked at the intersection of data infrastructure and model development.
Anyone who's been the person on the team asking "but how do we know this actually works?"

What We're NOT Looking For

Benchmark runners. If your eval experience is running existing frameworks on existing benchmarks and reporting numbers, this isn't the right fit. We need someone who builds the frameworks and defines the benchmarks.

People who treat evals as an afterthought. If your default workflow is to build first and evaluate later, or to treat pass rates as a proxy for actual quality, you'll struggle here. Evals are a first-class product, not a QA gate.

Researchers who need a platform team. If you expect pipelines, datasets, and labeling infrastructure to exist before you can be productive, you'll be frustrated. You build the tools you need.

Slow iterators. If your standard experiment cycle is measured in weeks, not days, you'll struggle with the pace. We need someone who can design, run, and interpret a meaningful experiment within a day or two.

Bonus Points

Any other niche expertise and skills
Previous experience at a scraping, automation, or security-focused startup
Ex-founder

What it Means to Join Firecrawl

High Leverage: Your processes directly amplify our growth.
Autonomy: Own your domain; we care about outcomes, not hours.
Remote-First Culture: Work at our new SF office, while collaborating with our remote team.
Growth Opportunity: Early equity and a role that scales with the company.
Creative Freedom: Experiment with new channels, formats, and automations. If it works, we run with it.

Benefits & Perks

Available to all employees
Salary that makes sense: $140,000-180,000/year (U.S.-based), based on impact, not tenure
Own a piece: Up to 0.15% equity in what you're helping build
Unlimited PTO: Minimum 3 weeks off encouraged; take the time you need to recharge
Parental leave: 12 weeks fully paid, for moms and dads
Wellness stipend: $100/month for the gym, therapy, massages, or whatever keeps you human
Learning & Development: Expense up to $150/year toward anything that helps you grow professionally
Team offsites: A change of scenery, minus the trust falls
Sabbatical: 3 paid months off after 4 years, do something fun and new

Available to US-based full-time employees
Full coverage, no red tape: Medical, dental, and vision (100% for employees, 50% for spouse/kids). No weird loopholes, just care that works
Life & Disability insurance: Employer-paid short-term disability, long-term disability, and life insurance; coverage for life's curveballs
Supplemental options: Optional accident, critical illness, hospital indemnity, and voluntary life insurance for extra peace of mind
Doctegrity telehealth: Talk to a doctor from your couch
401(k) plan: Retirement might be a ways off, but future-you will thank you
Pre-tax benefits: Access to FSAs and commuter benefits to help your wallet out a bit
Pet insurance: Because fur babies are family too

Available to SF-based employees
SF HQ perks: Snacks, drinks, team lunches, and the occasional burst of chaotic startup energy


Vacancy posted 2 days ago

