Research Engineer - Evals

$160k - $240k

Firecrawl

You'll build the evaluation systems that tell us whether Firecrawl actually works. That sounds simple. It isn't. Our core promise - convert any URL into clean, structured, LLM-ready data reliably - is hard to measure rigorously across millions of different websites, formats, and edge cases. As we layer in models and agent workflows, the question "did that work?" gets harder, not easier.

This isn't an eval role where you inherit a framework and run benchmarks. You'll design the metrics, build the pipelines, generate the datasets, and own the feedback loop from output quality back to model and product decisions. If you care about what "good" actually means and have the engineering depth to measure it, this is the role.

Salary Range: $160,000 to $240,000/year (Range shown is for U.S.-based employees in San Francisco, CA. Compensation outside the U.S. is adjusted fairly based on your country's cost of living.)

Equity Range: Up to 0.10%

Location: San Francisco, CA or Remote (Americas, UTC-3 to UTC-10)

Job Type: Full-Time

Experience: 3+ years in ML engineering, applied AI, or data quality - with production systems

Visa: US Citizenship/Visa required for SF; N/A for Remote

About Firecrawl

Firecrawl is the easiest way to extract data from the web. Developers use us to reliably convert URLs into LLM-ready markdown or structured data with a single API call. In just a year, we've hit 8 figures in ARR and 120k+ GitHub stars by building the fastest way for developers to get LLM-ready data.

We're a small, fast-moving, technical team building the essential infrastructure that superintelligence will use to gather data on the web. We ship fast and deep.

What You'll Do

Build the eval stack from scratch. Design and own the systems that measure whether Firecrawl's outputs are actually good - across scrape, crawl, extract, and map. That means defining metrics, building pipelines, curating datasets, and integrating evals into CI/CD so regressions get caught before they ship. You build the infra yourself because you're the one who needs it to work.

Design benchmarks that reflect reality. Our outputs need to hold up across millions of websites - SPAs, paywalled content, dynamic rendering, structured and unstructured formats. You'll build benchmark datasets that cover the real distribution of what our customers send us, including the edge cases that break naive approaches. Ground truth doesn't come for free - you'll design the collection and labeling systems too.

Own LLM-as-judge pipelines. You'll design and validate automated judges that score extraction quality at scale, know the failure modes of LLM-based evaluation, and build the human review tooling needed when automation isn't enough. You understand the difference between an eval that measures something real and one that just flatters the system.

Close the loop with models and RL. Evals here aren't a reporting layer - they're a training signal. You'll work closely with the RL and Search/IR research engineers to turn quality measurements into reward signals and feedback loops that make models meaningfully better. Your benchmarks directly influence what gets trained next.

Run fast experiments and communicate clearly. You design experiments that test meaningful hypotheses, run them quickly, and make decisions based on results. When you have findings, anyone on the team can understand what they mean - no decoder ring required.

What We're Looking For

Builds their own eval infrastructure. You don't wait for tooling to appear. You write the pipelines, curate the datasets, design the rubrics, and validate the judges yourself - because you understand that infra choices directly affect what you're actually measuring. You've run evals at scale and debugged the places where they lie.

Knows what "good" means for unstructured web data. You've worked with messy, real-world data before. You understand why markdown quality is hard to define, why structured extraction fidelity varies by schema, and why naive string-match metrics miss the point. You have strong opinions about what a useful benchmark actually looks like - and the rigor to validate them.

Fluent in LLM evaluation methodology. You understand LLM-as-judge systems, their correlation with human judgment, and where they break down. You've designed rubrics that hold up under adversarial inputs, built human review pipelines that scale, and know how to measure inter-rater agreement. You're not fooled by evals that only look good in aggregate.

Production-minded. You care about whether your evals reflect real production behavior, not just offline benchmarks. You've worked on systems serving real traffic and made hard tradeoffs between evaluation depth, coverage, and cost. A benchmark that doesn't represent what customers actually send isn't a benchmark worth maintaining.

Fast and clear. You'd rather run three rough experiments this week than one polished one next month. When you have results, anyone on the team can understand what they mean - and what to do next.

Backgrounds that tend to do well: ML engineers who've built eval or data quality systems at AI labs or applied teams. Engineers who've worked on LLM fine-tuning or RLHF pipelines and understand how feedback quality drives model improvement. People who've worked at the intersection of data infrastructure and model development. Anyone who's been the person on the team asking "but how do we know this actually works?"

What We're NOT Looking For

Benchmark runners. If your eval experience is running existing frameworks on existing benchmarks and reporting numbers, this isn't the right fit. We need someone who builds the frameworks and defines the benchmarks.

People who treat evals as an afterthought. If your default workflow is to build first and evaluate later - or to treat pass rates as a proxy for actual quality - you'll struggle here. Evals are a first-class product, not a QA gate.

Researchers who need a platform team. If you expect pipelines, datasets, and labeling infrastructure to exist before you can be productive, you'll be frustrated. You build the tools you need.

Slow iterators. If your standard experiment cycle is measured in weeks, not days, you'll struggle with the pace. We need someone who can design, run, and interpret a meaningful experiment within a day or two.

Bonus Points
  • Any other niche expertise and skills
  • Previous experience at a scraping, automation, or security-focused startup
  • Ex-founder

Benefits & Perks

Available to all employees
  • Salary that makes sense - $160,000 to $240,000/year (U.S.-based), based on impact, not tenure
  • Own a piece - Up to 0.1% equity in what you're helping build
  • Unlimited PTO - Minimum 3 weeks off encouraged; take the time you need to recharge
  • Parental leave - 12 weeks fully paid, for moms and dads
  • Wellness stipend - $100/month for the gym, therapy, massages, or whatever keeps you human
  • Learning & Development - Expense up to $150/year toward anything that helps you grow professionally
  • Team offsites - A change of scenery, minus the trust falls
  • Sabbatical - 3 paid months off after 4 years to do something fun and new
Available to US-based full-time employees
  • Full coverage, no red tape - Medical, dental, and vision (100% for employees, 50% for spouse/kids) - no weird loopholes, just care that works
  • Life & Disability insurance - Employer-paid short-term disability, long-term disability, and life insurance - coverage for life's curveballs
  • Supplemental options - Optional accident, critical illness, hospital indemnity, and voluntary life insurance for extra peace of mind
  • Doctegrity telehealth - Talk to a doctor from your couch
  • 401(k) plan - Retirement might be a ways off, but future-you will thank you
  • Pre-tax benefits - Access to FSAs and commuter benefits to help your wallet out a bit
  • Pet insurance - Because fur babies are family too
Available to SF-based employees
  • SF HQ perks - Snacks, drinks, team lunches, and the occasional burst of chaotic startup energy

Interview Process
  1. Application Review - Send us your stuff, and a quick note on why you're excited
  2. Intro Chat (~25 min) - Quick alignment call with a member of our team
  3. Technical Interview (~1 hr) - Tackle a small challenge
  4. Interview with Founders (~30 min) - Culture, vision, and long-term fit
  5. Paid Work Trial (1-2 weeks) - Work on something real with us
  6. Decision - We move fast

If you've ever wanted to own a product-critical system and build alongside founders, this is your moment. Apply now and let's talk.
Vacancy posted 1 day ago