Applied Data Scientist, LLM Evaluation

Driver AI Inc.

Introduction

At Driver, we're building systems that turn source code into human language. The tech stack includes a core compiler-like engine, a heavily asynchronous/distributed backend server, and a frontend web application that provides a rich user experience.
About Driver

We're an early-stage startup backed by Y Combinator and Google Ventures that combines first-principles technical approaches and applied LLM expertise to tackle context engineering at scale. Driver builds the context layer that employees and AI agents alike use to develop software.
Working at Driver

Driver is an early-stage but fast-growing startup. As such, we take advantage of what startups excel at: delivery speed, flexibility, and the enjoyment of working with a small, close-knit team.

Organizational and engineering values at Driver include first-principles thinking, correctness by construction, writing things down, experimentation and iteration, pragmatism, commitment to effective communication and transparency, autonomy, and ambition.
Job Overview

Title: Applied Data Scientist, LLM Evaluation

Location: Remote or Austin, TX

Our value is directly tied to the quality of our content at scale. The platform generates technical documentation across a complex, multi-stage pipeline - producing multiple content types at different levels of abstraction, from individual code elements up to high-level summaries. Today, changes to models, context strategies, or pipeline architecture are evaluated largely through manual review and intuition. There is no systematic way to answer: "Did this change make our output better, worse, or the same - and for which languages, repo sizes, and content types?"

This is a hard problem. LLM outputs are non-deterministic - identical inputs produce different outputs across runs, and small variations at early pipeline stages compound into meaningfully different end-user content downstream. Evaluating quality requires methodology that accounts for this: statistical reasoning over multiple runs, understanding of cascade effects through the pipeline, and rubrics that balance human judgment with automated signals.

This role builds the evaluation function from scratch. You'll define what "good" means for our generated content, build the infrastructure to measure it, and create the experimental framework that lets the team ship changes with confidence.
What You'll Do

You'll own the LLM evaluation strategy at Driver - from first principles to production infrastructure. This is a foundational role: you're not joining an existing eval team, you're building it. As the function matures, you'll seed and grow a team around it.

Define quality metrics and build evaluation datasets. Establish what "good" looks like for each content type across the pipeline. Build and curate gold-standard evaluation datasets across languages and repo archetypes (monorepos, microservices, libraries, applications). Design rubrics that capture accuracy, completeness, usefulness, and readability.

Build benchmarking and experimentation infrastructure. Create automated evaluation pipelines that score output against reference datasets. Instrument the content generation pipeline to support A/B comparisons - run the same codebase through two strategies and compare results. Build tooling for LLM-as-judge evaluation and regression detection. Integrate evaluation into CI so pipeline changes come with quality evidence.

Develop automated quality signals at scale. Build quality checks that flag degraded output without requiring human review of every document. Monitor content quality trends over time. Design sampling strategies for human review that maximize signal with minimal annotation effort.

Quantify tradeoffs and inform decisions. Run experiments on model selection, context strategies, and pipeline architecture changes. Quantify cost/quality/latency tradeoffs. Partner with the engineering team to turn evaluation insights into shipped improvements.
Qualifications

Education: Bachelor's, Master's, or PhD in Statistics, Machine Learning, Data Science, Computational Linguistics, or a related quantitative field.

Experience: Minimum of 3-5 years in applied science, ML engineering, or data science roles with a focus on evaluation, NLP, or generative AI. 7+ years of experience preferred.

Required Technical Skills

Strong statistical foundations: experimental design, hypothesis testing, confidence intervals, effect sizes, power analysis.
Experience designing and running evaluations for LLM or NLP systems - you've thought carefully about what "better" means when outputs are open-ended text.
Proficiency in Python and the scientific/data stack (pandas, NumPy, SciPy, scikit-learn).
Comfortable working in Jupyter notebooks for exploration and prototyping, and turning that work into automated pipelines.
Experience with LLM-as-judge approaches, inter-annotator agreement, and rubric design for subjective quality assessment.
Familiarity with the practical challenges of non-deterministic systems: variance decomposition, multi-run methodology, distinguishing signal from noise at scale.
Strong data storytelling - you can turn experiment results into clear recommendations that drive engineering and product decisions.

Preferred and Nice-to-Have Technical Skills

Experience with LLM APIs and prompt engineering across multiple providers.
Familiarity with evaluation frameworks (e.g., RAGAS, DeepEval, custom harnesses).
Experience building data pipelines or ETL workflows (Airflow, Dagster, or similar).
Comfort with SQL and working directly against production data stores.
Experience with visualization tools (Matplotlib, Plotly, Streamlit) for building internal dashboards and reports.
Background in code understanding, developer tools, or technical documentation.
Experience building or managing annotation pipelines and human evaluation workflows.

Benefits

Competitive Compensation Packages - Cash & Equity
Flexible Work Culture
Unlimited Time Off + 12 Paid Company Holidays
Insurance - Health, Dental, & Vision
Life Insurance & FSA Accounts
401(k) Retirement Accounts - Traditional, Roth, or Both
Quarterly Team Offsites

Driver is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.


Vacancy posted 1 day ago
