Applied Data Scientist, LLM Evaluation
Driver AI Inc.
Introduction
At Driver, we're building systems that turn source code into human language. The tech stack includes a core compiler-like engine, a heavily asynchronous/distributed backend server, and a frontend web application that provides a rich user experience.

About Driver
We're an early-stage startup backed by Y Combinator and Google Ventures that combines first-principles technical approaches and applied LLM expertise to tackle context engineering at scale. Driver builds the context layer that employees and AI agents alike use in developing software.

Working at Driver
Driver is an early-stage but fast-growing startup. As such, we take advantage of what startups excel at: delivery speed, flexibility, and the enjoyment of working with a small, close-knit team. Organizational and engineering values at Driver include first-principles thinking, correct by construction, writing things down, experimentation and iteration, pragmatism, commitment to effective communication and transparency, autonomy, and ambition.

Job Overview
Title: Applied Data Scientist, LLM Evaluation
Location: Remote or Austin, TX

Our value is directly tied to the quality of our content at scale. The platform generates technical documentation across a complex, multi-stage pipeline - producing multiple content types at different levels of abstraction, from individual code elements up to high-level summaries. Today, changes to models, context strategies, or pipeline architecture are evaluated largely through manual review and intuition. There is no systematic way to answer: "Did this change make our output better, worse, or the same - and for which languages, repo sizes, and content types?"

This is a hard problem. LLM outputs are non-deterministic - identical inputs produce different outputs across runs, and small variations at early pipeline stages compound into meaningfully different end-user content downstream. Evaluating quality requires methodology that accounts for this: statistical reasoning over multiple runs, understanding of cascade effects through the pipeline, and rubrics that balance human judgment with automated signals.

This role builds the evaluation function from scratch. You'll define what "good" means for our generated content, build the infrastructure to measure it, and create the experimental framework that lets the team ship changes with confidence.

What You'll Do
You'll own the LLM evaluation strategy at Driver - from first principles to production infrastructure. This is a foundational role: you're not joining an existing eval team, you're building it. As the function matures, you'll seed and grow a team around it.

Define quality metrics and build evaluation datasets.
- Establish what "good" looks like for each content type across the pipeline.
- Build and curate gold-standard evaluation datasets across languages and repo archetypes (monorepos, microservices, libraries, applications).
- Design rubrics that capture accuracy, completeness, usefulness, and readability.

Build benchmarking and experimentation infrastructure.
- Create automated evaluation pipelines that score output against reference datasets.
- Instrument the content generation pipeline to support A/B comparisons - run the same codebase through two strategies and compare results.
- Build tooling for LLM-as-judge evaluation and regression detection.
- Integrate evaluation into CI so pipeline changes come with quality evidence.

Develop automated quality signals at scale.
- Build quality checks that flag degraded output without requiring human review of every document.
- Monitor content quality trends over time.
- Design sampling strategies for human review that maximize signal with minimal annotation effort.

Quantify tradeoffs and inform decisions.
- Run experiments on model selection, context strategies, and pipeline architecture changes.
- Quantify cost/quality/latency tradeoffs.
- Partner with the engineering team to turn evaluation insights into shipped improvements.

Qualifications
Education: Bachelor's, Master's, or PhD in Statistics, Machine Learning, Data Science, Computational Linguistics, or a related quantitative field.
Experience: Minimum 3-5 years in applied science, ML engineering, or data science roles with a focus on evaluation, NLP, or generative AI. 7+ years of experience preferred.

Required Technical Skills
- Strong statistical foundations: experimental design, hypothesis testing, confidence intervals, effect sizes, power analysis.
- Experience designing and running evaluations for LLM or NLP systems - you've thought carefully about what "better" means when outputs are open-ended text.
- Proficient in Python and the scientific/data stack (pandas, NumPy, scipy, sklearn).
- Comfortable working in Jupyter notebooks for exploration and prototyping, and turning that work into automated pipelines.
- Experience with LLM-as-judge approaches, inter-annotator agreement, and rubric design for subjective quality assessment.
- Familiarity with the practical challenges of non-deterministic systems: variance decomposition, multi-run methodology, distinguishing signal from noise at scale.
- Strong data storytelling - you can turn experiment results into clear recommendations that drive engineering and product decisions.
- Experience with LLM APIs and prompt engineering across multiple providers.
- Familiarity with evaluation frameworks (e.g., RAGAS, DeepEval, custom harnesses).
- Experience building data pipelines or ETL workflows (Airflow, Dagster, or similar).
- Comfort with SQL and working directly against production data stores.
- Experience with visualization tools (Matplotlib, Plotly, Streamlit) for building internal dashboards and reports.
- Background in code understanding, developer tools, or technical documentation.
- Experience building or managing annotation pipelines and human evaluation workflows.

Benefits
- Competitive Compensation Packages - Cash & Equity
- Flexible Work Culture
- Unlimited Time Off + 12 Paid Company Holidays
- Insurance - Health, Dental, & Vision
- Life Insurance & FSA Accounts
- 401(k) Retirement Accounts - Traditional, Roth, or Both
- Quarterly Team Offsites
Driver is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
Vacancy posted 1 day ago

