Applied Data Scientist, LLM Evaluation

Driver AI Inc.

Introduction

At Driver, we're building systems that turn source code into human language. The tech stack includes a core compiler-like engine, a heavily asynchronous/distributed backend server, and a frontend web application that provides a rich user experience.
About Driver

We're an early-stage startup backed by Y Combinator and Google Ventures that combines first-principles technical approaches and applied LLM expertise to tackle context engineering at scale. Driver builds the context layer that employees and AI agents alike use when developing software.
Working at Driver

Driver is an early-stage but fast-growing startup. As such, we take advantage of what startups excel at: delivery speed, flexibility, and the enjoyment of working with a small, close-knit team.

Organizational and engineering values at Driver include first-principles thinking, correctness by construction, writing things down, experimentation and iteration, pragmatism, commitment to effective communication and transparency, autonomy, and ambition.
Job Overview

Title: Applied Data Scientist, LLM Evaluation

Location: Remote or Austin, TX

Our value is directly tied to the quality of our content at scale. The platform generates technical documentation across a complex, multi-stage pipeline - producing multiple content types at different levels of abstraction, from individual code elements up to high-level summaries. Today, changes to models, context strategies, or pipeline architecture are evaluated largely through manual review and intuition. There is no systematic way to answer: "Did this change make our output better, worse, or the same - and for which languages, repo sizes, and content types?"

This is a hard problem. LLM outputs are non-deterministic - identical inputs produce different outputs across runs, and small variations at early pipeline stages compound into meaningfully different end-user content downstream. Evaluating quality requires methodology that accounts for this: statistical reasoning over multiple runs, understanding of cascade effects through the pipeline, and rubrics that balance human judgment with automated signals.
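
As a rough illustration of "statistical reasoning over multiple runs", the sketch below compares two sets of repeated-run quality scores with a percentile bootstrap confidence interval on the difference in means. It is a minimal example under stated assumptions, not Driver's methodology: the function name, the 0-1 score scale, and the score values are all hypothetical.

```python
import random
import statistics


def bootstrap_mean_diff_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(candidate) - mean(baseline).

    Each argument is a list of quality scores from repeated runs on the
    same inputs. The function name and defaults are illustrative only.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement to mimic run-to-run variance.
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(candidate, k=len(candidate))
        diffs.append(statistics.fmean(c) - statistics.fmean(b))
    diffs.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return diffs[lo_idx], diffs[hi_idx]


# Hypothetical scores (0-1 scale) from eight runs of each pipeline variant.
baseline_runs = [0.71, 0.68, 0.74, 0.70, 0.69, 0.73, 0.72, 0.67]
candidate_runs = [0.75, 0.78, 0.74, 0.77, 0.79, 0.73, 0.76, 0.80]

observed = statistics.fmean(candidate_runs) - statistics.fmean(baseline_runs)
low, high = bootstrap_mean_diff_ci(baseline_runs, candidate_runs)
print(f"observed diff = {observed:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

If the interval excludes zero, the difference is unlikely to be run-to-run noise alone. With only a handful of runs the interval is wide, which is itself useful information about how many runs an experiment needs.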

This role builds the evaluation function from scratch. You'll define what "good" means for our generated content, build the infrastructure to measure it, and create the experimental framework that lets the team ship changes with confidence.
What You'll Do

You'll own the LLM evaluation strategy at Driver - from first principles to production infrastructure. This is a foundational role: you're not joining an existing eval team, you're building it. As the function matures, you'll seed and grow a team around it.

Define quality metrics and build evaluation datasets. Establish what "good" looks like for each content type across the pipeline. Build and curate gold-standard evaluation datasets across languages and repo archetypes (monorepos, microservices, libraries, applications). Design rubrics that capture accuracy, completeness, usefulness, and readability.

Build benchmarking and experimentation infrastructure. Create automated evaluation pipelines that score output against reference datasets. Instrument the content generation pipeline to support A/B comparisons - run the same codebase through two strategies and compare results. Build tooling for LLM-as-judge evaluation and regression detection. Integrate evaluation into CI so pipeline changes come with quality evidence.
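
One way to picture the A/B comparison described above: score the same set of repositories under two strategies, then test whether the paired differences could plausibly be noise. The sketch below uses a sign-flip permutation test. It assumes one scalar quality score per repository per strategy; the function name and the scores are hypothetical, and a real harness would likely combine this with multiple runs per repository.

```python
import random
import statistics


def paired_permutation_pvalue(scores_a, scores_b, n_permutations=5000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences.

    scores_a[i] and scores_b[i] are scores for the same repository under
    strategy A and strategy B. Names and defaults are illustrative only.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs one score per strategy per item")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(statistics.fmean(diffs))
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the A/B labels are exchangeable,
        # so each paired difference is equally likely to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical per-repository scores for ten repos under two strategies.
strategy_a = [0.62, 0.70, 0.55, 0.81, 0.66, 0.73, 0.59, 0.77, 0.68, 0.64]
strategy_b = [0.66, 0.75, 0.58, 0.85, 0.71, 0.76, 0.64, 0.80, 0.73, 0.69]

p_value = paired_permutation_pvalue(strategy_a, strategy_b)
print(f"p = {p_value:.4f}")
```

Pairing by repository removes between-repo variation from the comparison, so a small but consistent improvement can be detected with far fewer repositories than an unpaired test would need.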

Develop automated quality signals at scale. Build quality checks that flag degraded output without requiring human review of every document. Monitor content quality trends over time. Design sampling strategies for human review that maximize signal with minimal annotation effort.

Quantify tradeoffs and inform decisions. Run experiments on model selection, context strategies, and pipeline architecture changes. Quantify cost/quality/latency tradeoffs. Partner with the engineering team to turn evaluation insights into shipped improvements.
Qualifications

Education: Bachelor's, Master's, or PhD in Statistics, Machine Learning, Data Science, Computational Linguistics, or a related quantitative field.

Experience: A minimum of 3 to 5 years in applied science, ML engineering, or data science roles with a focus on evaluation, NLP, or generative AI. 7+ years of experience preferred.

Required Technical Skills
  • Strong statistical foundations: experimental design, hypothesis testing, confidence intervals, effect sizes, power analysis.
  • Experience designing and running evaluations for LLM or NLP systems - you've thought carefully about what "better" means when outputs are open-ended text.
  • Proficient in Python and the scientific/data stack (pandas, NumPy, SciPy, scikit-learn).
  • Comfortable working in Jupyter notebooks for exploration and prototyping, and turning that work into automated pipelines.
  • Experience with LLM-as-judge approaches, inter-annotator agreement, and rubric design for subjective quality assessment.
  • Familiarity with the practical challenges of non-deterministic systems: variance decomposition, multi-run methodology, distinguishing signal from noise at scale.
  • Strong data storytelling - you can turn experiment results into clear recommendations that drive engineering and product decisions.
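
For readers less familiar with inter-annotator agreement, the sketch below computes Cohen's kappa between a human reviewer and an LLM judge on a binary rubric. It is a minimal illustration: the labels are made up, and the helper is written by hand here rather than taken from any particular library.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected if each rater labeled independently at their own rates.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical rubric labels for eight generated documents.
human = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
judge = ["good", "good", "bad", "bad", "bad", "bad", "good", "good"]

kappa = cohens_kappa(human, judge)
print(round(kappa, 3))  # 0.5 for these toy labels
```

Here the raters agree on 6 of 8 items (0.75), but chance alone would produce 0.5 agreement, so kappa is 0.5. Raw agreement overstates reliability whenever one label dominates, which is why a chance-corrected statistic is the usual choice.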
Preferred and Nice-to-Have Technical Skills
  • Experience with LLM APIs and prompt engineering across multiple providers.
  • Familiarity with evaluation frameworks (e.g., RAGAS, DeepEval, custom harnesses).
  • Experience building data pipelines or ETL workflows (Airflow, Dagster, or similar).
  • Comfort with SQL and working directly against production data stores.
  • Experience with visualization tools (Matplotlib, Plotly, Streamlit) for building internal dashboards and reports.
  • Background in code understanding, developer tools, or technical documentation.
  • Experience building or managing annotation pipelines and human evaluation workflows.
Benefits
  • Competitive Compensation Packages - Cash & Equity
  • Flexible Work Culture
  • Unlimited Time Off + 12 Paid Company Holidays
  • Insurance - Health, Dental, & Vision
  • Life Insurance & FSA Accounts
  • 401(k) Retirement Accounts - Traditional, Roth, or Both
  • Quarterly Team Offsites

Driver is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
Vacancy posted 1 day ago