Applied Data Scientist, LLM Evaluation United States (Remote)
$175k - $275k, Driverai
- Remote job
Full-Time - Austin, TX or Remote (any location) - Senior - Product & Engineering - $175k - $275k

Applied Data Scientist, LLM Evaluation

Introduction
At Driver, we’re building systems that turn source code into human language. The tech stack includes a core compiler-like engine, a heavily asynchronous/distributed backend server, and a frontend web application that provides a rich user experience.

About Driver
We’re an early-stage startup backed by Y Combinator and Google Ventures that combines first-principles technical approaches and applied LLM expertise to tackle context engineering at scale. Driver builds the context layer that employees and AI agents alike use in developing software.

Working at Driver
Driver is an early-stage but fast-growing startup. As such, we take advantage of what startups excel at: delivery speed, flexibility, and the enjoyment of working with a small, close-knit team. Organizational and engineering values at Driver include first-principles thinking, correct by construction, writing things down, experimentation and iteration, pragmatism, commitment to effective communication and transparency, autonomy, and ambition.

Job Overview
Title: Applied Data Scientist, LLM Evaluation
Location: Remote or Austin, TX

Our value is directly tied to the quality of our content at scale. The platform generates technical documentation across a complex, multi-stage pipeline, producing multiple content types at different levels of abstraction, from individual code elements up to high-level summaries. Today, changes to models, context strategies, or pipeline architecture are evaluated largely through manual review and intuition. There is no systematic way to answer: “Did this change make our output better, worse, or the same, and for which languages, repo sizes, and content types?”

This is a hard problem.
LLM outputs are non-deterministic. Identical inputs produce different outputs across runs, and small variations at early pipeline stages compound into meaningfully different end-user content downstream. Evaluating quality requires methodology that accounts for this: statistical reasoning over multiple runs, understanding of cascade effects through the pipeline, and rubrics that balance human judgment with automated signals.

This role builds the evaluation function from scratch. You’ll define what “good” means for our generated content, build the infrastructure to measure it, and create the experimental framework that lets the team ship changes with confidence.

What You’ll Do
You’ll own the LLM evaluation strategy at Driver, from first principles to production infrastructure. This is a foundational role: you’re not joining an existing eval team, you’re building it. As the function matures, you’ll seed and grow a team around it.

Define quality metrics and build evaluation datasets.
- Establish what “good” looks like for each content type across the pipeline.
- Build and curate gold-standard evaluation datasets across languages and repo archetypes (monorepos, microservices, libraries, applications).
- Design rubrics that capture accuracy, completeness, usefulness, and readability.

Build benchmarking and experimentation infrastructure.
- Create automated evaluation pipelines that score output against reference datasets.
- Instrument the content generation pipeline to support A/B comparisons: run the same codebase through two strategies and compare results.
- Build tooling for LLM-as-judge evaluation and regression detection.
- Integrate evaluation into CI so pipeline changes come with quality evidence.

Develop automated quality signals at scale.
- Build quality checks that flag degraded output without requiring human review of every document.
- Monitor content quality trends over time.
- Design sampling strategies for human review that maximize signal with minimal annotation effort.
Quantify tradeoffs and inform decisions.
- Run experiments on model selection, context strategies, and pipeline architecture changes.
- Quantify cost/quality/latency tradeoffs.
- Partner with the engineering team to turn evaluation insights into shipped improvements.

Qualifications
- Education: Bachelor’s, Master’s, or PhD in Statistics, Machine Learning, Data Science, Computational Linguistics, or a related quantitative field.
- Experience: Minimum of 3 to 5 years in applied science, ML engineering, or data science roles with a focus on evaluation, NLP, or generative AI. 7+ years of experience preferred.

Required Technical Skills
- Strong statistical foundations: experimental design, hypothesis testing, confidence intervals, effect sizes, power analysis.
- Experience designing and running evaluations for LLM or NLP systems: you’ve thought carefully about what “better” means when outputs are open-ended text.
- Proficient in Python and the scientific/data stack (pandas, NumPy, scipy, sklearn).
- Comfortable working in Jupyter notebooks for exploration and prototyping, and turning that work into automated pipelines.
- Experience with LLM-as-judge approaches, inter-annotator agreement, and rubric design for subjective quality assessment.
- Familiarity with the practical challenges of non-deterministic systems: variance decomposition, multi-run methodology, distinguishing signal from noise at scale.
- Strong data storytelling: you can turn experiment results into clear recommendations that drive engineering and product decisions.

Preferred and Nice-to-Have Technical Skills
- Experience with LLM APIs and prompt engineering across multiple providers.
- Familiarity with evaluation frameworks (e.g., RAGAS, DeepEval, custom harnesses).
- Experience building data pipelines or ETL workflows (Airflow, Dagster, or similar).
- Comfort with SQL and working directly against production data stores.
- Experience with visualization tools (Matplotlib, Plotly, Streamlit) for building internal dashboards and reports.
- Background in code understanding, developer tools, or technical documentation.
- Experience building or managing annotation pipelines and human evaluation workflows.

Benefits
- Competitive Compensation Packages: Cash & Equity
- Flexible Work Culture
- Unlimited Time Off + 12 Paid Company Holidays
- Life Insurance & FSA Accounts
- 401(k) Retirement Accounts: Traditional, Roth, or Both
- Quarterly Team Offsites

Driver is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.


