AI Engineer, Agents & Evaluation

Guild.ai, Inc.

We’re looking for our first AI Engineer focused on agents and evaluation: a foundational hire who will shape how we build, measure, and scale intelligent systems.

The Opportunity: Design the Playbook for High-Performance AI Agents

We’re tackling one of the hardest and most important problems in software engineering: helping developers understand, evolve, and operate complex systems using autonomous and event‑driven AI. In this role, you’ll build the evaluation frameworks, task harnesses, and orchestration strategies that make our agents reliable, testable, and genuinely useful.

Your work will directly improve our agents, and it will also create reusable benchmarks and artifacts that can inspire new approaches and push forward the broader foundation model ecosystem.

If you love designing experiments, building systems, and iterating tightly between theory and code, and you’re excited by a very 0→1, research‑engineering style role, this is for you.

What You Will Do

- Create Task Evaluations That Matter: Design and implement task‑specific evaluations that measure and improve agent quality. Each eval should both drive concrete iteration on our agents and spark broader innovation around the task itself.
- Define Tasks, Datasets, and Harnesses: Clearly specify tasks, collect and curate balanced datasets, and build robust evaluation harnesses that can be used across agents and modeling approaches. There is ample room for architectural design and systems thinking here.
- Build and Use a Reusable Evaluation Framework: Develop frameworks and tools for running evaluations at scale. Use these frameworks to tune existing agents and to guide the development of new ones in our environment.
- Explore Agent Orchestration Strategies: Investigate and implement orchestration patterns (tooling, routing, decomposition, multi‑agent setups, etc.) that allow agents to tackle increasingly complex, multi‑step, and long‑horizon tasks.
- Apply Post‑Training Techniques: Experiment with post‑training approaches (e.g., fine‑tuning, preference optimization, reward shaping, distillation) to produce high‑performance models tailored to specific tasks and workflows.
- Run Experiments End‑to‑End: Design, run, and analyze experiments with rigor. Turn experimental results into clear recommendations and concrete changes to model configurations, prompts, and system design.
- Collaborate Deeply Across the Stack: Work closely with founders, product, and infrastructure engineers to ensure evaluations, agents, and platform primitives all reinforce each other.

What You Will Bring

- MS or Ph.D. in a relevant field (e.g., Computer Science, Machine Learning, NLP) or equivalent practical experience
- Strong background in machine learning and large language models, ideally including both research and hands‑on implementation
- 2–5 years working with LLM technology, with familiarity across:
  - Prompting and interaction patterns
  - Agent and tool orchestration strategies
  - Evaluation strategies for complex, open‑ended tasks
- Proficiency writing production‑quality code, especially in Python; comfort working with TypeScript or modern web/backend stacks
- Experience designing and running experiments, and interpreting results in messy, real‑world settings
- Self‑motivated, comfortable operating in an unstructured, high‑ambiguity environment
- Strong communication skills and the ability to translate vague goals into concrete, testable setups

Bonus Points

- Experience building agentic systems (tool‑using agents, workflows, or multi‑agent systems) in real products
- Prior work on model evaluation frameworks, benchmarking, or reliability/robustness testing
- Familiarity with modern ML tooling (training/inference stacks, experiment tracking, data pipelines)
- Contributions to open‑source LLM, tooling, or evaluation projects
- Experience at an early‑stage startup or research lab where you owned projects end‑to‑end

Benefits & Perks

- Significant equity in an early‑stage, venture‑backed startup
- Comprehensive Health Benefits (Medical, Dental, Vision)
- Flexible PTO to ensure you have the time you need to recharge

Vacancy posted 5 days ago

