AI Engineer, Agents & Evaluation
Guild.ai, Inc.
We’re looking for our first AI Engineer focused on agents and evaluation, a foundational hire who will shape how we build, measure, and scale intelligent systems.

The Opportunity: Design the Playbook for High-Performance AI Agents

We’re tackling one of the hardest and most important problems in software engineering: helping developers understand, evolve, and operate complex systems using autonomous and event‑driven AI. In this role, you’ll build the evaluation frameworks, task harnesses, and orchestration strategies that make our agents reliable, testable, and genuinely useful. Your work will not only directly improve our agents; it will create reusable benchmarks and artifacts that can inspire new approaches and push forward the broader foundation model ecosystem.

If you love designing experiments, building systems, and iterating tightly between theory and code, and you’re excited by a very 0→1, research‑engineering style role, this is for you.

What You Will Do

- Create Task Evaluations That Matter: Design and implement task‑specific evaluations that measure and improve agent quality. Each eval should both drive concrete iteration on our agents and spark broader innovation around the task itself.
- Define Tasks, Datasets, and Harnesses: Clearly specify tasks, collect and curate balanced datasets, and build robust evaluation harnesses that can be used across agents and modeling approaches. There is ample room for architectural design and systems thinking here.
- Build and Use a Reusable Evaluation Framework: Develop frameworks and tools for running evaluations at scale. Use these frameworks to tune existing agents and to guide the development of new ones in our environment.
- Explore Agent Orchestration Strategies: Investigate and implement orchestration patterns (tooling, routing, decomposition, multi‑agent setups, etc.) that allow agents to tackle increasingly complex, multi‑step, and long‑horizon tasks.
- Apply Post‑Training Techniques: Experiment with post‑training approaches (e.g., fine‑tuning, preference optimization, reward shaping, distillation) to produce high‑performance models tailored to specific tasks and workflows.
- Run Experiments End‑to‑End: Design, run, and analyze experiments with rigor. Turn experimental results into clear recommendations and concrete changes to model configurations, prompts, and system design.
- Collaborate Deeply Across the Stack: Work closely with founders, product, and infrastructure engineers to ensure evaluations, agents, and platform primitives all reinforce each other.

What You Will Bring

- MS or Ph.D. in a relevant field (e.g., Computer Science, Machine Learning, NLP) or equivalent practical experience
- Strong background in machine learning and large language models, ideally including both research and hands‑on implementation
- 2–5 years working with LLM technology, with familiarity across:
  - Prompting and interaction patterns
  - Agent and tool orchestration strategies
  - Evaluation strategies for complex, open‑ended tasks
- Proficiency writing production‑quality code, especially in Python; comfort working with TypeScript or modern web/backend stacks
- Experience designing and running experiments, and interpreting results in messy, real‑world settings
- Self‑motivated, comfortable operating in an unstructured, high‑ambiguity environment
- Strong communication skills and the ability to translate vague goals into concrete, testable setups

Bonus Points

- Experience building agentic systems (tool‑using agents, workflows, or multi‑agent systems) in real products
- Prior work on model evaluation frameworks, benchmarking, or reliability/robustness testing
- Familiarity with modern ML tooling (training/inference stacks, experiment tracking, data pipelines)
- Contributions to open‑source LLM, tooling, or evaluation projects
- Experience at an early‑stage startup or research lab where you owned projects end‑to‑end

Benefits & Perks

- Significant equity in an early‑stage, venture‑backed startup
- Comprehensive Health Benefits (Medical, Dental, Vision)
- Flexible PTO to ensure you have the time you need to recharge

