AI Engineer, Agents & Evaluation

Guild.ai, Inc.

We’re looking for our first AI Engineer focused on agents and evaluation: a foundational hire who will shape how we build, measure, and scale intelligent systems.

The Opportunity: Design the Playbook for High-Performance AI Agents

We’re tackling one of the hardest, and most important, problems in software engineering: helping developers understand, evolve, and operate complex systems using autonomous and event‑driven AI. In this role, you’ll build the evaluation frameworks, task harnesses, and orchestration strategies that make our agents reliable, testable, and genuinely useful.

Your work will not only directly improve our agents; it will create reusable benchmarks and artifacts that can inspire new approaches and push forward the broader foundation model ecosystem.

If you love designing experiments, building systems, and iterating tightly between theory and code, and you’re excited by a very 0→1, research‑engineering style role, this is for you.

What You Will Do

  • Create Task Evaluations That Matter: Design and implement task‑specific evaluations that measure and improve agent quality. Each eval should both drive concrete iteration on our agents and spark broader innovation around the task itself.
  • Define Tasks, Datasets, and Harnesses: Clearly specify tasks, collect and curate balanced datasets, and build robust evaluation harnesses that can be used across agents and modeling approaches. There is ample room for architectural design and systems thinking here.
  • Build and Use a Reusable Evaluation Framework: Develop frameworks and tools for running evaluations at scale. Use these frameworks to tune existing agents and to guide the development of new ones in our environment.
  • Explore Agent Orchestration Strategies: Investigate and implement orchestration patterns (tooling, routing, decomposition, multi‑agent setups, etc.) that allow agents to tackle increasingly complex, multi‑step, and long‑horizon tasks.
  • Apply Post‑Training Techniques: Experiment with post‑training approaches (e.g., fine‑tuning, preference optimization, reward shaping, distillation) to produce high‑performance models tailored to specific tasks and workflows.
  • Run Experiments End‑to‑End: Design, run, and analyze experiments with rigor. Turn experimental results into clear recommendations and concrete changes to model configurations, prompts, and system design.
  • Collaborate Deeply Across the Stack: Work closely with founders, product, and infrastructure engineers to ensure evaluations, agents, and platform primitives all reinforce each other.

What You Will Bring

  • MS or Ph.D. in a relevant field (e.g., Computer Science, Machine Learning, NLP) or equivalent practical experience
  • Strong background in machine learning and large language models, ideally including both research and hands‑on implementation
  • 2–5 years working with LLM technology, with familiarity across:
      • Prompting and interaction patterns
      • Agent and tool orchestration strategies
      • Evaluation strategies for complex, open‑ended tasks
  • Proficiency writing production‑quality code, especially in Python; comfort working with TypeScript or modern web/backend stacks
  • Experience designing and running experiments, and interpreting results in messy, real‑world settings
  • Self‑motivated, comfortable operating in an unstructured, high‑ambiguity environment
  • Strong communication skills and the ability to translate vague goals into concrete, testable setups

Bonus Points

  • Experience building agentic systems (tool‑using agents, workflows, or multi‑agent systems) in real products
  • Prior work on model evaluation frameworks, benchmarking, or reliability/robustness testing
  • Familiarity with modern ML tooling (training/inference stacks, experiment tracking, data pipelines)
  • Contributions to open‑source LLM, tooling, or evaluation projects
  • Experience at an early‑stage startup or research lab where you owned projects end‑to‑end

Benefits & Perks

  • Significant equity in an early‑stage, venture‑backed startup
  • Comprehensive Health Benefits (Medical, Dental, Vision)
  • Flexible PTO to ensure you have the time you need to recharge

Vacancy posted 5 days ago
