AI Model Evaluation Program Lead
$300k - $320k
Anthropic
About the role:

We are seeking a Technical Program Manager to lead our AI model evaluation initiatives across multiple workstreams. This role will be crucial in assessing the performance, capabilities, limitations, and potential risks of our AI models. Working closely with our Research, Trust & Safety, Frontier Redteaming, and Policy teams, you will drive high-priority evaluation projects to build new processes, align metrics with policy, and track measurable progress. You will help build and adapt the model evaluation program to ensure model deployments are rigorous and aligned with our commitment to responsible AI development. The ideal candidate will have a strong technical background and experience managing cross-functional programs in AI development, ML engineering, or related fields.

You’ll be joining a team of Technical Program Managers who own and drive cross-functional programs that align to the company’s top priorities. In this role, you’ll have the opportunity to make a foundational impact as you contribute to the scaling of a centralized TPM function for the company. Extremely strong soft skills are paramount, as our team is front and center in driving many company-wide changes and top-priority initiatives that require generating buy-in, balancing various opinions, and competing for attention in our rapidly scaling environment.

This role is a great fit for someone who has both seen excellence at scale and operated in rapidly scaling, high-ambiguity teams and scope. We are seeking candidates who have deep TPM expertise but are comfortable acting as adaptable generalists who add value fast. We excel at maintaining a broad view of our work while diving deep into the details when necessary. We understand business goals, translate and organize them into technical programs and projects, and drive execution. We are adept at engaging with both non-technical and technical stakeholders at all levels of the company, including executive leadership.
In this role, you will have the opportunity to shape the development of advanced AI systems and contribute to Anthropic's mission of ensuring that AI benefits all of humanity. If you are passionate about responsible AI development, have a strong technical background, and thrive in a fast-paced, collaborative environment, we'd love to hear from you.

Responsibilities:

- Partner with teams like Frontier Risk Evaluations, Security, and Trust & Safety to develop and implement comprehensive evaluation protocols for our latest frontier AI models
- Build a single source of truth for tracking all types of model evaluations as required by our Responsible Scaling Policy, AI safety institutes, the White House, and others
- Develop and maintain procedures for conducting evaluations, including designing test suites, coordinating red team exercises, and analyzing results
- Create and manage dashboards and reporting systems to track model performance, safety metrics, and evaluation outcomes across different AI systems and versions
- Lead cross-functional workshops to identify potential risks and edge cases for evaluation, ensuring thorough coverage of AI capabilities and limitations
- Coordinate with external partners and industry standards bodies to align our evaluation practices with emerging best practices in responsible AI development
- Provide detailed status reports, identifying technical risks, dependencies, and areas requiring additional support
- Facilitate communication and coordination between technical workstreams and stakeholders
- Continuously identify opportunities for technical process improvements and implement changes as needed
- Stay up-to-date with the latest developments in AI safety, ML engineering, and related fields to ensure the program remains at the forefront of responsible AI development

You might be a good fit if you:

- Have several years of experience in technical program management, with a track record of successfully delivering complex technical programs, preferably in AI development, ML engineering, or related fields
- Have experience executing technical programs that require systems and engineering-level knowledge
- Have exceptionally strong interpersonal and communication skills that enable you to influence without authority, build cross-organizational support, cooperation and action around initiatives and process adoption
- Have experience prompt engineering on language models
- Have experience designing and/or running evaluations on Large Language Models
- Have knowledge of emerging AI governance frameworks and best practices
- Have a high threshold for navigating ambiguity and are able to balance setting strategic priorities with rapid, high-quality execution
- Thrive in unstructured environments, and have a knack for bringing order to chaos

The expected salary range for this position is:

Annual Salary:
$300,000—$320,000 USD
Logistics

Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.

US visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate; operations roles are especially difficult to support. But if we make you an offer, we will make every effort to get you into the United States, and we retain an immigration lawyer to help with this.

We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work. We think AI systems like the ones we're building have enormous social and ethical implications. We think this makes representation even more important, and we strive to include a range of diverse perspectives on our team.

Compensation and Benefits

Anthropic’s compensation package consists of three elements: salary, equity, and benefits. We are committed to pay fairness and aim for these three elements collectively to be highly competitive with market rates.

Equity - For eligible roles, equity will be a major component of the total compensation. We aim to offer higher-than-average equity compensation for a company of our size, and communicate equity amounts at the time of offer issuance.

US Benefits - The following benefits are for our US-based employees:

- Optional equity donation matching.
- Comprehensive health, dental, and vision insurance for you and all your dependents.
- 401(k) plan with 4% matching.
- 22 weeks of paid parental leave.
- Unlimited PTO – most staff take between 4-6 weeks each year, sometimes more!
- Stipends for education, home office improvements, commuting, and wellness.
- Fertility benefits via Carrot.
- Daily lunches and snacks in our office.
- Relocation support for those moving to the Bay Area.

UK Benefits - The following benefits are for our UK-based employees:

- Optional equity donation matching.
- Private health, dental, and vision insurance for you and your dependents.
- Pension contribution (matching 4% of your salary).
- 21 weeks of paid parental leave.
- Unlimited PTO – most staff take between 4-6 weeks each year, sometimes more!
- Health cash plan.
- Life insurance and income protection.
- Daily lunches and snacks in our office.


