Evaluations - Member of Technical Staff
$200k - $400k
Simile
About the Company

Pilots don't train with real passengers. Actors don't rehearse with real audiences. Yet, the most consequential decisions in society are often pushed straight to production. Simile is changing that. We have built the first AI simulation of society, populated by generative agents based on real humans. Our research pioneered the field of AI-based simulation, proving it is possible to model human behavior with high accuracy. Today, we are developing a Foundation Model to predict human behavior in any situation, at any scale.

We are backed by $100M in funding led by Index Ventures, with participation from Hanabi, A*, Bain Capital Ventures, and AI visionaries including Andrej Karpathy, Fei-Fei Li, Adam D'Angelo, and Guillermo Rauch.

About the Role

As a Member of Technical Staff, Model Evaluations at Simile, you will build the measurement systems that determine whether our simulations of human behavior are accurate, trustworthy, and useful enough to guide real-world decisions. You will help shape what Simile measures, the quality bars we defend, and how evaluation evidence guides model, product, and customer decisions.

Evaluation at Simile brings together model evals, statistics, behavioral science, research methodology, product quality, and human judgment. Our models simulate people, populations, markets, and groups, which means our evals must reason about distributions, noisy human ground truth, uncertainty, qualitative outputs, behavioral data, and customer decision-making.

You will work with unusually rich data about human behavior, including surveys, long-form interviews, customer studies, qualitative research, and behavioral signals such as transactions, product interactions, and other real-world traces.

We are hiring across several forms of expertise. Some candidates may be deep in LLM evaluation, model training, and research engineering.
Others may bring exceptional strength in statistics, behavioral science, survey methodology, human data, product evaluation, or experimentation. Across backgrounds, we are looking for people who can reason clearly, build quickly, use agentic coding tools fluently, and take hands-on ownership of ambiguous evaluation problems.

The core question for this role is simple: How do we know when a simulation of human behavior is good enough to trust?

In this role, you will:

Build the measurement layer for behavioral simulation: Design evals, metrics, rubrics, datasets, dashboards, and workflows that measure whether Simile’s models are accurately predicting human behavior across customer use cases, populations, question types, and decision contexts.

Partner with modeling to improve models: Evaluate new model versions, diagnose regressions, identify priority areas for model-improvement cycles, and maintain stable eval suites that represent capabilities customers actually care about.

Contribute to product and applied evals: Build evals for qualitative responses, retrieval, survey generation, AI-generated research reports, customer-facing outputs, and other product surfaces where model quality directly shapes customer trust. Turn subjective quality concerns into concrete rubrics, labeled data, automated graders, release criteria, and model-improvement signals.

Make ground truth and uncertainty legible: Develop rigorous ways to compare simulated responses against human data, customer studies, Simile-collected ground truth, and behavioral datasets. Help the company reason about sampling error, uncertainty, calibration, margin of error, representativeness, and what “ground truth” means when human behavior is inherently noisy.

Automate evaluation workflows: Use modern agentic coding tools to rapidly build internal tools, inspect model outputs, create labeling workflows, validate evals, and turn fuzzy evaluation questions into working systems.
We value people who can compress long, ambiguous projects into fast, useful prototypes without losing sight of rigor or reliability.

Help define the future of behavioral simulation evals: Prototype ways to evaluate behavioral predictions using diverse sources of data, including transaction or purchase behavior, product interactions, intervention response, first-party experiments, and eventually multi-agent group settings.

Requirements

Must Haves

Evaluation Taste: You have strong intuition for what makes an eval meaningful, robust, and decision-relevant. You can explain what an eval measures, what it does not measure, how it can be gamed, and why it should or should not affect a model or product decision.

LLM and Model Fluency: You understand the basics of modern LLM training, post-training, model evaluation, and hill-climbing. You do not need to be a modeling specialist, but you can read model outputs, understand modeling team needs, and reason about whether a model change actually improved the thing we care about.

Statistical Judgment: You are comfortable reasoning about noisy data, uncertainty, sampling, distributions, calibration, confidence intervals, measurement validity, bias, variance, and the difference between an observed result and the underlying population quantity it estimates.

Technical and Agentic Execution: You can build internal tools, scripts, dashboards, labeling workflows, analyses, or automated eval pipelines quickly. You are comfortable working with data and automation tools such as Python, SQL, R, notebooks, LLM APIs, and agentic coding tools such as Codex, Claude Code, Cursor, or equivalent systems. You know how to move quickly while still validating outputs, catching errors, and planning for the long term.

Hands-On Ownership: You can independently drive a workstream while still doing the work yourself.
You are willing to build the first version, inspect the data, debug the workflow, write the rubric, revise the metric, and keep going until the evaluation system is useful.

Nice to Haves

We do not expect one person to have all of these. We are hiring a team with complementary strengths.

Modeling / Model-Quality Dashboards: Experience building model evaluation dashboards, regression suites, release gates, benchmark sets, model comparison workflows, or systems that help ML teams decide where to focus and when to ship.

LLM-as-Judge and Human Data: Experience designing rubrics, automated graders, pairwise comparisons, expert review workflows, labeling interfaces, grader calibration, or human/model hybrid evaluation systems.

Survey Methodology and Statistics: Experience with sampling, weighting, margin of error, power analysis, uncertainty quantification, Bayesian modeling, causal inference, psychometrics, polling, or measurement theory.

Behavioral Simulation: Experience evaluating behavioral predictions beyond self-reported survey responses, such as transaction data, purchase behavior, mobility data, product interactions, or other passively collected behavioral signals.

Behavioral Economics / Experimentation: Experience designing RCTs, A/B tests, survey experiments, vignette studies, field experiments, behavioral games, or intervention studies.

Multi-Agent or Group Behavior: Interest or experience in modeling group conversation, deliberation, focus groups, juries, committees, polarization, collective decision-making, or social influence.

You might be a great fit if you have worked in LLM evals, applied ML research, data science, research engineering, human data, market research, UXR, polling, behavioral science, computational social science, or behavioral economics. You might also be a recent graduate or self-directed builder with unusually strong taste in evaluation, statistics, and AI tools. You do not need to match every bullet.
If you do not perfectly see yourself in this JD but believe you would be exceptional at building the measurement layer for behavioral simulation, we would love to hear from you.

Compensation & Benefits

At Simile, we provide competitive compensation packages that include base salary, equity, and comprehensive benefits.

Salary Range: $200,000 – $400,000 USD

Note: Final offers are based on experience, specialized skills, interview performance, and relevant training.

Equity: Grants are available for eligible roles, subject to board approval.

Health & Wellness: Comprehensive medical, dental, and vision coverage.

Time Off: Flexible time off policies to support work-life balance.

Our Process

We prioritize thoughtful conversations and clear examples of past work. Our hiring journey is designed to help both sides align on fit, working style, and expectations.

Reapplication Policy: To ensure a fair and thorough evaluation for all applicants, Simile observes a 90-day waiting period before reconsidering candidates for the same role.

Commitment to Diversity & Inclusion

Equal Opportunity: Simile is an equal opportunity workplace. We welcome applicants of all backgrounds and identities, valuing an environment where everyone can contribute authentically.

Accommodations: If you require support or reasonable accommodations during the application process due to a disability, please let us know. We are happy to assist.

