
Evaluations - Member of Technical Staff

$200k - $400k
Full-time

Simile

About the Company

Pilots don't train with real passengers. Actors don't rehearse with real audiences. Yet, the most consequential decisions in society are often pushed straight to production. Simile is changing that. We have built the first AI simulation of society, populated by generative agents based on real humans. Our research pioneered the field of AI-based simulation, proving it is possible to model human behavior with high accuracy. Today, we are developing a Foundation Model to predict human behavior in any situation, at any scale.

We are backed by $100M in funding led by Index Ventures, with participation from Hanabi, A*, Bain Capital Ventures, and AI visionaries including Andrej Karpathy, Fei-Fei Li, Adam D'Angelo, and Guillermo Rauch.

About the Role

As a Member of Technical Staff, Model Evaluations at Simile, you will build the measurement systems that determine whether our simulations of human behavior are accurate, trustworthy, and useful enough to guide real-world decisions. You will help shape what Simile measures, the quality bars we defend, and how evaluation evidence guides model, product, and customer decisions.

Evaluation at Simile brings together model evals, statistics, behavioral science, research methodology, product quality, and human judgment. Our models simulate people, populations, markets, and groups, which means our evals must reason about distributions, noisy human ground truth, uncertainty, qualitative outputs, behavioral data, and customer decision-making. You will work with unusually rich data about human behavior, including surveys, long-form interviews, customer studies, qualitative research, and behavioral signals such as transactions, product interactions, and other real-world traces.

We are hiring across several forms of expertise. Some candidates may be deep in LLM evaluation, model training, and research engineering. Others may bring exceptional strength in statistics, behavioral science, survey methodology, human data, product evaluation, or experimentation. Across backgrounds, we are looking for people who can reason clearly, build quickly, use agentic coding tools fluently, and take hands-on ownership of ambiguous evaluation problems.

The core question for this role is simple: How do we know when a simulation of human behavior is good enough to trust?

In this role, you will:
  • Build the measurement layer for behavioral simulation: Design evals, metrics, rubrics, datasets, dashboards, and workflows that measure whether Simile’s models are accurately predicting human behavior across customer use cases, populations, question types, and decision contexts.
  • Partner with modeling to improve models: Evaluate new model versions, diagnose regressions, identify priority areas for model-improvement cycles, and maintain stable eval suites that represent capabilities customers actually care about.
  • Contribute to product and applied evals: Build evals for qualitative responses, retrieval, survey generation, AI-generated research reports, customer-facing outputs, and other product surfaces where model quality directly shapes customer trust. Turn subjective quality concerns into concrete rubrics, labeled data, automated graders, release criteria, and model-improvement signals.
  • Make ground truth and uncertainty legible: Develop rigorous ways to compare simulated responses against human data, customer studies, Simile-collected ground truth, and behavioral datasets. Help the company reason about sampling error, uncertainty, calibration, margin of error, representativeness, and what “ground truth” means when human behavior is inherently noisy.
  • Automate evaluation workflows: Use modern agentic coding tools to rapidly build internal tools, inspect model outputs, create labeling workflows, validate evals, and turn fuzzy evaluation questions into working systems. We value people who can compress long, ambiguous projects into fast, useful prototypes without losing sight of rigor or reliability.
  • Help define the future of behavioral simulation evals: Prototype ways to evaluate behavioral predictions using diverse sources of data, including transaction or purchase behavior, product interactions, intervention response, first-party experiments, and eventually multi-agent group settings.

Requirements

Must Haves
  • Evaluation Taste: You have strong intuition for what makes an eval meaningful, robust, and decision-relevant. You can explain what an eval measures, what it does not measure, how it can be gamed, and why it should or should not affect a model or product decision.
  • LLM and Model Fluency: You understand the basics of modern LLM training, post-training, model evaluation, and hill-climbing. You do not need to be a modeling specialist, but you can read model outputs, understand modeling team needs, and reason about whether a model change actually improved the thing we care about.
  • Statistical Judgment: You are comfortable reasoning about noisy data, uncertainty, sampling, distributions, calibration, confidence intervals, measurement validity, bias, variance, and the difference between an observed result and the underlying population quantity it estimates.
  • Technical and Agentic Execution: You can build internal tools, scripts, dashboards, labeling workflows, analyses, or automated eval pipelines quickly. You are comfortable working with data and automation tools such as Python, SQL, R, notebooks, LLM APIs, and agentic coding tools such as Codex, Claude Code, Cursor, or equivalent systems. You know how to move quickly while still validating outputs, catching errors, and planning for the long term.
  • Hands-On Ownership: You can independently drive a workstream while still doing the work yourself. You are willing to build the first version, inspect the data, debug the workflow, write the rubric, revise the metric, and keep going until the evaluation system is useful.

Nice to Haves

We do not expect one person to have all of these. We are hiring a team with complementary strengths.
  • Modeling / Model-Quality Dashboards: Experience building model evaluation dashboards, regression suites, release gates, benchmark sets, model comparison workflows, or systems that help ML teams decide where to focus and when to ship.
  • LLM-as-Judge and Human Data: Experience designing rubrics, automated graders, pairwise comparisons, expert review workflows, labeling interfaces, grader calibration, or human/model hybrid evaluation systems.
  • Survey Methodology and Statistics: Experience with sampling, weighting, margin of error, power analysis, uncertainty quantification, Bayesian modeling, causal inference, psychometrics, polling, or measurement theory.
  • Behavioral Simulation: Experience evaluating behavioral predictions beyond self-reported survey responses, such as transaction data, purchase behavior, mobility data, product interactions, or other passively collected behavioral signals.
  • Behavioral Economics / Experimentation: Experience designing RCTs, A/B tests, survey experiments, vignette studies, field experiments, behavioral games, or intervention studies.
  • Multi-Agent or Group Behavior: Interest or experience in modeling group conversation, deliberation, focus groups, juries, committees, polarization, collective decision-making, or social influence.

You might be a great fit if you have worked in LLM evals, applied ML research, data science, research engineering, human data, market research, UXR, polling, behavioral science, computational social science, or behavioral economics. You might also be a recent graduate or self-directed builder with unusually strong taste in evaluation, statistics, and AI tools.

You do not need to match every bullet. If you do not perfectly see yourself in this JD but believe you would be exceptional at building the measurement layer for behavioral simulation, we would love to hear from you.

Compensation & Benefits

At Simile, we provide competitive compensation packages that include base salary, equity, and comprehensive benefits.
  • Salary Range: $200,000 – $400,000 USD. Note: Final offers are based on experience, specialized skills, interview performance, and relevant training.
  • Equity: Grants are available for eligible roles, subject to board approval.
  • Health & Wellness: Comprehensive medical, dental, and vision coverage.
  • Time Off: Flexible time off policies to support work-life balance.

Our Process

We prioritize thoughtful conversations and clear examples of past work. Our hiring journey is designed to help both sides align on fit, working style, and expectations.

Reapplication Policy: To ensure a fair and thorough evaluation for all applicants, Simile observes a 90-day waiting period before reconsidering candidates for the same role.

Commitment to Diversity & Inclusion

Equal Opportunity: Simile is an equal opportunity workplace. We welcome applicants of all backgrounds and identities, valuing an environment where everyone can contribute authentically.

Accommodations: If you require support or reasonable accommodations during the application process due to a disability, please let us know. We are happy to assist.

Vacancy posted 17 hours ago
