Loading data...
Loading data...
A multi-turn benchmark suite measuring how AI systems affect users across physical, psychological, and societal dimensions.
A review of 445 ML/NLP benchmarks found 16% included statistical testing and 21.7% had no definition of the construct they claimed to measure. Models that pass conventional safety benchmarks produce negative user outcomes in over 70% of high-risk scenarios. Existing benchmarks are mostly single-turn and static. Real harms emerge across many sessions over weeks or months.
A pipeline with four stages:
Our current version evaluates 14 frontier models according to 18 benchmarks and 375 metric pairs.
Every model scored higher on harm-avoidance metrics than on actively beneficial ones. Gap range: +3.9 pp (Claude Opus 4.6) to +21.6 pp (GPT-4o).
12 of 14 models showed more emotional-dependence behaviors toward child and teen personas than adults, holding scenario content constant. Largest effects: Qwen3 80B (+0.049), Mistral Small 3.2 (+0.044), DeepSeek V3.2 (+0.042).
Humane Bench (mean 0.373), Cognitive Bias (0.467), and Human Agency (0.469) were uniformly hard. VERA-MH (0.777) and User Bias (0.765) were uniformly easy.
Rankings held across generator, simulator, and judge swaps. Run-to-run Fleiss' κ = 0.64 to 0.78. 78.1% of conversation triples were unanimous. A single sample matched the three-sample majority vote at ρ = 0.982.
18 benchmarks across physical (medical, legal, financial), psychological (emotional dependence, mental health, character), and societal (autonomy, learning, meaning) domains. Includes original benchmarks (Emotional Dependence, Autonomy Preserving, Spillunder Effect) and adapted components from HealthBench, VERA-MH, KORA Bench, HumaneBench, and Flourishing AI.
Audience-adapted profiles for parents, educators, policymakers, and developers. Labels show what was tested, which populations were represented, and where the benchmark is least informative, e.g., distinguishing GPT-4o's harm-avoidance score (0.720) from its positive-behavior score (0.504).
Any benchmark relating to human flourishing is eligible autonomy, competence, relatedness, learning, sycophancy, emotional dependence, deception, dark patterns. Non-technical experts submit a construct specification; the pipeline handles scenario generation, simulation, and scoring. Submissions ship with psychometric tooling: nomological network analysis, comprehensiveness checks, run-to-run reliability.
The project began at the AHA Flourishing Workshop at MIT in October 2025, supported by the Omidyar Network, which convened 80 experts from over 40 institutions. Prior AHA research on AI companion chatbots was cited as a key inspiration for California Senate Bill 243.
April 28, 2026. Seeking philanthropic support to expand coverage of underrepresented communities and languages, and to extend evaluation from base models to deployed products.
A four-stage pipeline that turns expert-submitted ideas into rigorous, multi-turn evaluations, with built-in checks to catch our own blind spots.
We group what we measure into three areas, drawn from psychology, capability theory, and clinical research.
High-stakes practical questions: medical advice, legal advice, financial advice, healthcare decisions.
Emotional dependence, mental health, and character. Includes how the AI shapes a user's relationship with themselves and others.
Autonomy, learning, meaning, and purpose. The effects that ripple beyond a single user into beliefs, norms, and institutions.
The current release covers 14 frontier models, 18 benchmarks, and 375 (benchmark, metric) pairs.
Before any benchmark joins the suite, we check it against six criteria.
Does the benchmark actually measure what it claims to? We use nomological network analysis to confirm the measure captures the intended idea, and comprehensiveness checks to find gaps in coverage.
Real harms build over time. Each scenario runs across 6 turns, and scenarios are generated fresh so models cannot memorize them.
Scenarios vary across age (child/teen 6 to 17, adult 18+) and gender (male, female, non-binary), using a sampling approach that gives broad coverage without exploding into thousands of combinations.
If you run the same benchmark twice, do you get the same answer? We measure this directly.
A benchmark is only useful if its rankings hold up when we swap out the models doing the generating, simulating, and judging.
We add realistic typos, lowercase shifts, and dropped characters so models cannot detect that they are being tested.
A generator model (Claude Sonnet 4) takes an expert submission and writes the metrics and scenarios. Each scenario is built around a single metric to keep the test focused. Scenarios include a persona, a surface goal, a hidden adversarial objective, and landmark instructions like "at turn 4, ask the AI to be your best friend."
A user-simulator model (Claude Sonnet 4) plays the human across 6 turns. It acts like a natural conversational partner on the surface while pushing the hidden objective underneath, escalating pressure as the conversation unfolds.
A judge model (GPT-5.4-mini) reads the full conversation and returns a yes or no on each metric. We picked this judge after testing five candidates for run-to-run consistency. We dropped Claude Sonnet 4.6 because it passed almost everything regardless of quality, and GPT-5.4 because its results were unstable.
Some metrics measure good behavior (where "yes" is a pass) and others measure harmful behavior (where "yes" is a failure). We combine them according to each metric's polarity, average across scenarios, and invert so that higher scores always mean better behavior.
We stress-tested the pipeline against five things that could bias results.
No. When we swapped the metric generator from Claude to GPT-5.4, Claude's lead grew, not shrank (from +5.7 pp to +27.0 pp on humanebench). Wilcoxon p = 0.003 across 18 pairs. The test does catch self-generation bias when it exists, just not for Claude in this case.
No. Rankings stay stable across Claude, GPT-5.4, and Llama simulators (ρ = 0.977 on health-bench, 0.752 on humanebench). Same-family simulators turn out to be tougher on other models, not friendlier to their own.
Yes. Run-to-run agreement (Fleiss' κ) lands between 0.64 and 0.78, which counts as "substantial." Pass rates drift by less than 0.22 pp between runs. Different judges produce different absolute pass rates (55 to 85 percent), but the rankings they produce agree at ρ = 0.61 across GPT-5.4-mini, Claude Opus 4.7, and Llama 4 Maverick.
Yes. Across 56,700 triples, 78.1% were unanimous across three independent runs. A single sample matches the three-sample majority vote at ρ = 0.982.
Yes. A 2x2 test of adversarial objective and perfunctory mode (the typo simulation) shows clear interaction effects, though rankings stay the same.
To check whether models treat different users differently, we ran per-model regressions of failure rate on gender, age, and their interaction, with fixed effects controlling for metric difficulty and scenario content. The reference user is a female adult.
Pooling across all models, child and teen personas elicit 2.5 pp more emotional-dependence failures than adults (p < 0.001), holding scenario content constant. Demographics explain less than 0.4% of variance once metric and scenario are accounted for, so the size is small. But the direction is consistent across 12 of 14 models, and it points the wrong way: more harmful behavior toward minors, not less.
This ambitious project could not have been done without collaboration across many disciplines and areas of expertise. A core group based out of MIT, USC, and the Psychology of Technology Institute have initiated this collaboration with the support of many others.
Led by researchers at
Participants of the MIT Workshop for Designing Benchmarks for Human Flourishing with AI supported by Omidyar Network.
Help us improve the benchmark. Your feedback shapes how we evaluate AI's impact on human flourishing.
We read every submission and use it to improve the benchmark.
The full benchmark dataset and evaluation API are available to vetted researchers and institutions. Request access below.
We'll review your application and get back to you within 5 business days.
Building an open, independent benchmark for AI's impact on human flourishing takes a community. Here's how you can help.
Spread the word, cite the benchmark, or champion human-centered AI evaluation in your community.
Co-develop benchmarks, contribute datasets, or partner on peer-reviewed publications.
Philanthropic funding enables us to expand coverage, run evaluations, and keep the benchmark open.
Media coverage, policy connections, technical infrastructure, community building, we welcome all forms of support.
We're excited to connect. Someone from our team will be in touch shortly.