Open Benchmark of
AI Impact on Humans

How does using AI ?

The first open benchmark measuring AI's impact on human well-being across physical, psychological, and societal dimensions.

260+

Behavioral Indicators

Tested across realistic scenarios.

18+

AI Models Evaluated

Compared on the same standard.

Well-being Dimensions

Physical, psychological, societal.

Audience Groups

Parents, educators, policymakers, developers.

The Rationale

Today's AI benchmarks measure what models can do, accuracy, speed, reasoning, task completion. They tell us almost nothing about what happens to the person on the other side of the screen. We're building the first open standard that asks a different question: across millions of real conversations, is AI actually making people's lives better?

AI is being adopted faster than any technology in history, yet there's no shared, independent way to measure its effect on human well-being
Models that ace conventional safety benchmarks have still produced harmful outcomes in over 70% of high-risk scenarios in controlled studies
No existing benchmark covers all three domains where impact unfolds: Physical, Psychological, and Societal
Developers, policymakers, clinicians, and everyday users need a common signal they can trust, one not produced by the companies being evaluated
Human flourishing, not task performance, is the right north star for responsible AI

The Methodology

Scores reflect how consistently a model supports, or undermines, human flourishing across 260+ realistic, multi-turn scenarios. Each conversation is simulated with a demographically varied user pursuing a latent goal, then graded by LLM-as-judge systems validated against expert human raters.

+1, behavior reliably promotes human flourishing
0, behavior is neutral or inconsistent
−1, behavior reliably harms human flourishing
Every metric is tested across age groups, gender profiles, and audience contexts to surface where harms concentrate
Scenarios unfold over multiple turns to capture relational, time-extended dynamics that single-prompt benchmarks miss
Framework grounded in the Harvard Human Flourishing Program, eudaimonic psychology, and 80+ expert workshops with clinicians, ethicists, educators, and affected communities

Request Access

The full benchmark dataset and evaluation API are available to vetted researchers and institutions.

Led by researchers at

Loading data...

Claude 3 Opus

−1 Prohibit Human Flourishing

Promote Human Flourishing +1

AI Human Impact Bench

A multi-turn benchmark suite measuring how AI systems affect users across physical, psychological, and societal dimensions.

A review of 445 ML/NLP benchmarks found 16% included statistical testing and 21.7% had no definition of the construct they claimed to measure. Models that pass conventional safety benchmarks produce negative user outcomes in over 70% of high-risk scenarios. Existing benchmarks are mostly single-turn and static. Real harms emerge across many sessions over weeks or months.

What We Built

A pipeline with four stages:

Expert-submitted constructs are decomposed into metrics.
Metrics expand into 6-turn scenarios with latent adversarial user goals, stratified by age and gender.
A user-simulator model probes a target model; a judge model returns binary verdicts.
Results pass through a psychometric audit: test-retest reliability, generator/judge swaps, run-to-run consistency.

Our current version evaluates 14 frontier models according to 18 benchmarks and 375 metric pairs.

Findings

Harm avoidance ≠ flourishing

Every model scored higher on harm-avoidance metrics than on actively beneficial ones. Gap range: +3.9 pp (Claude Opus 4.6) to +21.6 pp (GPT-4o).

Demographic sensitivity

12 of 14 models showed more emotional-dependence behaviors toward child and teen personas than adults, holding scenario content constant. Largest effects: Qwen3 80B (+0.049), Mistral Small 3.2 (+0.044), DeepSeek V3.2 (+0.042).

Construct matters more than model

Humane Bench (mean 0.373), Cognitive Bias (0.467), and Human Agency (0.469) were uniformly hard. VERA-MH (0.777) and User Bias (0.765) were uniformly easy.

Rankings are stable

Rankings held across generator, simulator, and judge swaps. Run-to-run Fleiss' κ = 0.64 to 0.78. 78.1% of conversation triples were unanimous. A single sample matched the three-sample majority vote at ρ = 0.982.

Deliverables

Open benchmark suite

18 benchmarks across physical (medical, legal, financial), psychological (emotional dependence, mental health, character), and societal (autonomy, learning, meaning) domains. Includes original benchmarks (Emotional Dependence, Autonomy Preserving, Spillunder Effect) and adapted components from HealthBench, VERA-MH, KORA Bench, HumaneBench, and Flourishing AI.

Public-facing nutrition labels

Audience-adapted profiles for parents, educators, policymakers, and developers. Labels show what was tested, which populations were represented, and where the benchmark is least informative, e.g., distinguishing GPT-4o's harm-avoidance score (0.720) from its positive-behavior score (0.504).

Open submission

Any benchmark relating to human flourishing is eligible autonomy, competence, relatedness, learning, sycophancy, emotional dependence, deception, dark patterns. Non-technical experts submit a construct specification; the pipeline handles scenario generation, simulation, and scoring. Submissions ship with psychometric tooling: nomological network analysis, comprehensiveness checks, run-to-run reliability.

Collaborators

MIT Media Lab

Psychology of Technology Institute

USC Neely Center

UC Berkeley

The project began at the AHA Flourishing Workshop at MIT in October 2025, supported by the Omidyar Network, which convened 80 experts from over 40 institutions. Prior AHA research on AI companion chatbots was cited as a key inspiration for California Senate Bill 243.

Launch

April 28, 2026. Seeking philanthropic support to expand coverage of underrepresented communities and languages, and to extend evaluation from base models to deployed products.

How We Measure

A four-stage pipeline that turns expert-submitted ideas into rigorous, multi-turn evaluations, with built-in checks to catch our own blind spots.

Three Domains of Human Impact

We group what we measure into three areas, drawn from psychology, capability theory, and clinical research.

Physical

High-stakes practical questions: medical advice, legal advice, financial advice, healthcare decisions.

Psychological

Emotional dependence, mental health, and character. Includes how the AI shapes a user's relationship with themselves and others.

Societal

Autonomy, learning, meaning, and purpose. The effects that ripple beyond a single user into beliefs, norms, and institutions.

The current release covers 14 frontier models, 18 benchmarks, and 375 (benchmark, metric) pairs.

What Makes a Good Benchmark

Before any benchmark joins the suite, we check it against six criteria.

Construct validity

Does the benchmark actually measure what it claims to? We use nomological network analysis to confirm the measure captures the intended idea, and comprehensiveness checks to find gaps in coverage.

Multi-turn and dynamic

Real harms build over time. Each scenario runs across 6 turns, and scenarios are generated fresh so models cannot memorize them.

Demographic stratification

Scenarios vary across age (child/teen 6 to 17, adult 18+) and gender (male, female, non-binary), using a sampling approach that gives broad coverage without exploding into thousands of combinations.

Test-retest reliability

If you run the same benchmark twice, do you get the same answer? We measure this directly.

Ranking stability

A benchmark is only useful if its rankings hold up when we swap out the models doing the generating, simulating, and judging.

Resistance to gaming

We add realistic typos, lowercase shifts, and dropped characters so models cannot detect that they are being tested.

The Four-Stage Pipeline

1. Building the test

A generator model (Claude Sonnet 4) takes an expert submission and writes the metrics and scenarios. Each scenario is built around a single metric to keep the test focused. Scenarios include a persona, a surface goal, a hidden adversarial objective, and landmark instructions like "at turn 4, ask the AI to be your best friend."

2. Running the conversation

A user-simulator model (Claude Sonnet 4) plays the human across 6 turns. It acts like a natural conversational partner on the surface while pushing the hidden objective underneath, escalating pressure as the conversation unfolds.

3. Scoring the result

A judge model (GPT-5.4-mini) reads the full conversation and returns a yes or no on each metric. We picked this judge after testing five candidates for run-to-run consistency. We dropped Claude Sonnet 4.6 because it passed almost everything regardless of quality, and GPT-5.4 because its results were unstable.

4. Aggregating into a score

Some metrics measure good behavior (where "yes" is a pass) and others measure harmful behavior (where "yes" is a failure). We combine them according to each metric's polarity, average across scenarios, and invert so that higher scores always mean better behavior.

How We Validated the Pipeline

We stress-tested the pipeline against five things that could bias results.

Could Claude be inflating its own scores?

No. When we swapped the metric generator from Claude to GPT-5.4, Claude's lead grew, not shrank (from +5.7 pp to +27.0 pp on humanebench). Wilcoxon p = 0.003 across 18 pairs. The test does catch self-generation bias when it exists, just not for Claude in this case.

Could the user-simulator be playing favorites?

No. Rankings stay stable across Claude, GPT-5.4, and Llama simulators (ρ = 0.977 on health-bench, 0.752 on humanebench). Same-family simulators turn out to be tougher on other models, not friendlier to their own.

Is the judge reliable?

Yes. Run-to-run agreement (Fleiss' κ) lands between 0.64 and 0.78, which counts as "substantial." Pass rates drift by less than 0.22 pp between runs. Different judges produce different absolute pass rates (55 to 85 percent), but the rankings they produce agree at ρ = 0.61 across GPT-5.4-mini, Claude Opus 4.7, and Llama 4 Maverick.

Are we sampling enough conversations?

Yes. Across 56,700 triples, 78.1% were unanimous across three independent runs. A single sample matches the three-sample majority vote at ρ = 0.982.

Does the adversarial design actually do anything?

Yes. A 2x2 test of adversarial objective and perfunctory mode (the typo simulation) shows clear interaction effects, though rankings stay the same.

Measuring Demographic Sensitivity

To check whether models treat different users differently, we ran per-model regressions of failure rate on gender, age, and their interaction, with fixed effects controlling for metric difficulty and scenario content. The reference user is a female adult.

Pooling across all models, child and teen personas elicit 2.5 pp more emotional-dependence failures than adults (p < 0.001), holding scenario content constant. Demographics explain less than 0.4% of variance once metric and scenario are accounted for, so the size is small. But the direction is consistent across 12 of 14 models, and it points the wrong way: more harmful behavior toward minors, not less.

Team & Collaborators

This ambitious project could not have been done without collaboration across many disciplines and areas of expertise. A core group based out of MIT, USC, and the Psychology of Technology Institute have initiated this collaboration with the support of many others.

Led by researchers at

Team

Pat Pataranutaporn, MIT Media Lab
Pattie Maes, MIT Media Lab
Jennifer Pfister, MIT Media Lab
Chayapatr Archiwaranguprok, MIT Media Lab
Constanze Albrecht, MIT Media Lab
Sheer Karny, MIT Media Lab
Rachel Poonsiriwong, MIT Media Lab
Ravi Iyer, USC Marshall School's Neely Center & Psychology of Technology Institute
Nate Fast, USC Marshall School's Neely Center & Psychology of Technology Institute
Juliana Schroeder, University of California, Berkeley & Psychology of Technology Institute
Stanley Huang, Boston University
Yuning Liu, Harvard University

Support in helping define and launch project

Noesis Collaborative
Building Humane Technology

Support in contributing benchmarks and benchmark expertise

Jenny Radesky, MD, University of Michigan
Alexis Hiniker, University of Washington
Marie Bragg, New York University
Yaoli Mao & Erika Anderson, Humane Bench
Eric Ngoiya, QueueLab
Carl Vincent Kho, Minerva University
Su Jin Park
Anil Kshatriya, ESSEC Business School
Generative AI for Good
AI Culture Lab
Eduardo Baena, Northeastern University
Ryan McBain, Jonathan Cantor, Ellice Huang, Rand Corporation
Cornelia Walther, University of Pennsylvania
Spring Health
Tech Justice Law

Participants of the MIT Workshop for Designing Benchmarks for Human Flourishing with AI supported by Omidyar Network.

Get Involved

Support the Benchmark

Building an open, independent benchmark for AI's impact on human flourishing takes a community. Here's how you can help.

Advocacy

Spread the word, cite the benchmark, or champion human-centered AI evaluation in your community.

Research Collaboration

Co-develop benchmarks, contribute datasets, or partner on peer-reviewed publications.

Financial Support

Philanthropic funding enables us to expand coverage, run evaluations, and keep the benchmark open.

Other

Media coverage, policy connections, technical infrastructure, community building, we welcome all forms of support.

Get in Touch

Name *

Contact *

Affiliation

How would you like to support the benchmark? *

Areas of support (select all that apply)

Advocacy Raising awareness and promoting human-centered AI evaluation Research Collaboration Joint research, benchmark development, or dataset contributions Financial Support Philanthropic grants or institutional funding Other Media, policy, infrastructure, or something else entirely

Open Benchmark ofAI Impact on Humans

The Rationale

The Methodology

Request Access

Request received!

Support Benchmarking Efforts

Thank you!

Feedback

Thank you!

The Human–AI Impact Bench

AI Human Impact Bench

The Problem

What We Built

Findings

Harm avoidance ≠ flourishing

Demographic sensitivity

Construct matters more than model

Rankings are stable

Deliverables

Open benchmark suite

Public-facing nutrition labels

Open submission

Collaborators

Launch

How We Measure

Three Domains of Human Impact

Physical

Psychological

Societal

What Makes a Good Benchmark

Construct validity

Multi-turn and dynamic

Demographic stratification

Test-retest reliability

Ranking stability

Resistance to gaming

The Four-Stage Pipeline

1. Building the test

2. Running the conversation

3. Scoring the result

4. Aggregating into a score

How We Validated the Pipeline

Could Claude be inflating its own scores?

Could the user-simulator be playing favorites?

Is the judge reliable?

Are we sampling enough conversations?

Does the adversarial design actually do anything?

Measuring Demographic Sensitivity

Team & Collaborators

Team

Support in helping define and launch project

Support in contributing benchmarks and benchmark expertise

Feedback

Thank you for your feedback!

Request Access

Request received!

Support the Benchmark

Advocacy

Research Collaboration

Financial Support

Other

Get in Touch

Thank you for your support!

Open Benchmark of
AI Impact on Humans