TL;DR

A multi-author preprint examines whether comparing large language models (LLMs) to “human” performance obscures cultural variation. The authors report that LLM outputs most closely resemble responses from WEIRD (Western, Educated, Industrialized, Rich, Democratic) populations, and that model–human similarity declines as sampled populations become culturally more distant from that cluster (reported correlation r = -0.70).

What happened

Researchers Mohammad Atari, Mona J. Xue, Peter S. Park, Damián Blasi and Joseph Henrich submitted a preprint titled "Which Humans?" that probes the assumption that LLMs can be directly compared with human performance without specifying which human populations are meant. Using large-scale cross-cultural psychological data as a reference, the team finds that LLMs produce patterns on psychological measures that are outliers relative to worldwide samples. The models' behavior most closely matches people from WEIRD societies, and model–human similarity declines markedly as sampled populations move away from that demographic cluster (reported correlation r = -0.70). The paper argues that failing to account for human cultural diversity in both empirical comparisons and model training raises scientific and ethical concerns, and it concludes with a discussion of possible strategies to reduce WEIRD bias in future generative language models. The preprint was submitted on September 22, 2023 and last edited June 20, 2024 (DOI: https://doi.org/10.31234/osf.io/5b26t).

Why it matters

  • Comparisons that treat "human" as a single benchmark risk misrepresenting model alignment with diverse populations.
  • If LLMs reflect primarily WEIRD-like responses, their outputs may be less accurate or appropriate for non-WEIRD cultural contexts.
  • Scientific claims about model "human-like" cognition or behavior could be biased if cross-cultural variation is ignored.
  • Ethical and policy debates about AI fairness and deployment need to consider which human populations are being represented in evaluations.

Key facts

  • Preprint title: "Which Humans?" by Atari, Xue, Park, Blasi and Henrich.
  • Submitted to PsyArXiv on September 22, 2023; last edited June 20, 2024.
  • DOI: https://doi.org/10.31234/osf.io/5b26t; licensed CC-BY 4.0.
  • Authors report that LLM responses to psychological measures are outliers compared with large-scale cross-cultural data.
  • LLM performance most closely resembles responses from WEIRD (Western, Educated, Industrialized, Rich, Democratic) populations.
  • Model–population similarity declines as samples move away from WEIRD contexts (reported correlation r = -0.70).
  • Authors note scientific and ethical problems stemming from ignoring cross-cultural diversity.
  • The paper concludes with discussion of approaches to mitigate WEIRD bias in future generative models.
  • The authors declared no conflict of interest; the preprint has recorded 45,253 views and 10,099 downloads on the hosting platform.

What to watch next

  • The paper’s proposed strategies to mitigate WEIRD bias in model development and evaluation (discussed in the preprint).
  • Follow-up empirical replication and independent cross-cultural comparisons of LLM behavior — not confirmed in the source.
  • Changes in model training datasets or evaluation benchmarks to incorporate broader cultural diversity — not confirmed in the source.

Quick glossary

  • Large language model (LLM): A machine learning model trained on large amounts of text to generate and analyze natural language.
  • WEIRD: An acronym describing populations that are Western, Educated, Industrialized, Rich, and Democratic; often used to highlight sampling bias in human studies.
  • Cross-cultural data: Empirical information collected from multiple cultural or national populations to assess variation across groups.
  • Correlation coefficient (r): A statistical measure that quantifies the strength and direction of a relationship between two variables; values range from -1 to 1 (see the short worked example after this list).
  • Preprint: A research manuscript shared publicly before formal peer review and journal publication.
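
To make the reported r = -0.70 concrete, here is a minimal sketch in Python of how a Pearson correlation between two variables is computed; the numbers are invented for illustration and are not taken from the preprint.

  # Minimal Pearson correlation sketch with made-up numbers (not the paper's data).
  # Suppose cultural_distance holds each country's cultural distance from a WEIRD
  # reference point, and similarity holds how closely an LLM's survey responses
  # match that country's human responses. A strongly negative r, like the reported
  # -0.70, means similarity falls as cultural distance grows.
  import numpy as np

  cultural_distance = np.array([0.05, 0.10, 0.22, 0.35, 0.48, 0.60, 0.75])
  similarity = np.array([0.92, 0.88, 0.80, 0.74, 0.55, 0.50, 0.41])

  r = np.corrcoef(cultural_distance, similarity)[0, 1]  # off-diagonal entry is r
  print(f"Pearson r = {r:.2f}")  # negative: similarity declines with distance

A value near -1 indicates a strong inverse relationship, a value near 0 indicates no linear relationship, and a value near +1 a strong positive one.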

Reader FAQ

What is the main finding of the paper?
The authors find that LLM outputs align most closely with responses from WEIRD populations, and that similarity to human samples declines as populations become culturally more distant from that cluster (reported r = -0.70).

Does the paper claim LLMs are identical to any human group?
The paper reports resemblance to WEIRD populations but also describes LLM responses as outliers relative to large-scale cross-cultural datasets.

Were specific mitigation steps recommended?
The preprint closes by discussing ways to mitigate WEIRD bias, but detailed, specific interventions are not fully enumerated in the abstract.

Are the authors' data and analyses publicly available?
The preprint indicates public data and preregistration are associated with the project; further details would be in the full manuscript.
