TL;DR
Researchers propose DatBench, a curated evaluation approach for vision-language models (VLMs) guided by three desiderata: faithfulness, discriminability, and compute efficiency. They identify common failure modes in existing benchmarks, clean and transform datasets, and release DatBench-Full (33 datasets) plus a smaller DatBench subset that yields large speedups while retaining discriminative power.
What happened
A multi-author study examined how current empirical evaluations for vision-language models can mislead researchers and consume disproportionate compute. The authors articulate three criteria that good evaluations should meet: be faithful to the modality and intended tasks, distinguish models of different quality, and be efficient to run. They investigate widespread issues: multiple-choice formats that encourage guessing and saturate quickly, questions solvable without the image (up to 70% of examples in some evaluations), and mislabeled or ambiguous items (up to 42% of examples in certain sets). By transforming multiple-choice items into generative prompts and filtering out examples that are answerable without the image or otherwise problematic, the team improved signal quality and reduced evaluation cost. The effort produced DatBench-Full, a cleaned suite of 33 datasets covering nine VLM capabilities, and DatBench, a much smaller discriminative subset that matches the full suite's discriminative power while offering roughly 13x average speedups (up to 50x). The paper also notes that evaluation can account for a substantial share of development compute, by some estimates nearly one-fifth.
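To make the two curation steps concrete, here is a minimal sketch of how multiple-choice items might be recast as generative prompts and how items solvable without the image might be filtered out. The paper's actual pipeline is not described in this summary, so every name below (Example, to_generative, blind_model_answer) is a hypothetical illustration of the general idea, not the authors' implementation.

```python
# Illustrative sketch only: the paper's actual transformation and filtering
# procedures are not specified in this summary. All names below are hypothetical.

from dataclasses import dataclass


@dataclass
class Example:
    question: str
    choices: list[str]   # multiple-choice options
    answer_idx: int      # index of the correct option
    image_path: str


def to_generative(ex: Example) -> dict:
    """Recast a multiple-choice item as a free-form (generative) prompt.

    Hiding the options removes the chance to guess or exploit option wording;
    the model must produce the answer text itself.
    """
    return {
        "prompt": f"Look at the image and answer concisely: {ex.question}",
        "target": ex.choices[ex.answer_idx],
        "image_path": ex.image_path,
    }


def is_blind_solvable(ex: Example, blind_model_answer) -> bool:
    """Flag items a text-only baseline answers correctly without the image.

    `blind_model_answer(question, choices) -> int` is a hypothetical callable
    wrapping a language-only model; items it solves are candidates for removal.
    """
    return blind_model_answer(ex.question, ex.choices) == ex.answer_idx


def curate(examples: list[Example], blind_model_answer) -> list[dict]:
    """Drop blind-solvable items, then convert the rest to generative form."""
    kept = [ex for ex in examples if not is_blind_solvable(ex, blind_model_answer)]
    return [to_generative(ex) for ex in kept]


if __name__ == "__main__":
    demo = [Example("What color is the bus?", ["red", "blue"], 0, "bus.jpg")]
    always_first = lambda q, choices: 0  # stand-in for a text-only baseline
    print(curate(demo, always_first))    # [] here, since the stand-in "solves" the item
```

In practice the filtering criterion and prompt templates would come from the paper itself; the point here is only the shape of the pipeline: remove items answerable without the image, then score the remainder in a free-form, generative format.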
Why it matters
- Evaluation choices shape perceived progress; biased or noisy tests can overstate model capabilities.
- Removing examples that don’t require images or that are mislabeled yields more accurate comparisons between models.
- Converting multiple-choice tasks to generative formats can reveal real performance gaps that earlier formats hid.
- Faster, more discriminative evaluation reduces compute spent on testing and supports more sustainable model development.
Key facts
- The paper defines three desiderata for VLM evaluation: faithfulness, discriminability, and efficiency.
- Multiple-choice formats can reward guessing and often stop differentiating models as they improve.
- In certain benchmark suites, up to 70% of examples can be solved without viewing the image.
- Mislabeled or ambiguous items compromise up to 42% of examples in some datasets.
- Converting multiple-choice questions to generative tasks exposed capability drops as large as 35%.
- Rather than discarding existing benchmarks, the authors curated them through transformation and filtering.
- DatBench-Full is a cleaned evaluation suite comprising 33 datasets spanning nine VLM capabilities.
- DatBench is a smaller, discriminative subset that achieves about a 13x average speedup (up to 50x in some cases) while closely matching the full suite's discriminative power; one way to interpret that matching is sketched after this list.
- By some estimates cited in the work, evaluation has consumed nearly 20% of development compute in recent practice.
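The paper's exact measure of discriminative power is not given in this summary. One plausible, purely illustrative reading of "matching discriminative power" is that the small subset ranks models the same way the full suite does, which can be checked with a rank correlation such as Kendall's tau. The model names and scores below are made up.

```python
# Illustrative sketch: check whether a benchmark subset preserves the model
# ranking produced by the full suite. This is one plausible reading of
# "matching discriminative power"; the paper's actual metric may differ.

from scipy.stats import kendalltau

# Hypothetical accuracies for the same models on the full suite and on a
# much smaller subset of it.
full_suite_scores = {"model_a": 0.71, "model_b": 0.64, "model_c": 0.58}
subset_scores     = {"model_a": 0.69, "model_b": 0.66, "model_c": 0.55}

models = sorted(full_suite_scores)
tau, p_value = kendalltau(
    [full_suite_scores[m] for m in models],
    [subset_scores[m] for m in models],
)
# tau near 1.0 means the subset orders models the same way as the full suite,
# i.e. it retains the suite's ability to discriminate between models.
print(f"Kendall tau = {tau:.2f} (p = {p_value:.2f})")
```

A tau close to 1 would indicate the cheaper subset preserves the model ordering, which is the practical sense in which it could substitute for the full suite in model comparisons.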
What to watch next
- Community uptake and independent comparisons using the released DatBench-Full and DatBench (not confirmed in the source).
- Whether other benchmark maintainers adopt similar transformation and filtering procedures to improve fidelity (not confirmed in the source).
- Follow-up work on automating dataset curation and measuring long-term effects on research directions (not confirmed in the source).
Quick glossary
- Vision-Language Model (VLM): A machine learning model designed to process and reason about both visual inputs (images or video) and textual inputs.
- Discriminability: The ability of an evaluation to differentiate between models of different performance levels.
- Faithful evaluation: An assessment that measures capabilities relevant to the modality and intended downstream applications without confounding shortcuts.
- Multiple-choice format: An evaluation setup where a model selects the correct answer from a fixed set of options, which can sometimes permit guessing.
- Generative task: A task where the model must produce free-form text as an answer rather than selecting from provided choices.
Reader FAQ
What is DatBench?
A curated evaluation effort that includes DatBench-Full (33 cleaned datasets) and a smaller discriminative subset called DatBench.
What problems does DatBench aim to fix?
It targets issues that reduce evaluation fidelity—such as multiple-choice shortcuts, image-unnecessary questions, and mislabeled or ambiguous examples—and aims to improve discriminability and efficiency.
How much faster is the smaller DatBench subset?
The authors report about a 13x average speedup and up to 50x in some cases while maintaining similar discriminative power.
Have the datasets and code been made available?
The paper states that the authors release DatBench-Full and DatBench, but specific hosting details are not confirmed in the source.
Sources
- Siddharth Joshi, Haoli Yin, Rishabh Adiga, Ricardo Monti, Aldo Carranza, Alex Fang, et al., "DatBench: Discriminative, Faithful, and Efficient VLM Evaluations," arXiv (Computer Science > Machine Learning), submitted 5 Jan 2026
- Deep Learning Monitor