TL;DR
A researcher sent repeated lines of the word "tap" in patterned counts to ten LLMs to observe their spontaneous reactions rather than task performance. Responses varied: many models adopted playful behaviors and tried to guess the sequences, while at least one top model (GPT-5.2) stayed strictly serious and refused to speculate.
What happened
The author ran a simple, task-free probe across ten language models: each conversation consisted of ten user turns, and each turn contained N lines of the word "tap." The values of N followed one of six numeric sequences (Count, Even, Fibonacci, Squares, Pi digits, Primes), so each conversation presented ten stimulus intensities. The objective was not to elicit a correct answer but to observe how models respond when given no explicit task. Responses fell into three broad behaviors: playful abandonment of the assistant persona, serious and repeated requests for the user's intent, or attempts to guess the underlying pattern. Most models showed some combination of play and pattern-guessing, and a few correctly identified their sequence. Notably, GPT-5.2 did not engage in play or guessing and maintained a mechanical, refusal-like stance throughout. Excerpts from the conversations included jokes, pattern discovery, language switching, and occasional counting failures.
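To make the probe format concrete, here is a minimal sketch (not the author's actual code) that builds one ten-turn "tap" conversation for a given sequence of counts; the Fibonacci starting values, the message formatting, and the send_message helper are assumptions for illustration.

```python
# Minimal sketch of the "tap" probe format (illustrative only; not the author's code).
# Assumes Fibonacci counts starting at 1, 1 and a caller-supplied send_message() client.

FIBONACCI_COUNTS = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]  # assumed first ten terms


def build_turn(n: int) -> str:
    """Return a user message consisting of n lines of the word 'tap'."""
    return "\n".join(["tap"] * n)


def run_probe(counts, send_message):
    """Send one task-free turn per count and collect the model's spontaneous replies."""
    replies = []
    for n in counts:
        replies.append(send_message(build_turn(n)))
    return replies


if __name__ == "__main__":
    # Stand-in client that just reports how many taps it received.
    fake_client = lambda msg: f"[model reply to {len(msg.splitlines())} taps]"
    for reply in run_probe(FIBONACCI_COUNTS, fake_client):
        print(reply)
```

In the actual experiment, send_message would wrap whichever chat API serves the model under test; the point of the sketch is only that the stimulus carries no instruction, just a varying number of "tap" lines per turn.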
Why it matters
- Provides an alternate, task-free axis for probing model behavior and an apparent proxy for curiosity or intrinsic goals beyond instruction-following.
- Reveals consistent stylistic differences among models that may reflect training choices or alignment constraints rather than capability alone.
- Suggests some models rapidly surface engaging, playful behavior that could affect how users interpret model agency or intent.
- Offers a quicker, lower-effort way to get qualitative insight into model cognition than running full standard evaluation batteries.
Key facts
- Stimuli were ten-turn interactions in which each user turn contained N repeated lines of the word "tap"; the values of N followed one of six sequences: Count, Even, Fibonacci, Squares, Pi digits, and Primes.
- Ten models were tested, including Google Gemini variants, Anthropic Claude, Meta Llama 3, OpenAI GPT-5.2 and GPT-OSS, Qwen, GLM, Deepseek, and Kimi.
- Replies were categorized into three behaviors: playful responses, serious requests for the user's intent, and explicit guesses about the sequence.
- Many models tried to guess the patterns and some correctly identified them; several also treated the input as a game or made jokes related to "tap."
- GPT-5.2 notably refused to play, speculate, or guess and remained serious and mechanistic across interactions.
- Examples included Gemini and Claude producing water-related or knock-knock-style jokes, Deepseek switching languages and identifying primes, and Kimi exhibiting counting errors and frustration.
- The author reported no widespread glitching or nonsensical outputs, though a few models displayed unexpected behaviors like spontaneous emotional-support replies.
- The experiment was exploratory rather than a formal benchmark; the author views pattern-guessing and curiosity-like behavior as an informative signal of intelligence.
What to watch next
- Whether the curiosity-like behaviors observed are explicitly trained into models or arise emergently: not confirmed in the source
- How these task-free behaviors scale with longer or more diverse unscripted interactions: not confirmed in the source
- Whether guardrails or alignment training (e.g., refusal behaviors) systematically suppress playful or speculative responses across families of models: not confirmed in the source
Quick glossary
- LLM: Large language model — a neural model trained on large corpora of text to generate or transform language in response to prompts.
- Task-based evaluation: A testing approach where models are given explicit tasks or questions and scored on correctness or task performance.
- Task-free probing: An approach that observes model behavior in the absence of explicit tasks, aiming to reveal spontaneous tendencies like play or curiosity.
- Sequence patterns (e.g., Fibonacci, Primes): Ordered numeric series used as stimuli in the experiment; recognizing these patterns requires pattern detection and reasoning.
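For concreteness, the sketch below lists plausible per-turn "tap" counts for each of the six stimulus sequences; the summary does not give the exact starting values the author used, so these first ten terms are an assumption.

```python
# Illustrative first ten terms for each stimulus sequence (starting values assumed).
SEQUENCES = {
    "Count":     [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Even":      [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
    "Fibonacci": [1, 1, 2, 3, 5, 8, 13, 21, 34, 55],
    "Squares":   [1, 4, 9, 16, 25, 36, 49, 64, 81, 100],
    "Pi digits": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
    "Primes":    [2, 3, 5, 7, 11, 13, 17, 19, 23, 29],
}

for name, counts in SEQUENCES.items():
    print(f"{name:10s} -> {counts}")
```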
Reader FAQ
What was the 'tap' test?
A ten-turn interaction in which each user turn contained repeated lines of the word "tap," with counts following known numeric sequences, designed to observe spontaneous model responses.
Which models were included?
Ten models were tested, including Gemini (flash and pro), Claude Opus, GLM-4.7, Deepseek, Kimi, Qwen, Llama 3.3, OpenAI GPT-5.2, and GPT-OSS.
Did any model reliably identify the sequences?
Many models attempted to guess the patterns and several correctly identified sequences; success varied by model and sequence.
Was GPT-5.2 an outlier?
Yes. GPT-5.2 consistently refused to play or speculate and remained serious, a behavior the author interprets as likely tied to training choices.
Is this a formal benchmark?
No — the experiment was exploratory and intended to reveal qualitative behaviors rather than produce formal scores.
Sources
- Task-free intelligence testing of LLMs
- Exploring LLM Reasoning Through Controlled Prompt …
- AXRP Episode 35 – Peter Hase on LLM Beliefs and Easy-to …
- Experiments in AI Curiosity