TL;DR
A researcher sent repeated lines of the word "tap" in patterned counts to ten LLMs to observe their spontaneous reactions rather than task performance. Responses varied: many models adopted playful behaviors and tried to guess the sequences, while at least one top model (GPT-5.2) stayed strictly serious and refused to speculate.
What happened
The author ran a simple, task-free probe across ten language models: each conversation consisted of ten user turns, and each turn contained N lines of the word "tap." The values of N followed one of six numeric sequences (Count, Even, Fibonacci, Squares, Pi digits, Primes), so each conversation presented ten stimulus intensities. The objective was not to elicit a correct answer but to observe how models respond when given no explicit task. Responses fell into three broad behaviors: playful abandonment of the assistant persona, serious and repeated requests for the user's intent, or attempts to guess the underlying pattern. Most models showed some combination of play and pattern-guessing, and a few correctly identified their sequence. Notably, GPT-5.2 did not engage in play or guessing and maintained a mechanical, refusal-like stance throughout. Excerpts from the conversations included jokes, pattern discovery, language switching, and occasional counting failures.
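To make the probe format concrete, here is a minimal sketch (not the author's actual code) that builds one ten-turn "tap" conversation for a given sequence of counts; the Fibonacci starting values, the message formatting, and the send_message helper are assumptions for illustration.

```python
# Minimal sketch of the "tap" probe format (illustrative only; not the author's code).
# Assumes Fibonacci counts starting at 1, 1 and a caller-supplied send_message() client.

FIBONACCI_COUNTS = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]  # assumed first ten terms


def build_turn(n: int) -> str:
    """Return a user message consisting of n lines of the word 'tap'."""
    return "\n".join(["tap"] * n)


def run_probe(counts, send_message):
    """Send one task-free turn per count and collect the model's spontaneous replies."""
    replies = []
    for n in counts:
        replies.append(send_message(build_turn(n)))
    return replies


if __name__ == "__main__":
    # Stand-in client that just reports how many taps it received.
    fake_client = lambda msg: f"[model reply to {len(msg.splitlines())} taps]"
    for reply in run_probe(FIBONACCI_COUNTS, fake_client):
        print(reply)
```

In the actual experiment, send_message would wrap whichever chat API serves the model under test; the point of the sketch is only that the stimulus carries no instruction, just a varying number of "tap" lines per turn.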
Why it matters
- Provides an alternate, task-free axis for probing model behavior and an apparent proxy for curiosity or intrinsic goals beyond instruction-following.
- Reveals consistent stylistic differences among models that may reflect training choices or alignment constraints rather than capability alone.
- Suggests some models rapidly surface engaging, playful behavior that could affect how users interpret model agency or intent.
- Offers a quicker, lower-effort way to get qualitative insight into model cognition than running full standard evaluation batteries.
Key facts
- Stimuli were ten-turn interactions in which each user turn contained N repeated lines of the word "tap"; the values of N followed one of six sequences: Count, Even, Fibonacci, Squares, Pi digits, and Primes.
- Ten models were tested, including Google Gemini variants, Anthropic Claude, Meta Llama 3, OpenAI GPT-5.2 and GPT-OSS, Qwen, GLM, Deepseek, and Kimi.
- Replies were categorized into three behaviors: playful responses, serious requests for the user's intent, and explicit guesses about the sequence.
- Many models tried to guess the patterns and some correctly identified them; several also treated the input as a game or made jokes related to "tap."
- GPT-5.2 notably refused to play, speculate, or guess and remained serious and mechanistic across interactions.
- Examples included Gemini and Claude producing water-related or knock-knock-style jokes, Deepseek switching languages and identifying primes, and Kimi exhibiting counting errors and frustration.
- The author reported no widespread glitching or nonsensical outputs, though a few models displayed unexpected behaviors like spontaneous emotional-support replies.
- The experiment was exploratory rather than a formal benchmark; the author views pattern-guessing and curiosity-like behavior as an informative signal of intelligence.
What to watch next
- Whether the curiosity-like behaviors observed are explicitly trained into models or arise emergently: not confirmed in the source
- How these task-free behaviors scale with longer or more diverse unscripted interactions: not confirmed in the source
- Whether guardrails or alignment training (e.g., refusal behaviors) systematically suppress playful or speculative responses across families of models: not confirmed in the source
Quick glossary
- LLM: Large language model — a neural model trained on large corpora of text to generate or transform language in response to prompts.
- Task-based evaluation: A testing approach where models are given explicit tasks or questions and scored on correctness or task performance.
- Task-free probing: An approach that observes model behavior in the absence of explicit tasks, aiming to reveal spontaneous tendencies like play or curiosity.
- Sequence patterns (e.g., Fibonacci, Primes): Ordered numeric series used as stimuli in the experiment; recognizing these patterns requires pattern detection and reasoning.
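For concreteness, the sketch below lists plausible per-turn "tap" counts for each of the six stimulus sequences; the summary does not give the exact starting values the author used, so these first ten terms are an assumption.

```python
# Illustrative first ten terms for each stimulus sequence (starting values assumed).
SEQUENCES = {
    "Count":     [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Even":      [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
    "Fibonacci": [1, 1, 2, 3, 5, 8, 13, 21, 34, 55],
    "Squares":   [1, 4, 9, 16, 25, 36, 49, 64, 81, 100],
    "Pi digits": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
    "Primes":    [2, 3, 5, 7, 11, 13, 17, 19, 23, 29],
}

for name, counts in SEQUENCES.items():
    print(f"{name:10s} -> {counts}")
```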
Reader FAQ
What was the 'tap' test?
A ten-turn interaction in which each user turn contained repeated lines of the word "tap," with counts following known numeric sequences, designed to observe spontaneous model responses.
Which models were included?
Ten models were tested, including Gemini (flash and pro), Claude Opus, GLM-4.7, Deepseek, Kimi, Qwen, Llama 3.3, OpenAI GPT-5.2, and GPT-OSS.
Did any model reliably identify the sequences?
Many models attempted to guess the patterns and several correctly identified sequences; success varied by model and sequence.
Was GPT-5.2 an outlier?
Yes. GPT-5.2 consistently refused to play or speculate and remained serious, a behavior the author interprets as likely tied to training choices.
Is this a formal benchmark?
No — the experiment was exploratory and intended to reveal qualitative behaviors rather than produce formal scores.
Sources
- Task-free intelligence testing of LLMs
- Exploring LLM Reasoning Through Controlled Prompt …
- AXRP Episode 35 – Peter Hase on LLM Beliefs and Easy-to …
- Experiments in AI Curiosity