TL;DR
TimeCapsuleLLM is an LLM trained from scratch on texts drawn exclusively from London between 1800 and 1875 to emulate period voice and avoid modern influence. Multiple model builds (v0 through v2 variants) show progressive improvements in coherence and historical recall; a 15GB sample of the v2 corpus is available while the full 90GB remains untokenized.
What happened
A public project called TimeCapsuleLLM builds language models trained solely on historical material from a specific place and period, in this case London between 1800 and 1875. The author trained several iterations, from a tiny v0 to larger builds (v0.5, v1 and v2 mini-evaluations), reporting that early versions captured archaic vocabulary and style but often produced incoherent sentences, OCR artifacts and factual hallucinations. Later builds show clearer Victorian prose and, in v1, an ability to connect a year to a historical event and named figures. The dataset described for v2 totals 90GB across 136,344 documents, though the full set is not yet tokenized; a 15GB sample is published on Hugging Face. The repository includes tokenizer scripts, dataset lists, training notes, a bias report reference and model checkpoints or links, and the code is available under an MIT license on GitHub.
Why it matters
- Tests an approach dubbed Selective Temporal Training (STT) that constrains training data to a narrow historical window to shape linguistic style and knowledge (a data-curation sketch follows this list).
- Offers a reproducible path for researchers wanting models that reflect historical language without modern pretraining contamination.
- Highlights practical issues for historical corpora: OCR errors, tokenization challenges and a high rate of factual hallucination in earlier builds.
- Makes data, tooling and training notes available publicly (GitHub, Hugging Face), aiding transparency and follow-up work.
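The source does not include the project's curation code, but the core of STT is a date- and place-based filter over document metadata. A minimal sketch, assuming a simple Document record with year and city fields (the names and example entries are illustrative, not taken from the repository):

```python
from dataclasses import dataclass

@dataclass
class Document:
    title: str
    city: str
    year: int
    text: str

def in_time_capsule(doc: Document, place: str = "London",
                    start: int = 1800, end: int = 1875) -> bool:
    """Keep only documents from the chosen place and historical window."""
    return doc.city == place and start <= doc.year <= end

# Illustrative corpus: the 1850 periodical is kept, the 1998 style guide excluded.
corpus = [
    Document("Household Words, Vol. I", "London", 1850, "..."),
    Document("A modern style guide", "London", 1998, "..."),
]
training_set = [d for d in corpus if in_time_capsule(d)]
print([d.title for d in training_set])  # ['Household Words, Vol. I']
```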
Key facts
- Project goal: train LLMs from scratch on texts exclusively from 1800–1875 London to reduce modern bias.
- Version progression: v0 → v0.5 → v1 → v2 mini-evaluations, showing incremental improvements in grammar and historical recall.
- Dataset v2 totals 90GB spanning 136,344 documents; the complete 90GB is not tokenized yet and a 15GB sample is linked on Hugging Face.
- Model sizes reported: v0 ≈ 16M params, v0.5 ≈ 123M, v1 ≈ 700M, v2mini-eval1 ≈ 300M.
- Training data quantities by release: v0 ≈ 187MB, v0.5 ≈ 435MB, v1 ≈ 6.25GB, v2mini-eval1 ≈ 15GB.
- Hardware used: early builds on a GeForce RTX 4060 system; later builds (v1, v2 mini) used rented A100 SXM GPUs.
- Output limitations noted: high factual hallucination rates in early models and persistent OCR noise (e.g., 'Digitized by Google' artifacts; a cleanup sketch follows this list).
- Repository contains tokenizer training scripts, dataset lists, example outputs, a bias report reference and an MIT license.
- Project page and artifacts are hosted on GitHub; a Hugging Face dataset link is provided for the 15GB sample.
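The repository's actual preprocessing scripts are not reproduced in the source, but the kind of cleanup implied by the 'Digitized by Google' artifacts can be sketched with a few regular expressions. The patterns and function name below are illustrative assumptions, not the project's code:

```python
import re

# Illustrative artifact patterns for digitized 19th-century scans
# (assumptions, not the project's actual preprocessing rules).
WATERMARK = re.compile(r"Digitized by Google", re.IGNORECASE)  # scanner watermark seen in outputs
HYPHEN_BREAK = re.compile(r"-\n(?=\w)")                        # "Metro-\npolis" -> "Metropolis"
EXTRA_SPACE = re.compile(r"[ \t]{2,}")                         # runs of spaces from column layouts

def clean_page(text: str) -> str:
    """Strip common OCR artifacts from one scanned page."""
    text = WATERMARK.sub("", text)
    text = HYPHEN_BREAK.sub("", text)
    text = EXTRA_SPACE.sub(" ", text)
    return text.strip()

print(clean_page("The Metro-\npolis   grew rapidly.\nDigitized by Google"))
# -> "The Metropolis grew rapidly."
```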
What to watch next
- Completion and public release of the fully tokenized 90GB v2 corpus (the full set is currently not tokenized).
- Publication of the v2 bias report findings and any formal evaluations of historical fidelity and factual accuracy (not confirmed in the source).
- Broader community evaluations or deployments of TimeCapsuleLLM and follow-on projects using Selective Temporal Training (not confirmed in the source).
Quick glossary
- Selective Temporal Training (STT): A training approach that restricts all training data to a specific historical time window to model the language and knowledge of that era.
- Tokenizer: A component that converts raw text into token IDs the model can process; custom tokenizers are commonly built for specialized corpora (a tokenizer-training sketch follows this glossary).
- OCR noise: Errors and artifacts introduced when printed materials are digitized with optical character recognition, often requiring cleanup before training.
- Hallucination: When a language model asserts details or facts not supported by its training data or reality.
- Fine-tuning vs. training from scratch: Fine-tuning adapts a pre-trained model to new data; training from scratch builds a model solely from the provided corpus without inheriting prior knowledge.
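To make the last distinction concrete, here is a minimal sketch using the Hugging Face transformers library, assuming a GPT-2-style architecture purely for illustration (the source does not specify the project's training stack). Training from scratch starts from randomly initialized weights shaped only by a config, while fine-tuning starts from weights already trained on modern text:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Architecture template only; "gpt2" here is an illustrative assumption.
config = AutoConfig.from_pretrained("gpt2")

# Training from scratch: random weights, no inherited modern knowledge.
scratch_model = AutoModelForCausalLM.from_config(config)

# Fine-tuning: start from weights pretrained on modern web text,
# which is exactly the contamination TimeCapsuleLLM tries to avoid.
pretrained_model = AutoModelForCausalLM.from_pretrained("gpt2")
```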
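The project ships its own tokenizer scripts, which are not shown in the source; a custom tokenizer for a period corpus is typically trained along these lines with the Hugging Face tokenizers library (the file names, vocabulary size and special tokens below are assumptions):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-pair-encoding tokenizer trained only on the historical corpus,
# so the vocabulary reflects 19th-century spelling and usage.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # illustrative size
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train(files=["london_1800_1875.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("timecapsule_tokenizer.json")

print(tokenizer.encode("The omnibus rattled along the Strand.").tokens)
```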
Reader FAQ
What is Selective Temporal Training?
It is the method of curating and using only data from a defined historical time period so the model embodies that era’s language and context.
Is the full 90GB v2 dataset available?
The full 90GB is not yet tokenized; a 15GB sample of the v2 London corpus is available on Hugging Face.
Where can I access the code and model artifacts?
The project repository and links (including a Hugging Face dataset link) are published on GitHub under the project name TimeCapsuleLLM.
Are these models validated for modern production use?
Not confirmed in the source.
What license covers the repository?
The repository states an MIT license.
Sources
- TimeCapsuleLLM: LLM trained only on data from 1800-1875
- Training a Time‑Locked LLM on 1800s London Texts
- College student's “time travel” AI experiment accidentally …