TL;DR

An individual researcher used PufferLib, extensive hyperparameter sweeps and curated curricula to train agents that outperform a terabyte-scale search baseline on 2048 and learn robust Tetris play. Key ingredients included very fast C-based environments, observation and reward engineering, endgame-focused curricula, and an LSTM policy.

What happened

Using PufferLib’s fast C-based environments and a disciplined sweep methodology, the author trained small neural policies that matched and then exceeded prior search-based results on 2048 and produced robust Tetris agents. The 2048 policy was about 15 MB, trained for roughly 75 minutes, and achieved a 65,536-tile rate of 14.75% and a 32,768-tile rate of 71.22% over 115k episodes, surpassing a prior search approach that relied on multi-terabyte endgame tables.

Training leveraged 1M+ environment steps per second per CPU core, two gaming desktops each with a single RTX 4090, and roughly 200 hyperparameter sweeps guided by Protein’s cost-outcome Pareto sampling. Success hinged on careful observation design (18 features per board cell), reward shaping (merge rewards, penalties, and monotonicity and snake incentives), and curriculum learning: scaffolding that spawns high tiles, plus endgame-only environments that start from pre-placed large tiles.

For Tetris, an accidental bug that caused next-piece one-hot encodings to persist across steps produced noisy early observations. That noise acted like a form of curriculum, increasing robustness, and led to deliberate external and internal curriculum strategies.
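
To make the observation design concrete, here is a minimal Python sketch of an 18-feature-per-cell encoding. The feature ordering, the one-hot width, the normalization cap, and the snake_mask input are assumptions of this sketch; the actual environment is implemented in C, and the post only names the feature categories.

```python
import numpy as np

# Hypothetical layout of the 18 features per board cell (the post lists the
# categories; this particular split and ordering is an assumption):
#   [0]     tile value normalized by an assumed maximum exponent
#   [1]     empty-cell flag
#   [2:17]  one-hot over tile exponents 1..15 (tiles 2 .. 32768)
#   [17]    snake-state flag for this cell
N_FEATURES = 18
MAX_EXPONENT = 16  # 2**16 = 65536; assumed normalization cap

def encode_board(exponents: np.ndarray, snake_mask: np.ndarray) -> np.ndarray:
    """exponents: 4x4 int array of log2(tile) values, 0 for empty cells.
    snake_mask: 4x4 binary array marking cells on the desired snake layout."""
    obs = np.zeros((4, 4, N_FEATURES), dtype=np.float32)
    obs[:, :, 0] = exponents / MAX_EXPONENT               # normalized tile value
    obs[:, :, 1] = (exponents == 0).astype(np.float32)    # empty-cell flag
    for r in range(4):
        for c in range(4):
            e = int(exponents[r, c])
            if 1 <= e <= 15:
                obs[r, c, 1 + e] = 1.0                    # one-hot tile value
    obs[:, :, 17] = snake_mask                            # snake-state flag
    return obs
```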

Why it matters

  • Small, well-crafted policies with targeted curricula can exceed solutions that rely on massive search tables.
  • Extremely fast simulators enable systematic hyperparameter sweeps and rapid iteration on design choices.
  • Curriculum learning lets agents experience rare, high-value states that are impractical to reach by unguided play (see the sketch after this list).
  • Accidental perturbations or noise can create effective robustness curricula, suggesting new ways to design training regimes.
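
To make the curriculum point concrete, the sketch below shows one way an endgame-focused environment could reset from a pre-built late-game board instead of an empty one (these endgame environments are listed under Key facts). The specific tiles, their placement, and the function itself are illustrative assumptions, not PufferLib's actual environment code.

```python
import numpy as np

def endgame_reset(rng: np.random.Generator) -> np.ndarray:
    """Start a 2048 episode from a hand-built late-game position rather than
    an empty board. Tiles are stored as log2 exponents (15 -> 32768)."""
    board = np.zeros((4, 4), dtype=np.int64)
    # Pre-place a descending run of large tiles along the top row
    # (illustrative choice of exponents, not the author's exact setup).
    board[0] = [15, 14, 13, 12]          # 32768, 16384, 8192, 4096
    # Add two small random tiles, as a normal spawn would.
    empties = list(zip(*np.where(board == 0)))
    for i in rng.choice(len(empties), size=2, replace=False):
        r, c = empties[i]
        board[r, c] = 1 if rng.random() < 0.9 else 2   # spawn a 2 or a 4
    return board
```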

Key facts

  • PufferLib environments run at over 1M steps per second per CPU core (C-based implementation).
  • Training used two high-end gaming desktops, each with a single RTX 4090 (compute sponsored by Puffer.ai).
  • About 200 hyperparameter sweeps were performed, guided by Protein’s Pareto-sampling approach.
  • The 2048 policy was ~15 MB and trained for roughly 75 minutes, evaluated across 115,000 episodes.
  • 2048 results: 14.75% success rate for the 65,536 tile and 71.22% for the 32,768 tile.
  • Previous search-based endgame tables reached 65,536 at an 8.4% rate but required a few terabytes of tables.
  • Observation encoding used 18 features per cell of the 4×4 board (normalized tile values, an empty-cell flag, one-hot tile values, and a snake-state flag).
  • Reward shaping included merge rewards (weight 0.0625), an invalid-move penalty (-0.05), a game-over penalty (-1.0), and small monotonicity incentives (weight 0.00003); see the sketch after this list.
  • Policy architecture for 2048: ~3.7M parameters with an LSTM layer (encoder: three FC layers with GELU).
  • Curriculum techniques included scaffolding spawns of high tiles and endgame-only environments that begin from pre-placed large tiles.
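
The reward terms above map naturally onto a per-step shaping function. The sketch below uses the reported weights, but the function signature, the merge_value and monotonicity_score inputs, and how the original C environment actually combines the terms (including the separate snake incentive) are assumptions.

```python
# Reported shaping weights (from the post); everything else is illustrative.
MERGE_WEIGHT         = 0.0625
INVALID_MOVE_PENALTY = -0.05
GAME_OVER_PENALTY    = -1.0
MONOTONICITY_WEIGHT  = 0.00003

def shaped_reward(merge_value: float,
                  invalid_move: bool,
                  game_over: bool,
                  monotonicity_score: float) -> float:
    """merge_value: assumed to summarize the tiles merged this step.
    monotonicity_score: assumed scalar rating how ordered the board is."""
    reward = MERGE_WEIGHT * merge_value                  # reward merges
    if invalid_move:
        reward += INVALID_MOVE_PENALTY                   # discourage no-op moves
    if game_over:
        reward += GAME_OVER_PENALTY                      # terminal penalty
    reward += MONOTONICITY_WEIGHT * monotonicity_score   # small ordering bonus
    return reward
```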

What to watch next

  • Deeper networks, up to extreme-depth architectures, as a possible path toward higher tiles (not confirmed in the source as a guaranteed solution).
  • Automated curriculum-discovery approaches such as Go-Explore to find stepping stones to very large tiles.
  • Comparative evaluation of external (garbage-line injection) versus internal (observation-noise decay) curricula in Tetris; the author reports mixed outcomes and does not declare a clear winner. A minimal sketch of the noise-decay idea follows this list.
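
For intuition on the internal-curriculum idea, here is a minimal sketch of an observation-noise schedule that decays over training, in the spirit of the accidental Tetris bug. The noise form, the linear decay, and all parameter names are assumptions rather than the author's implementation.

```python
import numpy as np

def noise_scale(step: int, total_steps: int, start: float = 0.3) -> float:
    """Linearly decay observation noise from `start` to 0 over training."""
    return start * max(0.0, 1.0 - step / total_steps)

def noisy_observation(obs: np.ndarray, step: int, total_steps: int,
                      rng: np.random.Generator) -> np.ndarray:
    """Corrupt a fraction of observation entries, with the fraction shrinking
    over training so the curriculum eases toward clean observations."""
    scale = noise_scale(step, total_steps)
    mask = rng.random(obs.shape) < scale      # which entries to perturb
    noise = rng.random(obs.shape)             # replacement values in [0, 1)
    return np.where(mask, noise, obs)
```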

Quick glossary

  • Curriculum learning: A training strategy that controls the order and difficulty of experiences presented to an agent so it can learn complex behaviors in stages.
  • Endgame table: A precomputed lookup or dataset that stores optimal actions or values for late-game states; can be very large for combinatorial games.
  • LSTM: Long Short-Term Memory, a type of recurrent neural network layer that helps models remember information over long sequences.
  • Pareto sweep: A hyperparameter search strategy that samples points along the cost-outcome Pareto front to find efficient trade-offs between compute cost and performance.
  • Reward shaping: Modifying the reward signal in reinforcement learning to emphasize desired behaviors and speed up learning.

Reader FAQ

Did a small policy really outperform terabyte-scale search on 2048?
Yes; the reported 15 MB policy achieved a 14.75% 65,536-tile rate versus an 8.4% rate from a prior few-TB endgame-table approach.

What hardware and runtime were used for training?
Training ran on two gaming desktops with single RTX 4090 GPUs; the 2048 policy trained for about 75 minutes and evaluations covered 115k episodes.

Was the Tetris robustness trick intentional?
No. A bug that caused next-piece one-hot encodings to persist introduced noisy observations; the author later reproduced the effect deliberately as a curriculum mechanism.
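
For intuition, the sketch below shows how such a bug can arise: if the observation buffer is reused across steps and the next-piece one-hot is only ever set, never cleared, stale bits accumulate into a noisy multi-hot signal. The buffer layout and function are illustrative, not the original code.

```python
import numpy as np

NUM_PIECES = 7  # standard Tetris piece types

def write_next_piece(obs_buffer: np.ndarray, next_piece: int) -> None:
    """Buggy version: sets the one-hot bit for the current next piece but
    never clears bits written on earlier steps, so stale entries persist."""
    # A fixed version would first clear the slice: obs_buffer[:] = 0.0
    obs_buffer[next_piece] = 1.0

# After a few steps the buffer accumulates old pieces:
buf = np.zeros(NUM_PIECES, dtype=np.float32)
for piece in (2, 5, 0):
    write_next_piece(buf, piece)
print(buf)  # [1. 0. 1. 0. 0. 1. 0.] -- three bits set instead of one
```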

Are the exact reproduction steps and datasets provided?
Code and training commands are referenced (PufferLib and puffer train commands), but full reproducibility details beyond that are not confirmed in the source.

Sources

  • Scaffolding to Superhuman: How Curriculum Learning Solved 2048 and Tetris (blog post, December 29, 2025; tags: rl, pufferlib, curriculum-learning, 2048, tetris)
