TL;DR

Johnny’s Software Lab has collected 18 posts that examine ways to speed up software by making better use of the memory subsystem. Topics range from data layout and access patterns to low-latency design, TLB behavior, prefetching, and multithreading, and the author offers consulting and training.

What happened

Johnny’s Software Lab has published a series of 18 blog posts on memory subsystem optimizations: techniques intended to make programs faster by better leveraging caches, translation lookaside buffers (TLBs), allocators, and related hardware and OS facilities. The collection groups practical topics including reducing the total number of memory accesses, reorganizing data access patterns to improve cache locality, revising class and data-structure layouts, and modifying memory layout via custom allocators. Other posts cover instruction-level parallelism, explicit software prefetching for random access patterns, reducing TLB misses (including through huge pages), conserving memory-subsystem bandwidth, interactions between branch prediction and caches, the effects of multithreading on memory behavior, and strategies for latency-sensitive applications. The series also includes guidance on measuring memory-subsystem performance and a catch-all post for miscellaneous topics. The site invites readers to request help with performance problems or vectorization training and to follow its social channels for updates.
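
To give a flavor of the access-pattern topic, here is a minimal, self-contained C++ sketch (an illustration under generic assumptions, not code from the series) showing how traversal order affects cache locality for a row-major matrix:

    #include <cstddef>
    #include <vector>

    // Row-major matrix: element (i, j) lives at m[i * cols + j].
    // Column-order traversal jumps `cols` doubles between accesses, so on
    // large matrices almost every access touches a new cache line.
    double sum_column_order(const std::vector<double>& m,
                            std::size_t rows, std::size_t cols) {
        double total = 0.0;
        for (std::size_t j = 0; j < cols; ++j)
            for (std::size_t i = 0; i < rows; ++i)
                total += m[i * cols + j];   // stride of `cols` elements
        return total;
    }

    // Row-order traversal visits memory sequentially, so each cache line
    // loaded is fully consumed before the next one is fetched.
    double sum_row_order(const std::vector<double>& m,
                         std::size_t rows, std::size_t cols) {
        double total = 0.0;
        for (std::size_t i = 0; i < rows; ++i)
            for (std::size_t j = 0; j < cols; ++j)
                total += m[i * cols + j];   // stride of 1 element
        return total;
    }

Both functions compute the same sum; only the order of memory accesses changes, which is the class of rewrite the access-pattern posts discuss.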

Why it matters

  • Better use of the memory subsystem can produce measurable speedups, especially for software handling large datasets.
  • Optimizations address both throughput and latency concerns, affecting performance-sensitive and low-latency applications.
  • Reducing memory-bandwidth and TLB pressure also lessens a program’s interference with co-located workloads and system neighbors.
  • Understanding data layout, access patterns and hardware interactions helps developers make targeted, effective performance changes.

Key facts

  • The collection comprises 18 posts on memory subsystem optimizations on Johnny’s Software Lab.
  • Topics include decreasing total memory accesses and changing access patterns to increase cache locality.
  • Several posts discuss changing data layout: class layout and the layout of common data structures (linked lists, trees, hash maps); a struct-of-arrays sketch follows this list.
  • Memory layout and custom allocators are treated separately from compile-time data layout.
  • Instruction-level parallelism and techniques to hide memory latency are covered.
  • There are dedicated posts on software prefetching, TLB misses and the use of huge pages; a huge-page sketch also follows this list.
  • The series addresses multithreading effects, branch prediction interactions, and latency-sensitive application techniques.
  • Guidance is provided on measuring memory-subsystem performance and a post covers remaining related topics.
  • The blog offers contact options for project performance discussions and vectorization training for teams.
  • Readers are encouraged to follow the site on LinkedIn, Twitter and Mastodon for new content.
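
As an illustration of the data-layout theme (a hypothetical example, not code from the posts), converting an array-of-structs into a struct-of-arrays keeps the fields a hot loop actually reads packed contiguously:

    #include <cstddef>
    #include <vector>

    // Array-of-structs: each Particle mixes hot and cold fields, so a loop
    // that only reads x and y still drags vx, vy and mass into the cache.
    struct ParticleAoS { double x, y, vx, vy, mass; };

    // Struct-of-arrays: each field is stored contiguously, so a loop over
    // x and y uses every byte of every cache line it loads.
    struct ParticlesSoA {
        std::vector<double> x, y, vx, vy, mass;
    };

    double sum_positions(const ParticlesSoA& p) {
        double total = 0.0;
        for (std::size_t i = 0; i < p.x.size(); ++i)
            total += p.x[i] + p.y[i];   // only the x and y arrays are touched
        return total;
    }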
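
For the TLB topic, one common approach on Linux is to hint that a large allocation should be backed by transparent huge pages, so fewer TLB entries cover the same working set. A minimal sketch using the standard mmap/madvise calls (the kernel is free to ignore the hint):

    #include <cstddef>
    #include <sys/mman.h>

    // Map an anonymous region and ask for transparent huge pages
    // (typically 2 MiB on x86-64). Larger pages mean fewer page-table
    // walks and fewer TLB misses for the same amount of data.
    void* alloc_huge_friendly(std::size_t bytes) {
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return nullptr;
        madvise(p, bytes, MADV_HUGEPAGE);   // hint only; may be ignored
        return p;
    }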

What to watch next

  • Follow the blog’s LinkedIn, Twitter or Mastodon accounts to be notified when new material appears (confirmed in the source).
  • Contact the site for consulting on performance problems or to arrange vectorization training for teams (confirmed in the source).
  • Any specific upcoming posts, publication schedule or planned extensions to the series are not confirmed in the source.

Quick glossary

  • Memory subsystem: The combination of caches, main memory, controllers and translation mechanisms that provides data to the CPU.
  • Cache locality: A property of programs where accessed data is near other recently accessed data, increasing the chance it resides in a fast cache.
  • TLB (Translation Lookaside Buffer): A small cache in the CPU that stores recent virtual-to-physical address translations to speed memory access.
  • Instruction-level parallelism (ILP): The ability of a CPU to execute multiple independent instructions simultaneously to improve throughput.
  • Software prefetching: Compiler- or programmer-inserted hints that request that the hardware load specific data into cache before it is needed (see the sketch below).
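
To make that last entry concrete, here is a hedged sketch of explicit software prefetching for a random-access pattern, using the GCC/Clang __builtin_prefetch intrinsic (the lookahead distance of 8 is an arbitrary placeholder; a real value has to be found by measurement):

    #include <cstddef>
    #include <vector>

    // Gather values at random indices. Prefetching the element needed a
    // few iterations ahead overlaps its memory latency with current work.
    double gather_sum(const std::vector<double>& data,
                      const std::vector<std::size_t>& indices) {
        constexpr std::size_t lookahead = 8;   // tuning parameter: measure!
        double total = 0.0;
        for (std::size_t i = 0; i < indices.size(); ++i) {
            if (i + lookahead < indices.size())
                __builtin_prefetch(&data[indices[i + lookahead]],
                                   /*rw=*/0, /*locality=*/1);
            total += data[indices[i]];
        }
        return total;
    }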

Reader FAQ

How many posts are in the series?
The collection contains 18 posts (confirmed in the source).

Are these techniques only for large datasets?
Most posts are aimed at software that handles large datasets, but some techniques apply regardless of dataset size (confirmed in the source).

Can I get training or consulting based on this material?
The site invites readers to contact them for performance discussions or vectorization training (confirmed in the source).

Does the series provide ready-made scripts or tools for measurement?
Not confirmed in the source.
