TL;DR
A practitioner working on ML feature engineering and multi-omics data hit practical limits when datasets became extremely wide (thousands to millions of columns). They describe a prototype distributed SQL design that distributes columns instead of rows and drops joins and transactions, and they report sub-second selects over column subsets on a small two-node cluster.
What happened
While developing feature pipelines for machine learning and multi-omics datasets, a contributor found that the challenge shifted from row volume to column count: schemas grew into the thousands of columns and beyond. They observed common systems struggling with extreme width: traditional relational databases topping out around ~1,000–1,600 columns, columnar files like Parquet needing heavy Spark/Python pipelines, and OLAP engines tending to assume narrower schemas. To address metadata, planning, and parsing bottlenecks, they experimented with a different architecture: no joins, no transactions, columns distributed across nodes rather than rows, and SELECT as the primary operation. On a small cluster of two AMD EPYC servers with 128 GB of RAM each, they reported creating a table with 1 million columns in roughly six minutes, inserting a single column of 1 million values in about two seconds, and selecting about 60 columns over ~5,000 rows in approximately one second. The author asked the community for other approaches that handle ultra-wide data without heavy ETL or complex joins.
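To make the column-distribution idea concrete, here is a minimal, hypothetical sketch of the storage layout it implies: columns are hashed to nodes, a whole column is written in one shot, and a SELECT touches only the nodes that own the requested columns. This is not the author's engine; every name in it is invented for illustration.

```python
# Hypothetical sketch of column-wise sharding (not the author's engine).
# Each column lives on exactly one node, so selecting a handful of columns
# only touches the nodes that own them.
from collections import defaultdict

NODES = ["node-a", "node-b"]  # mirrors the two-node cluster in the post

def place_column(col_name: str) -> str:
    """Assign a column to a node by hashing its name.
    Illustrative only: a real system would use a stable hash."""
    return NODES[hash(col_name) % len(NODES)]

class WideTable:
    def __init__(self, column_names):
        # node -> {column name -> list of values}
        self.shards = defaultdict(dict)
        for col in column_names:
            self.shards[place_column(col)][col] = []

    def insert_column(self, col, values):
        """Write an entire column at once (column-at-a-time ingest)."""
        self.shards[place_column(col)][col] = list(values)

    def select(self, cols, limit=None):
        """Fetch only the requested columns from the nodes holding them."""
        result = {}
        for col in cols:
            values = self.shards[place_column(col)].get(col, [])
            result[col] = values[:limit] if limit is not None else values
        return result

# Usage: a scaled-down stand-in for "select ~60 columns over ~5,000 rows"
table = WideTable([f"feat_{i}" for i in range(1_000)])
table.insert_column("feat_0", range(5_000))
subset = table.select(["feat_0", "feat_1"], limit=10)
```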
Why it matters
- Ultra-wide datasets are common in feature engineering and multi-omics workflows and can outstrip the assumptions of many existing DBMSs.
- Column-distributed designs could shift bottlenecks away from row-oriented scaling limits and enable fast access to column subsets.
- Metadata handling, SQL parsing, and query planning become critical failure points as schemas grow extremely wide.
- If practical, such designs may reduce the need for heavy ETL, many joins, or exploding data into multiple tables for very wide feature sets.
Key facts
- Author encountered workloads where column count, not row count, was the primary scaling challenge.
- Standard SQL databases were observed to cap out around ~1,000–1,600 columns in practice (PostgreSQL's hard limit, for example, is 1,600 columns per table).
- Columnar formats such as Parquet can store wide data but typically require Spark or Python pipelines to process (see the Parquet sketch after this list).
- OLAP engines can be fast but often assume relatively narrow schemas, per the author’s observation.
- Feature stores commonly work around width by exploding data into joins or multiple tables.
- Identified bottlenecks at extreme width include metadata handling, query planning, and SQL parsing.
- Experiment design choices: no joins, no transactions, columns distributed instead of rows, and SELECT as the main operation.
- Reported timings on a two-node cluster (AMD EPYC, 128 GB RAM each): creating a 1M-column table, ~6 minutes; inserting one 1M-value column, ~2 seconds; selecting ~60 columns over ~5,000 rows, ~1 second.
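As an illustration of the Parquet point in the list above, a typical Python pipeline reads only the needed columns out of a wide file. The file path and column names below are invented, and this shows ordinary pyarrow usage, not anything from the prototype.

```python
# Illustrative only: pull a narrow column subset out of a very wide Parquet
# file with pyarrow. The path and column names are made up.
import pyarrow.parquet as pq

table = pq.read_table(
    "features.parquet",                       # hypothetical ultra-wide file
    columns=["sample_id", "gene_expr_0001"],  # read just what you need
)
df = table.to_pandas()  # hand off to the usual Python/ML tooling
print(df.shape)
```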
What to watch next
- Whether this design supports transactional guarantees or complex join patterns at scale — not confirmed in the source.
- Integration and interoperability with existing ecosystems (Parquet, Spark, SQL tooling) — not confirmed in the source.
- Behaviour and performance on larger clusters and diverse workloads beyond the two-node test — not confirmed in the source.
Quick glossary
- Columnar format: A data storage layout that stores data column-by-column, often improving compression and analytic query performance for column-oriented access patterns (see the sketch after this glossary).
- OLAP engine: A system optimized for analytical queries over large datasets, typically designed for read-heavy workloads and aggregations.
- Feature store: A system for managing and serving features for machine learning, often handling feature computation, storage, and retrieval.
- Distributed SQL: An SQL-capable system that runs across multiple machines, distributing data and query execution to scale beyond a single node.
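For the "Columnar format" entry above, here is a toy comparison of the same table in row-oriented versus column-oriented layout; it is a pure illustration with made-up values and implies no particular engine.

```python
# The same tiny table laid out row-by-row and column-by-column
# (illustration only; no specific storage engine implied).

# Row-oriented: each record kept together; good for fetching whole rows.
rows = [
    {"id": 1, "height_cm": 170, "weight_kg": 68.0},
    {"id": 2, "height_cm": 182, "weight_kg": 80.5},
    {"id": 3, "height_cm": 165, "weight_kg": 61.2},
]

# Column-oriented: each column kept together; good for scanning or
# compressing a few columns out of many.
columns = {
    "id":        [1, 2, 3],
    "height_cm": [170, 182, 165],
    "weight_kg": [68.0, 80.5, 61.2],
}

# Reading one column touches a single contiguous list in the columnar layout,
heights = columns["height_cm"]
# but requires visiting every record in the row layout.
heights_from_rows = [r["height_cm"] for r in rows]
```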
Reader FAQ
What specific problem is this attempt trying to solve?
Handling very wide tables with thousands to millions of columns for use cases like ML feature engineering and multi-omics.
Does the design use joins or transactions?
The prototype intentionally avoids joins and transactions.
What hardware were the performance numbers measured on?
A two-server cluster using AMD EPYC processors with 128 GB RAM per node.
Are broader production results or integration details provided?
Not confirmed in the source.
From the original post:
"I ran into a practical limitation while working on ML feature engineering and multi-omics data. At some point, the problem stops being "how many rows" and becomes "how many columns"…"
Sources
- Ask HN: Distributed SQL engine for ultra-wide tables
- R2 SQL: a deep dive into our new distributed query engine
- DDL Execution Optimized: Unleashing TiDB Scalability
- PostgreSQL's Surging Popularity and Innovation