I went to San Francisco for Small Data SF, a conference sponsored by MotherDuck. The premise of the event was that smaller sets of data are both very useful and prevalent. The manifesto speaks to me, as I am a big fan of smaller data sets. I also think that most of the time we can use less data than we think we need, especially recent data, which is often the most relevant; instead, we end up with contorted queries that try to weight new and old data differently to reflect this. Maybe the best line for me is this one:
Bigger data has an opportunity cost: Time.
I think time is a very valuable commodity and large sets of data can slow you down. There's also the chance that looking at too much data starts to blur our understanding. We may start to miss information in our dataset, or we may find people arguing about what the data means, because with so much data we can find support for any position somewhere in the vast sea of numbers, strings, and dates.
Big data also has a real cost in resources, often money. One of the examples was from the organizer, who once gave a demo on stage, querying a PB of data. That's impressive, and lots of us would want to be able to query our very-large-but-less-than-PB-sized data in minutes. However, what wasn't disclosed in the demo was that the query cost over US$5,000.
I've heard from a number of customers and speakers that most people don't have big data. Most of us have working sets in the hundreds of GB, sometimes with TB-sized archives in the same database that slow everything down. If we could easily extract the useful data, we could query those hundreds of GB much more efficiently.
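As a rough sketch of what that separation might look like, here's a minimal DuckDB example that carves a recent working set out of a larger table. The table name, column name, and 90-day cutoff are all placeholders, not anything shown at the conference.

```python
import duckdb

# Hypothetical schema: a large "events" table with a created_at timestamp.
con = duckdb.connect("analytics.duckdb")

# Keep the last 90 days in a separate working table so everyday queries
# don't have to scan years of archive rows. The names and cutoff are
# illustrative only.
con.execute("""
    CREATE OR REPLACE TABLE events_working AS
    SELECT *
    FROM events
    WHERE created_at >= now() - INTERVAL 90 DAY
""")

# Routine queries hit the small table; the full archive stays available
# for the occasional historical question.
print(con.sql("SELECT count(*) FROM events_working").fetchone())
```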
This is especially true in the era of small devices that can handle something close to a TB of data in a small form factor. With columnar systems that compress data, a TB of raw data might shrink substantially when stored as Parquet files or loaded into an analysis system like DuckDB. In that case, we might realistically search and analyze 1TB of data on a laptop.
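For a concrete sense of the workflow, here's a small sketch of writing data out as compressed Parquet and querying it with DuckDB on a laptop. The file and column names are assumptions for illustration, not a specific dataset from the talk.

```python
import duckdb

con = duckdb.connect()

# Columnar Parquet with zstd compression often shrinks raw data dramatically,
# which is what makes the "1 TB on a laptop" idea plausible.
con.execute("""
    COPY (SELECT * FROM 'raw_events.csv')
    TO 'events.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")

# DuckDB reads only the columns a query touches, so an aggregation over the
# Parquet file stays fast even on modest hardware.
result = con.sql("""
    SELECT date_trunc('day', created_at) AS day, count(*) AS events
    FROM 'events.parquet'
    GROUP BY day
    ORDER BY day
""")
print(result.fetchall())
```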
I know that big data is relative, but many of us face challenges with data sizes and query performance. I know lots of you embrace the challenge and see working with TB (or larger) systems as a badge of honor. I also know the reality is that most of us struggle to separate our archive data from current working data in our systems. However, if we could, would most of you want to work with smaller data sets or do you enjoy large ones? I know which way I lean.