It seems that every year companies adopt new ways of analyzing information. In this era of Big Data, with the challenges of real-time BI analysis of (often) streaming sets of data, companies search for ways to handle the load. We had map-reduce methods to process bits a few years back, and lately machine learning (and deep learning) has grown in popularity as a way to gain insights from the massive data sets we have.
The problem is that when we try to analyze data, we often find we don't have enough of it. While some parts of our organizations face a surplus of data, others trying to provide an analysis might face a shortage, at least for some types of data. This might be especially true when business people want to engage in a new type of business or a new way of working with customers.
The last couple of years have given rise to a number of companies that gather and sell labeled data, or even generate synthetic data that can be used to build and train models for analysis. As we look to let machines learn to solve some problems on their own, we need to provide them with lots of data, and supplying that data has become big business. I have heard of companies paying six or seven figures a year to get data sets for their data scientists.
In some sense, as noted in this keynote, data is the more valuable part of these systems. Staff matters, and certainly the software and models are important, but the data is key. Good data, with lots of features, can produce a better trained system than poor data. Many of us who work in traditional software know this as well. If we use poor data sets in development, with limited values and without the skew and selectivity that we'll see in the live system, we often build lower quality software with more bugs.
I also think that our data is more valuable than we realize, and far too many developers don't take advantage of the data our organization does have to build features and properly test them. Too few of us test things well to begin with, and we often can't without a good set of data. I've been disappointed with random generators, though they are useful in that they can find unexpected issues from the random values, including NULLs, that will creep into systems. I really wish we had better subsetting tools that would help us use a portion of our production data. Redgate is working on tooling, but I'd have thought this was a problem we'd have gotten better at solving by now, between software people and database staff.
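To make the subsetting idea a little more concrete, here is a minimal sketch in Python using the standard library's sqlite3 module. It is only an illustration of the general technique, not how Redgate's or anyone else's tooling works. The two-table schema, the function names (`build_demo_source`, `subset`), and the generated rows are all hypothetical stand-ins for a production database. The point is that a subset picks a sample of parent rows and then pulls every child row that belongs to them, so foreign keys stay valid and each sampled customer keeps the order history it has in the source.

```python
import random
import sqlite3

# Hypothetical two-table schema standing in for a production database.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount REAL
);
"""


def build_demo_source() -> sqlite3.Connection:
    """Create an in-memory 'production' database with skewed order counts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    rng = random.Random(42)
    for cid in range(1, 101):
        conn.execute(
            "INSERT INTO customers VALUES (?, ?)",
            (cid, rng.choice(["EU", "US", "APAC"])),
        )
        # Most customers place a few orders; an occasional one places many.
        for _ in range(rng.choice([0, 1, 1, 2, 2, 3, 25])):
            conn.execute(
                "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
                (cid, round(rng.uniform(5, 500), 2)),
            )
    conn.commit()
    return conn


def subset(source: sqlite3.Connection, fraction: float, seed: int = 0) -> sqlite3.Connection:
    """Copy a random fraction of customers, plus all of their orders, to a new database."""
    target = sqlite3.connect(":memory:")
    target.executescript(SCHEMA)
    ids = [row[0] for row in source.execute("SELECT id FROM customers ORDER BY id")]
    keep = random.Random(seed).sample(ids, max(1, int(len(ids) * fraction)))
    for cid in keep:
        customer = source.execute(
            "SELECT id, region FROM customers WHERE id = ?", (cid,)
        ).fetchone()
        target.execute("INSERT INTO customers VALUES (?, ?)", customer)
        # Bring along every child row so no order is left without its customer.
        orders = source.execute(
            "SELECT id, customer_id, amount FROM orders WHERE customer_id = ?", (cid,)
        ).fetchall()
        target.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    target.commit()
    return target
```

Calling `subset(build_demo_source(), 0.1)` returns a database holding a tenth of the customers along with their complete order histories. A real tool has far more to deal with, such as many tables, circular references, and sensitive values, which is part of why this is harder than it looks.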
I've had a nice career working with data, and I'm glad that the recognition of the value of data has continued to grow through the years. Now I'd like to see us start to emphasize the importance of producing and using more useful data sets when we build software, whether by traditional means or with machine learning techniques. My guess is we'll get more useful and better quality software if we do.