Today we have a guest editorial as Steve is traveling.
I’d like to think that most enterprises and all developers get the value of having a single source of truth, though of course only us data people really get it. We’re comfortable with the concepts of normalization, have experienced the pain of denormalizing, have researched more than once the question of ‘why doesn’t this report match that report’, and we get that data is rarely pristine. Even if imperfect, having a single source of truth is a really important thing.
For many people, and especially for us data people, describing the single source of truth has an easy answer - the database. We capture OLTP data in it, then we might copy it directly or in some transformed way to reporting servers, data warehouses, search engines, and more, always knowing that while it’s all supposed to be equivalent that sometimes it is not. However, we can figure it out by going back to the database. We also know that as long as all reports are based on the same copy of the truth that all the reports should match (even if they are wrong!).
Does that mean the source of truth has to be in the database? Or more nuanced, does all of the the truth have to be stored in the database?
It’s tempting to say yes. Oh, it is tempting. But practical? Not if you want to leverage external systems and providers. One example might be a CRM application. Does the sales team enter prospects and new customers in the CRM app and have them propagated to the database? Or does it get entered in the database and then pushed into the CRM app? Another example might be if you use an email service like SendGrid. They know how many messages bounced so you have to get that particular truth from them. Yes you can pull into the database and treat it like truth, but if there is ever a doubt about the numbers, the database is not the single source of truth for that particular data point.
The net of all this is that having a single source of truth is critical and it doesn’t mean it all has to be in the same place. I think that’s an important reminder for those of us whose view of the world is database centric. The goal is to have a single truth for each datapoint, if we can achieve that then we can make decisions about whether to copy that data somewhere to simplify usage or to mash it all up at a layer above the database.