Data Debt

  • Comments posted to this topic are about the item Data Debt

  • At my last job, my primary responsibility was identifying and correcting years of data debt. The company management initiated the process and demanded that a thorough cleanup be accomplished. Years of data definition and SQL coding with little or no standards had created a scene from the Wild West. The cleanup involved repeatedly identifying, redefining, renaming, and describing database column names and attributes so that they were consistent across the entire collection of systems on SQL Server: more than 1,600 tables containing 26 million+ records. I worked side by side with the founder and CEO of the company, who supplied the background for each data element and a clear description of each element's purpose and interaction with other data elements. His remarkable memory and attention to detail made this work.

    When we first started, it felt like trying to chew through granite. As time went on, however, our methods were refined and improved, which reduced the time required to make the necessary changes. My "technical partners" were Microsoft Team Foundation Server (TFS) for configuration management, the Perl programming language, the Vim editor, and of course SQL.
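
    To give a flavor of the method, here is a minimal sketch of the kind of audit query that drove the renaming work. It is not one of our actual scripts, and the names in the rename example are invented.

      -- Column names that are defined with more than one data type or length
      -- somewhere in the database; each of these is a standardization candidate.
      SELECT  COLUMN_NAME,
              COUNT(DISTINCT CONCAT(DATA_TYPE, '(',
                    COALESCE(CHARACTER_MAXIMUM_LENGTH, 0), ')')) AS definitions,
              COUNT(*) AS columns_using_the_name
      FROM    INFORMATION_SCHEMA.COLUMNS
      GROUP BY COLUMN_NAME
      HAVING  COUNT(DISTINCT CONCAT(DATA_TYPE, '(',
                    COALESCE(CHARACTER_MAXIMUM_LENGTH, 0), ')')) > 1
      ORDER BY columns_using_the_name DESC;

      -- Renaming itself can be as simple as this (hypothetical names):
      EXEC sp_rename 'dbo.Orders.CustNm', 'CustomerName', 'COLUMN';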

    The payoff has been huge. Standards are now in place with appropriate oversight to ensure the standards are followed. Testing new features has become easier and more seamless. Responding to bug reports no longer involves fighting to understand the code base. I am retired now, but what a great way to end an IT career that started in 1969!

    Roy Fulbright
    Computer Consultant

  • I worked for a company that had several different lines of business, all acting independently. This meant that customer contact details were recorded in whatever way each line of business had come up with. Some of those lines of business had come together to share a common approach because they had shared interests and an opportunity for cost savings. Standardising on a single vendor API for address validation made sense.

    Over time more lines of business standardised on that single vendor API. Someone realised that a competing vendor API was close enough that putting an internal wrapper around the two would give us some advantages (there is a rough sketch of the data side of this after the list below):

    • Present a common data representation regardless of the vendor API
    • Provide resilience if one vendor API went down
    • Spread the load across APIs according to the pricing plan
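
    To illustrate just the first of those points, the common representation can be pictured as a view over vendor-specific staging tables, along these lines (all table and column names here are invented; the resilience and load-spreading lived in the wrapper code itself):

      -- Hypothetical staging tables populated from each vendor's validation API,
      -- presented to everything downstream in one agreed shape.
      CREATE VIEW dbo.ValidatedAddress
      AS
      SELECT  CustomerId,
              BuildingNameOrNumber,
              Street,
              PostTown,
              Postcode,
              'VendorA' AS SourceApi
      FROM    dbo.VendorA_AddressResult
      UNION ALL
      SELECT  CustomerId,
              PremisesName,
              Thoroughfare,
              Town,
              Postcode,
              'VendorB' AS SourceApi
      FROM    dbo.VendorB_AddressResult;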

    We also started to look at our reference data. In the UK insurance industry there are a number of standard reference data sets. Some of these are also applicable to lines of business that aren't insurance.

    This all sounds well and good, and it would be but for human fallibility. There are always people who push against standards and norms. Occasionally the reasons are good, because natural evolution applies to standards too if they are to remain relevant. Mostly it's contrariness. One line of business decided that a UK address would be best represented as a single string/varchar column. That broke the geodemographic analysis and recommendation systems that relied on the separate address parts. As one exasperated friend put it, "Is it really necessary for them to shoot themselves in both feet?!"

    One of the things I am looking into at the moment is bi-directional data contracts.

    • Outbound contract for a data product.  For example, 50 attributes are defined.
    • Inbound contracts are for all data consumers of the product. In aggregate, only 35 attributes of the product are actually used.

    We know that we have complete freedom to change the contract for the 15 attributes no-one is using.

    We also know that we need robust data tests for the 35 attributes that people are using if we change the mechanisms and data flows that supply the Data Product.
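
    A minimal sketch of what such a data test could look like on SQL Server follows; the contract table and all of the names are invented.

      -- Hypothetical contract metadata: one row per attribute a consumer depends on.
      CREATE TABLE dbo.InboundContract
      (
          ConsumerName  sysname NOT NULL,
          TableName     sysname NOT NULL,
          ColumnName    sysname NOT NULL,
          ExpectedType  sysname NOT NULL
      );

      -- Data test: any row returned here is a promise to a consumer that the
      -- data product no longer keeps.
      SELECT  ic.ConsumerName, ic.TableName, ic.ColumnName,
              ic.ExpectedType, c.DATA_TYPE AS ActualType
      FROM    dbo.InboundContract AS ic
      LEFT JOIN INFORMATION_SCHEMA.COLUMNS AS c
             ON  c.TABLE_NAME  = ic.TableName
             AND c.COLUMN_NAME = ic.ColumnName
      WHERE   c.COLUMN_NAME IS NULL            -- attribute has disappeared
           OR c.DATA_TYPE <> ic.ExpectedType;  -- or its type has drifted

    Running a test like this as part of the delivery pipeline means a change to the data product fails fast rather than silently breaking a consumer.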

  • First, "data debt" smells a lot like "technical debt".

    Regarding data warehouses (not to be confused with data lakes): if a warehouse is architected based on something like the Kimball dimensional method, then it specifically addresses issues like integration of multiple source systems, de-duplication, and labeling.
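
    As a rough sketch (all names invented), a conformed customer dimension is where that integration, de-duplication, and labeling end up living: one surrogate key per real customer, the natural keys from each source system kept alongside it, and a single agreed label.

      -- Hypothetical conformed dimension on SQL Server.
      CREATE TABLE dbo.DimCustomer
      (
          CustomerKey   int IDENTITY(1,1) PRIMARY KEY, -- surrogate key owned by the warehouse
          SourceAKey    varchar(20)   NULL,            -- natural key from source system A
          SourceBKey    varchar(20)   NULL,            -- natural key from source system B
          CustomerName  nvarchar(200) NOT NULL,        -- the one agreed, conformed label
          Postcode      varchar(10)   NULL
          -- One row per real customer, however many source records refer to them:
          -- this is where the de-duplication happens.
      );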

     

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
