Data Debt

  • Comments posted to this topic are about the item Data Debt

  • At my last job, my primary responsibility was identifying and correcting years of data debt. The company management initiated the process and demanded that a thorough cleanup be accomplished. Years of data definition and SQL coding with little or no standards had created a scene from the Wild West. The cleanup involved repeatedly identifying, redefining, renaming, and describing database column names and attributes so that they were consistent across the entire collection of systems on SQL Server: more than 1,600 tables containing 26 million+ records. I worked side by side with the founder and CEO of the company, who supplied the background for each data element and a clear description of each element's purpose and interaction with other data elements. His remarkable memory and attention to detail made this work.

    When we first started, it felt like trying to chew through granite. As time went on, however, our methods were refined and improved, which reduced the time required to make the necessary changes. My "technical partners" were Microsoft Team Foundation Server (TFS) for configuration management, the Perl programming language, the Vim editor, and of course SQL.
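
    To give a flavor of the method, here is a minimal sketch of the kind of audit query that drove the renaming work. It is not one of our actual scripts, and the names in the rename example are invented.

      -- Column names that are defined with more than one data type or length
      -- somewhere in the database; each of these is a standardization candidate.
      SELECT  COLUMN_NAME,
              COUNT(DISTINCT CONCAT(DATA_TYPE, '(',
                    COALESCE(CHARACTER_MAXIMUM_LENGTH, 0), ')')) AS definitions,
              COUNT(*) AS columns_using_the_name
      FROM    INFORMATION_SCHEMA.COLUMNS
      GROUP BY COLUMN_NAME
      HAVING  COUNT(DISTINCT CONCAT(DATA_TYPE, '(',
                    COALESCE(CHARACTER_MAXIMUM_LENGTH, 0), ')')) > 1
      ORDER BY columns_using_the_name DESC;

      -- Renaming itself can be as simple as this (hypothetical names):
      EXEC sp_rename 'dbo.Orders.CustNm', 'CustomerName', 'COLUMN';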

    The payoff has been huge. Standards are now in place with appropriate oversight to ensure the standards are followed. Testing new features has become easier and more seamless. Responding to bug reports no longer involves fighting to understand the code base. I am retired now, but what a great way to end an IT career that started in 1969!

    Roy Fulbright
    Computer Consultant

  • I worked for a company that had several different lines of business, all acting independently. This meant that customer contact details were recorded in whatever way each line of business had come up with. Some of those lines of business had come together to share a common approach because they had shared interests and an opportunity for cost savings. Standardising on a single vendor API for address validation made sense.

    Over time more lines of business standardised on that single vendor API. Someone realised that a competing vendor API was close enough that putting an internal wrapper around the two would give us some advantages (there is a rough sketch of the data side of this after the list below):

    • Present a common data representation regardless of the vendor API
    • Provide resilience if one vendor API went down
    • Spread the load across APIs according to the pricing plan
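
    To illustrate just the first of those points, the common representation can be pictured as a view over vendor-specific staging tables, along these lines (all table and column names here are invented; the resilience and load-spreading lived in the wrapper code itself):

      -- Hypothetical staging tables populated from each vendor's validation API,
      -- presented to everything downstream in one agreed shape.
      CREATE VIEW dbo.ValidatedAddress
      AS
      SELECT  CustomerId,
              BuildingNameOrNumber,
              Street,
              PostTown,
              Postcode,
              'VendorA' AS SourceApi
      FROM    dbo.VendorA_AddressResult
      UNION ALL
      SELECT  CustomerId,
              PremisesName,
              Thoroughfare,
              Town,
              Postcode,
              'VendorB' AS SourceApi
      FROM    dbo.VendorB_AddressResult;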

    We also started to look at our reference data. In the UK insurance industry there are a number of standard reference data sets. Some of these are also applicable to lines of business that aren't insurance.

    This all sounds well and good, and it would be but for human fallibility. There are always people who push against standards and norms. Occasionally the reasons are good, because natural evolution applies to standards too if they are to remain relevant. Mostly it's contrariness. One line of business decided that a UK address would be best represented as a single string/varchar column. That broke the geodemographic analysis and recommendation systems that relied on the separate address parts. As one exasperated friend put it, "Is it really necessary for them to shoot themselves in both feet?!"

    One of the things I am looking into at the moment is bi-directional data contracts.

    • Outbound contract for a data product.  For example, 50 attributes are defined.
    • Inbound contracts are for all data consumers of the product. In aggregate, only 35 attributes of the product are actually used.

    We know that we have complete freedom to change the contract for the 15 attributes no-one is using.

    We also know that we need robust data tests for the 35 attributes that people are using if we change the mechanisms and data flows that supply the Data Product.
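
    A minimal sketch of what such a data test could look like on SQL Server follows; the contract table and all of the names are invented.

      -- Hypothetical contract metadata: one row per attribute a consumer depends on.
      CREATE TABLE dbo.InboundContract
      (
          ConsumerName  sysname NOT NULL,
          TableName     sysname NOT NULL,
          ColumnName    sysname NOT NULL,
          ExpectedType  sysname NOT NULL
      );

      -- Data test: any row returned here is a promise to a consumer that the
      -- data product no longer keeps.
      SELECT  ic.ConsumerName, ic.TableName, ic.ColumnName,
              ic.ExpectedType, c.DATA_TYPE AS ActualType
      FROM    dbo.InboundContract AS ic
      LEFT JOIN INFORMATION_SCHEMA.COLUMNS AS c
             ON  c.TABLE_NAME  = ic.TableName
             AND c.COLUMN_NAME = ic.ColumnName
      WHERE   c.COLUMN_NAME IS NULL            -- attribute has disappeared
           OR c.DATA_TYPE <> ic.ExpectedType;  -- or its type has drifted

    Running a test like this as part of the delivery pipeline means a change to the data product fails fast rather than silently breaking a consumer.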

  • First, "data debt" smells a lot like "technical debt".

    Regarding data warehouses (not to be confused with data lakes): if a warehouse is architected based on something like the Kimball dimensional method, then it specifically addresses issues like integration of multiple source systems, de-duplication, and labeling.
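
    As a rough sketch (all names invented), a conformed customer dimension is where that integration, de-duplication, and labeling end up living: one surrogate key per real customer, the natural keys from each source system kept alongside it, and a single agreed label.

      -- Hypothetical conformed dimension on SQL Server.
      CREATE TABLE dbo.DimCustomer
      (
          CustomerKey   int IDENTITY(1,1) PRIMARY KEY, -- surrogate key owned by the warehouse
          SourceAKey    varchar(20)   NULL,            -- natural key from source system A
          SourceBKey    varchar(20)   NULL,            -- natural key from source system B
          CustomerName  nvarchar(200) NOT NULL,        -- the one agreed, conformed label
          Postcode      varchar(10)   NULL
          -- One row per real customer, however many source records refer to them:
          -- this is where the de-duplication happens.
      );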

     

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
