SQLServerCentral Editorial

Analyzing Breached Data


A few of you out there might be data scientists who profile data regularly. Probably a fair number of you do import/export work and learn to check data values, perhaps with counts, distincts, or other aggregates. I don't know if the performance tuners out there look at the skew of data or the details of what is in a query that needs improvement. However, all of you are likely familiar with data and trying to query it for some type of meaning.

One of the largest data breaches occurred at National Public Data. Troy Hunt analyzed the breach as part of his work with Have I Been Pwned. The piece is an interesting analysis of the data, trying to determine both its legitimacy and what is actually included in the breach. It's a fascinating read, and I encourage you to look at it not just from the data analysis side, but also to be aware of what data about you is being aggregated and sold by companies.

The read is interesting because it is a bit of a detective story, digging through data in a folder, which is something I've had to do. I've had people in previous jobs just dump a bunch of data on me and ask me to load it into a database, or into a single table, often without knowing what type of data it is or what formats it uses. Do the files relate to each other? Are there multiple tables' worth of data in one file? These are all questions I've had to ask myself (and answer), and they are similar to what Troy did to analyze the breach.
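As a rough illustration of that kind of first pass, here is a minimal Python sketch that profiles one delimited file: a row count, plus non-empty and distinct counts per column. It is not the method used in Troy's analysis. The function name `profile_csv` and the sample columns are made up for illustration, and the sketch assumes the file has a header row and is comma-delimited.

```python
import csv
import io
from collections import Counter


def profile_csv(handle):
    """Return the row count plus simple per-column statistics.

    Assumes a comma-delimited file whose first line is a header row.
    """
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    # One Counter per column, tracking how often each value appears.
    values = {name: Counter() for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            # Short rows yield None for missing fields; treat them as empty.
            value = (row.get(name) or "").strip()
            if value:
                values[name][value] += 1
    return {
        "rows": rows,
        "columns": {
            name: {
                "non_empty": sum(counts.values()),
                "distinct": len(counts),
                "most_common": counts.most_common(3),
            }
            for name, counts in values.items()
        },
    }


# Hypothetical sample standing in for one file from an unlabeled dump.
sample = io.StringIO(
    "first_name,last_name,state\n"
    "Ann,Lee,CO\n"
    "Ann,Lee,CO\n"
    "Bob,,TX\n"
)
report = profile_csv(sample)
for name, stats in report["columns"].items():
    print(name, stats)
```

A column where the distinct count equals the row count is a candidate key, while a low distinct count hints at a code or lookup value. Holding every distinct value in memory is fine for a sample, but a very large file would call for sampling or for loading the data into a staging table first.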

Data is very important to many of us, in different ways, but I'm often amazed at how few people actually understand how to organize data and ensure others can track the metadata about their data (what their data represents). I'm guessing this is why every person who gets an extract of data to load into Excel formats it in a different way.

In many cases, people want the ability to query data, but they prefer to just focus on one table that contains a lot of information. They don't want to know how to "join" data together. I think this might be the reason we see so many views in databases, and why we have views built on views. Each new client of the database needs their own view structure.

The world of data is a mess, even inside an organization. Once we start moving data between organizations, it's truly a mess. We might bemoan all the inefficiencies and the work we do to move, change, and reload data as custom, human ETL machines, but there is one great thing about this tangled web. It provides steady, secure jobs for many of us, with no end of work in sight.
