February 22, 2016 at 9:54 pm
Comments posted to this topic are about the item A Lightweight, Self-adjusting, Baseline-less Data Monitor
February 22, 2016 at 10:10 pm
I've advocated for years that many of our metrics for data quality should be based on statistical analysis rather than brute force counting of every single value in every single row.
At one of my contracts we were collecting data from a consumer "internet streaming" device that "called home" constantly while in use in millions of consumer locations.
There were times when a firmware update could - and did - unintentionally break "part" of the data collection but not all of it. For example, the volume of data coming in might be normal, but a particular data value might be completely broken.
I implemented a sampling algorithm in T-SQL that dynamically calculated the standard deviation for the occurrence and range of selected values in a table on a daily basis, and could then determine whether the current/latest data load contained any value ranges that fell outside the expected deviation.
So we could quickly detect any unplanned/unexpected loss of data in the collection system. Without an automated system it would have been prohibitively expensive to manually verify the data on a daily basis...
And without statistical analysis it would have been too expensive in computing resources and time.
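The idea described above can be sketched quite compactly. This is a hypothetical Python illustration of the technique, not the original T-SQL: compare the latest daily metric (e.g. row count for one collected value) against the mean and standard deviation of its recent history, and flag it if it deviates by more than some number of standard deviations. The function name, sample data, and 3-sigma threshold are all illustrative assumptions.

```python
# Sketch of a baseline-less anomaly check: flag a daily metric whose
# latest value falls outside n_sigma standard deviations of its history.
# Illustrative only -- the commenter's actual implementation was T-SQL.
from statistics import mean, stdev

def is_anomalous(history, latest, n_sigma=3.0):
    """Return True if `latest` deviates from the historical mean by
    more than n_sigma sample standard deviations."""
    if len(history) < 2:
        return False  # too little history to estimate a deviation
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # flat history: any change is anomalous
    return abs(latest - mu) > n_sigma * sigma

# Example: daily row counts for one collected value
daily_counts = [10_120, 9_980, 10_055, 10_210, 9_890, 10_140]
print(is_anomalous(daily_counts, 10_050))  # normal day -> False
print(is_anomalous(daily_counts, 3_200))   # partial collection failure -> True
```

Because only per-day aggregates (counts, means, ranges) are examined rather than every row, the check stays cheap even at millions of devices, which is the computational saving the comment describes.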
February 23, 2016 at 3:15 am
Excellent article and some very cogent comments by kenambcrose.
...One of the symptoms of an approaching nervous breakdown is the belief that one's work is terribly important.... Bertrand Russell
February 23, 2016 at 7:58 am
Excellent article, well written and an easy read. I need to look at/play around with your code a little more to understand what you're doing but I love the overall approach.
-- Itzik Ben-Gan 2001
February 23, 2016 at 1:03 pm
The R or Python stuff is useful if the statistical work is more complex. Most of what can be done in R can easily work in Python, which is a great language for linking systems or data together.
R Services will allow you to run R in SQL Server without the data movement in/out of SQL, which is nice, but that's only in SQL Server 2016. For most people, and most systems, this won't be reality for years.
March 9, 2016 at 4:43 pm
Great article, thanks.
November 11, 2016 at 6:03 pm
Very interesting article that I tucked away and recently had a reason to revisit. I am not sure I am understanding it completely. Do you have any other references to the technique you are using that you could share? Is this just a standard statistical method of forming a probability distribution from actual data?
Thanks for any pointers you can provide.
October 13, 2017 at 10:07 am
I just spent 30 minutes reading and re-reading this article. It is well thought out, noting its strong points and where it can be weak. The writer presents the ideas in a clear and concise manner. Bravo! I wish more articles discussing an interesting idea were like this one.
October 13, 2017 at 11:22 am
I am sorry that I didn't see this last year - great article!