September 9, 2020 at 12:00 am
Comments posted to this topic are about the item Do You Have Big Data?
September 9, 2020 at 7:47 am
I liked Buck Woody's definition of big data. Apologies to Buck if I get it wrong: "Big data is data you struggle to process in time with the technology you have available today."
There's a lot of leeway in that definition. If I need to process a volume of data, what is the required turn-around time? Processing a TB overnight might not qualify, but doing it in under 5 minutes might.
Maybe an organisation doesn't have deep enough pockets for the technology to do the job in the timescale, whereas a Times 250 company regards the sums involved as merely the cost of doing business.
It also depends on what the intended use of the data is. I had a conversation with a data scientist and they said, "I couldn't give a stuff about data volumes; we've had sampling techniques for years, and the answers from the full dataset and the sample dataset are remarkably similar. What I care about is the data quality; can I trust it?"
September 9, 2020 at 2:34 pm
Data quality is a big deal, and sampling can be good, but we've also seen lots of sampling bias that comes out later. Of course, we also have overwhelming data sets that hide things. I think about some of the image recognition that struggles with people of color because the dominant sets are of Caucasian skin.
September 9, 2020 at 2:48 pm
It also depends on what the intended use of the data is. I had a conversation with a data scientist and they said, "I couldn't give a stuff about data volumes; we've had sampling techniques for years, and the answers from the full dataset and the sample dataset are remarkably similar. What I care about is the data quality; can I trust it?"
That entirely depends on what question you're trying to answer. There's a huge difference between "we mischarged X customers" and "we mischarged X% of customers", even though both numbers have very valid uses.
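For illustration, the two questions really are different queries. This is only a sketch with a made-up dbo.Charge table and WasMischarged flag: the absolute count needs the full population, while the percentage can be estimated from a sample.

-- How many customers did we mischarge? Needs every row.
SELECT COUNT(DISTINCT CustomerID) AS MischargedCustomers
FROM   dbo.Charge
WHERE  WasMischarged = 1;

-- What percentage of customers did we mischarge? A sample can get close.
SELECT 100.0 * COUNT(DISTINCT CASE WHEN WasMischarged = 1 THEN CustomerID END)
             / NULLIF(COUNT(DISTINCT CustomerID), 0) AS MischargedPercent
FROM   dbo.Charge TABLESAMPLE (10 PERCENT);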
September 9, 2020 at 5:06 pm
Overall data volumes are growing at a faster rate than our growth in processing capabilities.
Wrong.
Volumes are growing at the same pace as processing capabilities.
When processing capabilities are not sufficient they hire better developers to build more efficient processes.
_____________
Code for TallyGenerator
September 9, 2020 at 7:54 pm
Heh... to Sergiy's point, I wonder how many supposed "Big Data" problems are actually BIG "poor design", "poor implementation", "poor requirements", and "poor coding" problems. To wit, my definition of "Big Data" is "Any data that you don't actually know how to handle". 😀
For example... one of the companies I do work for has clients whose loan numbers have leading zeros. They're complaining about there being too much data to do loan number searches. The way they do such a thing is to strip the leading zeros from the input and do a leading wildcard search on the loan number, which also uses a trailing wildcard just in case someone decides not to search for the whole loan number. Would you call that a "Big Data" problem? They do, and they're in search of technology that will fix the problem even though I've told them how to fix it with what they have.
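For illustration only, here's roughly what that search pattern looks like, with a made-up table, column, and width, plus one sargable alternative for whole-loan-number lookups (just one possible fix, not necessarily the one they were offered):

DECLARE @Input varchar(10) = '123456';  -- user input with the leading zeros already stripped

-- The pattern described above: the leading wildcard rules out an index seek,
-- so every search scans all of the loan numbers.
SELECT l.*
FROM   dbo.Loan AS l
WHERE  l.LoanNumber LIKE '%' + @Input + '%';

-- A sargable alternative for exact matches: pad the input back to the stored
-- fixed width (leading zeros intact) so an ordinary index seek can be used.
SELECT l.*
FROM   dbo.Loan AS l
WHERE  l.LoanNumber = RIGHT(REPLICATE('0', 10) + @Input, 10);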
--Jeff Moden
Change is inevitable... Change for the better is not.
September 10, 2020 at 3:00 am
That's nothing.
How about two multi-million-row tables (one of them closer to a billion rows) joined on a value extracted from an XML column?
Yes, joined by XML_col.value('XML path', 'int').
The job running that task effectively forced the end of the day for the whole office. The state-of-the-art server (no kidding, a really enviable configuration many can only dream about) was effectively down for any other task.
It produced the record-breaking lowest PLE (page life expectancy) I ever saw: 6 ms.
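For illustration only, roughly the shape of that join with made-up table, column, and path names, followed by one common mitigation (an assumption on my part, not necessarily what that shop did):

-- The expensive shape: the XML is shredded for every row on every execution.
SELECT b.SomeKey, x.XmlCol
FROM   dbo.BigTable       AS b
JOIN   dbo.XmlSourceTable AS x
       ON b.SomeID = x.XmlCol.value('(/Root/SomeID)[1]', 'int');

-- One mitigation: extract the key once at load time into a plain int column,
-- index it, and join on that instead.
ALTER TABLE dbo.XmlSourceTable ADD SomeID int NULL;

UPDATE dbo.XmlSourceTable
SET    SomeID = XmlCol.value('(/Root/SomeID)[1]', 'int');

CREATE INDEX IX_XmlSourceTable_SomeID ON dbo.XmlSourceTable (SomeID);

SELECT b.SomeKey, x.SomeID
FROM   dbo.BigTable       AS b
JOIN   dbo.XmlSourceTable AS x
       ON x.SomeID = b.SomeID;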
_____________
Code for TallyGenerator
September 10, 2020 at 12:49 pm
My first big data project was a personalisation scoring model for 5 million customers against 250,000+ products and all their attribute combinations. That was 1.25 trillion data points, equating to a 37 TB dataset that required frequent refreshes.
We found that certain product attribute combinations couldn't exist and that those 5 million customers naturally grouped into 45 clusters. There's a world of difference between 45 and 5 million. This shrank the problem down to a size where the traditional technology we already had was good enough.
I'm not saying we couldn't have found that out without going down the Big Data route, but it was useful for demonstrating the realities of managing and applying Big Data techniques and technology to the problem. It was useful to show that costs increase exponentially as you move towards truly personalised information, while the revenue gain tails off long before you get there. The ROI for treating customers as clusters is far higher than for treating them as true individuals.
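For illustration, a hypothetical shape of the reduced problem, with scores held per (cluster, product) rather than per (customer, product); all names and types here are made up:

-- 45 clusters x 250,000 products = ~11.25 million score rows,
-- versus 5,000,000 customers x 250,000 products = 1.25 trillion.
CREATE TABLE dbo.CustomerCluster
(
    CustomerID int     NOT NULL PRIMARY KEY,
    ClusterID  tinyint NOT NULL   -- 45 clusters fit easily in a tinyint
);

CREATE TABLE dbo.ClusterProductScore
(
    ClusterID tinyint       NOT NULL,
    ProductID int           NOT NULL,
    Score     decimal(9, 4) NOT NULL,
    CONSTRAINT PK_ClusterProductScore PRIMARY KEY (ClusterID, ProductID)
);

-- A customer's "personalised" scores resolve through their cluster.
DECLARE @CustomerID int = 12345;

SELECT cps.ProductID, cps.Score
FROM   dbo.CustomerCluster     AS cc
JOIN   dbo.ClusterProductScore AS cps ON cps.ClusterID = cc.ClusterID
WHERE  cc.CustomerID = @CustomerID;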
September 10, 2020 at 1:36 pm
That's nothing.
How about two multi-million-row tables (one of them closer to a billion rows) joined on a value extracted from an XML column?
Yes, joined by XML_col.value('XML path', 'int').
The job running that task effectively forced the end of the day for the whole office. The state-of-the-art server (no kidding, a really enviable configuration many can only dream about) was effectively down for any other task.
It produced the record-breaking lowest PLE (page life expectancy) I ever saw: 6 ms.
Yeah... that same company I was talking about has some nightly runs that are killers, but even they don't have a 6 ms long-term PLE. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
September 11, 2020 at 7:55 pm
I've always defined "big data" as having two important characteristics: volume and velocity. Volume is sort of obvious; the sheer volume of data is more than we've had in the past. This is partly the nature of the problems we tackle today, but a lot of it is the fact that since it's cheap to store data, we store it!
Velocity is a measure of how rapidly the data changes once it is stored. If you ever did direct mail campaigns in the old days, you would have gotten a mailing list from one of several suppliers. Depending on the classification system the mailing list supplier used, you could order by ZIP Code, geography, and various subgroups, so you could do things like "doctors in Western Pennsylvania who specialize in heart surgery" to target your audience. The contract always included an agreement to buy back failed mailings, up to a particular limit. You agreed up front to accept X% undeliverable mail because the supplier knew he could never be up to date all the time. This principle of an expected error rate still holds today. Even if you have scrubbed your data, while you're waiting to get your mailing out some of your customers are going to move, change their names, or die.
I like to think of this as another version of the Heisenberg uncertainty principle. If I can get an exact summary of my data, it has to be static. If my data is actually changing, the best I can do is try to get a measurement of the amount of error in my information.
Please post DDL and follow ANSI/ISO standards when asking for help.