December 15, 2015 at 3:30 am
I recognize that cross posting is not generally considered good form, but this forum didn't exist yesterday. I'm adding my question here.
I could use some help from anyone with some data science in their background.
I'm trying to learn R. I have a simple idea for proving the usefulness of R to a DBA that I want to turn into a blog post. In SQL Server 2014, the new cardinality estimator started using exponential backoff in the density calculation for compound column keys. This is because in the most common cases you're looking at data that has relationships, not data that is completely unique. So, for example, a region and a salesperson are likely to be related in each sale, not a unique pairing.
But there is going to be exceptional data, so Microsoft provides a trace flag you can use to force the optimizer back to the old method of density * density.
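To make sure I have the arithmetic straight, here's roughly how I understand the difference between the two estimates. The exact exponent the backoff uses is my assumption, so treat this as illustrative only:

# Two columns, where density = 1 / number of distinct values
d_salesperson <- 1 / 100   # say, 100 distinct salespeople
d_region      <- 1 / 10    # say, 10 distinct regions
total_rows    <- 100000

# Old estimator: multiply the densities together
old_estimate <- total_rows * d_salesperson * d_region

# New estimator: exponential backoff -- as I understand it, the less
# selective density only contributes its square root
new_estimate <- total_rows * d_salesperson * sqrt(d_region)

c(old = old_estimate, new = new_estimate)   # 100 vs. roughly 316 rows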
However, wouldn't it be cool if there were a quick way to feed two columns from a query into an R function to see whether the data has the kind of relationship we're talking about between a region and a salesperson? Enter R.
I'm absolutely convinced this is an easy problem and I'm just missing the right formula. I've been looking at the Chi-Square Test of Independence and the Chi-Square Test of Homogeneity, but I don't think they're right. If I understand this (and I'm really stretching a bit), they're for two-way tables and the data I'm looking at is a one-way table.
My question: which set of formulas should I be using? Could anyone give me a nudge in the right direction, please?
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 16, 2015 at 9:11 am
Not sure I have enough info to answer this, but I have a cheat-sheet referenced on this page that might help, along with some ideas: https://buckwoody.wordpress.com/2015/12/02/knowing-which-statistical-formula-to-use/
Buck Woody
MCDBA, MCSE, Novell and Sun Certified
December 16, 2015 at 9:46 am
So, for example, a region and a salesperson are likely to be related, not a unique relationship, in each sale.
I think you're on the right track and just getting stuck on the terminology. If any of these questions is the question you're asking, a chi-squared test of independence is appropriate:
- Is there a relationship between region and salesperson?
- Given the name of the salesperson, could I make a better-than-chance guess at the region in which a sale happened? (and vice versa)
- Does the salesperson-region relationship in the data reflect the database's previous assumption (uniform distribution) or its new assumption (some kind of non-uniformity)?
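For what it's worth, here's a minimal sketch of what that test might look like in R. The file and column names are just placeholders for however you export the two columns, and note that table() is the step that turns your one-way list of rows into the two-way table the test wants:

# Two columns exported from SQL Server; file and column names are made up
sales <- read.csv("sales.csv", stringsAsFactors = TRUE)

# Cross-tabulate: counts of sales for each salesperson/region pair.
# This turns the raw rows into a two-way contingency table.
region_by_person <- table(sales$SalesPerson, sales$Region)

# Chi-squared test of independence: a small p-value suggests the two
# columns are related rather than independent.
chisq.test(region_by_person)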
I'm way out of my depth on all of this, so if any of this doesn't make sense, don't hesitate to let me know.
December 16, 2015 at 10:23 am
BuckWoody (12/16/2015)
Thanks Buck. That's already one of my go-to resources as I start figuring all this out. I appreciate what you're doing over there.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 16, 2015 at 10:26 am
After reading up on cardinality, another couple of questions occurred to me:
- Does the density/distribution of salesperson-region relationships have high cardinality (salespeople strongly associated with one or two regions), or low (salespeople working across many regions)?
- Does the density of those relationships better match the old optimizer assumption, or the new one (exponential)?
In that case, an alternative might be to plot three densities: your empirical density, the previously-assumed density (uniform?), and the new density (exponential?). There are a few examples here, but tons more on the web: https://heuristically.wordpress.com/2012/06/13/comparing-continuous-distributions-with-r/
After that, there are a whole slew of goodness-of-fit tests for checking whether your data fit a given distribution or not; off the top of my head, you could probably start with fitdistr() from the MASS package.
No clue how you'd go about generating all of those densities, though!
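That said, once you have some numbers in hand, the comparison itself isn't too bad. Something like this might be a starting point, with the caveat that the simulated data and the choice of distributions here are purely my assumptions:

# Simulated stand-in for the per-key counts you'd pull from your own table
library(MASS)

counts <- rexp(200, rate = 0.1)

# Plot the empirical density against a fitted exponential and a uniform
fit_exp <- fitdistr(counts, "exponential")

plot(density(counts), main = "Empirical vs. assumed densities")
curve(dexp(x, rate = fit_exp$estimate), add = TRUE, lty = 2)
curve(dunif(x, min = min(counts), max = max(counts)), add = TRUE, lty = 3)

# One possible goodness-of-fit check against the fitted exponential
ks.test(counts, "pexp", fit_exp$estimate)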
December 16, 2015 at 10:30 am
mattmparker (12/16/2015)
So, for example, a region and a salesperson are likely to be related, not a unique relationship, in each sale.
I think you're on the right track, and just getting stuck on the terminology. I think if any of these questions is the question you're asking, a chi-squared test of independence is appropriate:
- Is there a relationship between region and salesperson?
- Given the name of the salesperson, could I make a better-than-chance guess at the region in which a sale happened? (and vice versa)
- Does the salesperson-region relationship in the data reflect the database's previous assumption (uniform distribution) or its new assumption (some kind of non-uniformity)?
I'm way out of my depth on all of this, so if any of this doesn't make sense, don't hesitate to let me know.
Yeah, I'm looking to see if there is a relationship between region and salesperson, so it sounds like the chi-square test of independence is the way to go. I'm working on building a new data set for my tests. I'll post results and code back here when I get them in hand.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 16, 2015 at 10:35 am
mattmparker (12/16/2015)
After reading up on cardinality, another couple of questions occurred to me:
- Does the density/distribution of salesperson-region relationships have high cardinality (salespeople strongly associated with one or two regions), or low (salespeople working across many regions)?
- Does the density of those relationships better match the old optimizer assumption, or the new one (exponential)?
In that case, an alternative might be to plot three densities: your empirical density, the previously-assumed density (uniform?), and the new density (exponential?). There are a few examples here, but tons more on the web: https://heuristically.wordpress.com/2012/06/13/comparing-continuous-distributions-with-r/
After that, there are a whole slew of goodness-of-fit tests for checking whether your data fit a given distribution or not; off the top of my head, you could probably start with fitdistr() from the MASS package.
No clue how you'd go about generating all of those densities, though!
Excellent. Thanks. I'll explore this as well. Sorry the question mixed technologies so much. I'll see what I can do with that and report back.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 17, 2015 at 7:00 am
Thanks for all the help guys. I'm thinking that I'm going to back off and rethink my blog post. I can't get my head around what I need to do to properly prepare the data in order to run it through these algorithms. For the short term I think I'll concentrate on learning the basics of the language instead of immediately trying to apply it.
Thanks again for the assist. Sorry I'm letting you down.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 17, 2015 at 7:10 am
Not a problem at all. In fact, maybe I'll blog it - but not in the way you think. 🙂
Do keep asking questions - it helps us with our approach to educating people as much as it helps you get the answer you need. We need to learn where people are having issues so that we can teach that part first.
Buck Woody
MCDBA, MCSE, Novell and Sun Certified
December 18, 2015 at 10:49 pm
Grant Fritchey (12/17/2015)
Thanks for all the help guys. I'm thinking that I'm going to back off and rethink my blog post. I can't get my head around what I need to do to properly prepare the data in order to run it through these algorithms. For the short term I think I'll concentrate on learning the basics of the language instead of immediately trying to apply it. Thanks again for the assist. Sorry I'm letting you down.
Just food for thought.
Why not start with a data-adaptive approach: begin with the data, with little consideration of the model at first, and then, when you find useful predictors, fit the model to the data you have? Assume your boss hands you a data set; find out whether you can use what you have first, then move on to fine-tuning that data second and modeling third.
I say this because it's what I would do in Python. I would start with the data first, then load it into Python second. At a very high level, regardless of the algorithms, this gets me to a solid foundation for generating the data and streamlining it into the application.
From there, I would explore the data in something like Python Notebook (R has similar tools) in real time in my web browser and try to model the data. The end result will hopefully be a solid foundation for processing the data and outputting results that can be used to answer questions like, "How likely is it that a salesperson is related to a region, Mr. AngryDBA?"
I only say that because I think you can visualize the flow of what needs to happen. It's the algorithm that is keeping you from moving forward. Screw the algorithm: just output some data, load it into R, and start exploring what you have. I think once you start doing that, you will find ways to fit the model to the data and complete the objective you want to accomplish here.
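To put that in R terms, since that's where you're headed, the first pass really can be this small. The DSN, query, and column names below are placeholders, not anything from your environment:

# Pull the two columns straight out of SQL Server; RODBC is one option
library(RODBC)

conn  <- odbcConnect("MySqlServerDsn")
sales <- sqlQuery(conn, "SELECT SalesPerson, Region FROM dbo.Sales")
odbcClose(conn)

# First look at what came back, before worrying about any algorithm
str(sales)
table(sales$SalesPerson, sales$Region)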
December 19, 2015 at 4:55 am
Thanks for the feedback. I think the key is understanding not the data itself but what I need to do with the data; not the algorithm, but the preparation. I'm working on it.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 19, 2015 at 6:07 am
If only half the people I deal with made that statement, my day would be half as difficult. 🙂 You have the most important points in mind!
Buck Woody
MCDBA, MCSE, Novell and Sun Certified