Distributing the greatest value by rank in SQL

Question

Distributing the greatest value by rank in SQL

Viewing 11 posts - 31 through 40 (of 40 total)

You must be logged in to reply to this topic. Login to reply

Dwain Camps SSC Guru Points: 86978 More actions · Answer 1

SG,

BTW. In case you haven't, you should probably read this article by Jeff Moden: Solving the Running Total and Ordinal Rank Problems[/url]. Look for the section on triangular joins and note the similarity to what you proposed.

I'll pose a theorem to be disproved: Anything that can be solved with a triangular join can also be solved by a QU.

From memory, I think Jeff's CURSOR solution (in the linked article) ran faster than the triangular join, but I may be wrong. That same solution could also be applied here.

I ran a little demo for my guys here once, posing to them the Running Total problem. Sure enough, someone came up with a triangular join solution that worked. So then I created 25K rows of test data (coincidentally about the same size as the OP's for this problem) and QU ran in less than a second and the triangular join took more than a minute.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

sgmunson SSC Guru Points: 110640 More actions · Answer 2

dwain.c (12/18/2012)
sgmunson (12/18/2012)
Dwain,
A "Quirky Update" isn't going to work here, as the problem isn't just a matter of retrieving a value from a previous record. It's a matter of finding the maximum value for ALL previous records, so this kind of solution isn't going to do it. Unfortunately, there's really no well-performing way to go about this, and the performance is also going to be heavily dependent on how often the data has non-conforming values, and exactly where they occur. Given that the query I supplied runs in a minute and 40 seconds, there's not a lot of incentive to improve the query until the volume of input records gets larger - or at least large enough to cause a performance problem, anyway. We'd all love to have solutions that are blazingly fast, but sometimes, what's needed is just going to chunk a bit. It's the nature of the beast. That said, however, if you DO find something that runs faster, and that produces the same result, I'd love to see it. I'm always up to learn something new.
I disagree with you on two points:
#1
"A "Quirky Update" isn't going to work here, as the problem isn't just a matter of retrieving a value from a previous record. It's a matter of finding the maximum value for ALL previous records"
I think that's exactly what my QU is doing.
#2
"Given that the query I supplied runs in a minute and 40 seconds, there's not a lot of incentive to improve the query"
That incentive is always there. We should never be satisfied with something just because it is "good enough." Before you know it we forget about it and the record set grows until the query no longer scales to match and then we have users complaining that the system is slow.
Now I'm not saying my query is necessarily perfect. It is possible I've misunderstood the desired result in some way. I do believe this problem is a good fit for this solution approach. The QU isn't limited to the prior record. It can retrieve the information off any prior record, given sufficient matching criteria for it to identify exactly what information it needs to retrieve.

And allow me to disagree that you can find "sufficient matching criteria" for quirky update to work. It may be that you misunderstand the requirement, which I'll rephrase here for simplicity. You need to start a given group of records in the same industry and category with the value for AvgSales for the record with the lowest Emp_Rank value, and then never go below that value as you move further along the group in Emp_Rank order. Each time you encounter a higher value, that becomes the new value to not go below, until you reach the end of the group, at which point you start over. One of the problems I had with this was when I encountered a data record where for 2 or more consecutive records, the actual value for AvgSales went down, but as you have to never do that, I found that just getting the previous record's value is inadequate. Additionally, you have no way to know what order the AvgSales values will appear RELATIVE to the Emp_Rank values, and thus no clear relationship between the two. As there is no way to match a specific "previous" record, you must examine ALL such records, and I'm pretty sure QU isn't going to be able to do that, as you'd need at least as many variables as there are records in any given combination of industry and category, and the number needed would likely differ for each such group. How could you ever be sure which variable to assign to which record value?

If I understand the term "triangular join" correctly, that's sort of what my query does, and I can't say I've ever heard of a way for a QU to do that. But hey - Surprise me !!!

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

sgmunson SSC Guru Points: 110640 More actions · Answer 3

dwain.c (12/18/2012)
SG,
BTW. In case you haven't, you should probably read this article by Jeff Moden: Solving the Running Total and Ordinal Rank Problems[/url]. Look for the section on triangular joins and note the similarity to what you proposed.
I'll pose a theorem to be disproved: Anything that can be solved with a triangular join can also be solved by a QU.
From memory, I think Jeff's CURSOR solution (in the linked article) ran faster than the triangular join, but I may be wrong. That same solution could also be applied here.
I ran a little demo for my guys here once, posing to them the Running Total problem. Sure enough, someone came up with a triangular join solution that worked. So then I created 25K rows of test data (coincidentally about the same size as the OP's for this problem) and QU ran in less than a second and the triangular join took more than a minute.

I read that article quite some time ago, and have read similar material in books, and the one thing I think you're right about, is that the CURSOR likely outpeformed the triangular join. The problem here is having no choice but to have the database engine search every "previous" record in the group to find the largest AvgSales value. Thus you could say you have a running MAXIMUM problem, where the value needed is less than the maximum value in the group until the record that has the maximum value is reached, AND you have to have any intermediate values from the start of the group, which will differ in every group.

Honestly Dwain, I'd love to see you find a solution here, but I'm afraid I'll have to see it to believe it. My primary reason for holding this belief is that I need to know what algorithm you want to use to determine that MAX value AND when to apply it for each and every record, without doing the same thing the triangular join has to do.

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

Dwain Camps SSC Guru Points: 86978 More actions · Answer 4

sgmunson (12/18/2012)
...
You need to start a given group of records in the same industry and category with the value for AvgSales for the record with the lowest Emp_Rank value, and then never go below that value as you move further along the group in Emp_Rank order. Each time you encounter a higher value, that becomes the new value to not go below, until you reach the end of the group, at which point you start over.

Good explanation and that's what I understood. I believe if you look at my output results, that's what you're going to see.

sgmunson (12/18/2012)
One of the problems I had with this was when I encountered a data record where for 2 or more consecutive records, the actual value for AvgSales went down...

Yup. I had that problem too, which is fixed in my second submission.

sgmunson (12/18/2012)
If I understand the term "triangular join" correctly, that's sort of what my query does, and I can't say I've ever heard of a way for a QU to do that. But hey - Surprise me !!!

Then I suppose you haven't read the article I linked to for there you will see it.

It's OK to disagree agreeably. 😛

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Dwain Camps SSC Guru Points: 86978 More actions · Answer 5

sgmunson (12/18/2012)
Honestly Dwain, I'd love to see you find a solution here, but I'm afraid I'll have to see it to believe it. My primary reason for holding this belief is that I need to know what algorithm you want to use to determine that MAX value AND when to apply it for each and every record, without doing the same thing the triangular join has to do.

Sorry I didn't see your above post before I suggested you hadn't read Jeff's article as it appears you have.

As far as the algorithm I used, I'll be the first to admit it ain't pretty.

But if you examine the record set it produces (see the output results I posted earlier), I'd challenge you to find where it is wrong.

I believe that there may be more going on in the OPs actual UPDATE, something of which may violate one of the many rules of the QU (found in that article).

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

sgmunson SSC Guru Points: 110640 More actions · Answer 6

dwain.c (12/18/2012)
sgmunson (12/18/2012)
Honestly Dwain, I'd love to see you find a solution here, but I'm afraid I'll have to see it to believe it. My primary reason for holding this belief is that I need to know what algorithm you want to use to determine that MAX value AND when to apply it for each and every record, without doing the same thing the triangular join has to do.
Sorry I didn't see your above post before I suggested you hadn't read Jeff's article as it appears you have.
As far as the algorithm I used, I'll be the first to admit it ain't pretty.
But if you examine the record set it produces (see the output results I posted earlier), I'd challenge you to find where it is wrong.
I believe that there may be more going on in the OPs actual UPDATE, something of which may violate one of the many rules of the QU (found in that article).

If I recall correctly, one of those rules was the necessity of having a CLUSTERED INDEX in the order needed for the QU to process the records in, and it would have to process them in Industry, Category, Emp_Rank order, so without that being the case, the results would likely be dog chow. Unfortunately, as is all too often the case, if I see what appears to be a reasonably intelligent original poster, and they claim something doesn't work, I take their word for it whenever I just don't have the time to test another poster's query. Thus, I've been assuming it didn't work, and your posts convinced me you were "still working on it", and that may just have been a too quick reading on my part. Lots of "Bad Habits" here, perhaps, but born of necessity, as I have so little time for testing things like this. If I could generate a good sample of, say, 1/2 million to several million test records, I'd find a weekend and test your code. Got any easy way to generate sample data of that size for this specific exercise?

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

Dwain Camps SSC Guru Points: 86978 More actions · Answer 7

sgmunson (12/18/2012)
..
If I could generate a good sample of, say, 1/2 million to several million test records, I'd find a weekend and test your code. Got any easy way to generate sample data of that size for this specific exercise?

Of course I do! Try these two articles:

http://www.sqlservercentral.com/articles/Data+Generation/87901/

http://www.sqlservercentral.com/articles/Test+Data/88964/

Empowered by the information in these two articles it should take about 15 minutes to set up the data you need. Excluding reading time of course.

At least that's about how long it would take me anyway. Of course, I'm not the one that needs convincing that the QU will work.

And no worries about our exchange. I take it all in good form. Healthy skepticism is always a good thing.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

lmeinke SSCommitted Points: 1956 More actions · Answer 8

I appreciate the dialogue here and I am benefiting from it. Dwain, I re-ran your code on your small example and it worked this time....no idea why. It still, however, does not work on my table of only 26K records. It grabs the maximum value that occurs within a grouping (Industry/Category) and applies it to all records in that grouping. I am in meeting this morning but hope to get back to this this afternoon where I will double check and investigate more thoroughly.

Thank you,

Lonnie M

Dwain Camps SSC Guru Points: 86978 More actions · Answer 9

Credit to you Lonnie for being so thorough in your testing.

They say "never look a gift horse in the mouth," but I'd say on forum answers you always should.

Nothing like having code break down in Prod and when you're getting your reaming from the boss you tell him "I got it from some a...hole on the Internet." Does not go over real big... trust me.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Dwain Camps SSC Guru Points: 86978 More actions · Answer 10

SG - Another thought occurred to me on this question.

I would call this just another form of a "data smear" (a term used by Jeff Moden in this link: http://www.sqlservercentral.com/Forums/Topic1393805-392-1.aspx#bm1393812).

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

raym85 SSC Eights! Points: 938 More actions · Answer 11

Im a bit lost with the desired output. are you only comparing the first two rows? should the desired output in the 225 provided the industry and category are the same?