March 4, 2014 at 2:13 pm
Revenant (3/4/2014)
TomThomson (3/4/2014)
Revenant (3/4/2014)
Hugo Kornelis (3/4/2014)
For SELECT DISTINCT, allowing ORDER BY to operate on columns not in the SELECT list would produce erratic and unpredictable behaviour. Consider this table:
Col1 | Col2
-----+-----
1 | 1
2 | 2
3 | 1
Now what do you expect to get returned if you allow "SELECT DISTINCT Col2 FROM TheTableAbove ORDER BY Col1;"?
Maybe I do note get something, Hugo.
IMO no matter how you slice it -- meaning ORDER or other clauses --, Col2 still has only two distinct values, right?
Tes, but which comes first with order by Col1? The Col1 values where Col2 is 1 are 3 and 1, and the Col1 value where Col2 is 2 is 2; so does 1<2 rule or does 2<3 rule? Yes, one could define the effect - several different ways, each of which would probably be unacceptable to the majority of people, so it was best to say "you can't do it if you specify distinct". Without distinct it's easy - you return as many rows as there are in the table, ordered as specified. With distinct, you can't return all the rows because there are fewer distinct Col2 values than there are rows.
Hmm... I am not a computer scientist and even less a programming language designer; however, I see that there could be a new rule introduced that would say that if you are selecting distinct values based on a column which is not in the select list, you present the first value encountered in the (implicit) full SELECT.
Unfortunately, our rather small example does not make my point obvious, but please consider -
Col1 | Col2
-----+-----
1 | 2
2 | 3
3 | 1
4 | 2
That would give you 2,3,1 right?
I could live with this.
Edit: making the sequence more obvious, IMO
Yes, that's one of the many approaches that would work, and in my view it's one of the better ones because it doesn't require any colision resolution mechanism. But would it be accepted?
Some people would suggest computing the average position in the full list for each value, and ordering by that (perhaps falling back on your method to resolve collisions). Others might suggest looking at whether the number of value A - value B pairs in the full list is where the value A element is before the value B element is greater than or less than the number of pairs where the opposite is the case (again resolving collisions somehow). These are just two of the methods of getting a good representation of the "average" relative position. You can complicate them by weighting elements or pairs of elements depending on how distant from the middle of the unpruned list they are, or when working with pairs by how far apart the two components are in the unpruned list.
I remember a discussion about this particular ordering problem - not in the context of databases, something completely different, way back in the 60s, where the concensus ended up being that the best procedure would be to use a random (ideally not pseudo-random, but in practice true random wasn't really going to be possible) selection of which item with each given value to retain (that would be done by looking at the items in true random order and retaining the part specified for the distinct value and the part designated for ordering for the first item with each distinct value for the specified part, sort the retained sub-items including the (unwanted) ordering part, and then throw away the unwanted part. Like your suggested method, that doesn't require any collision resolution; of course it might be difficult to live with in a relational database context, as it's certainly not deterministic.
Tom
March 4, 2014 at 2:32 pm
TomThomson (3/4/2014)
Revenant (3/4/2014)
TomThomson (3/4/2014)
Revenant (3/4/2014)
Hugo Kornelis (3/4/2014)
For SELECT DISTINCT, allowing ORDER BY to operate on columns not in the SELECT list would produce erratic and unpredictable behaviour. Consider this table:
Col1 | Col2
-----+-----
1 | 1
2 | 2
3 | 1
Now what do you expect to get returned if you allow "SELECT DISTINCT Col2 FROM TheTableAbove ORDER BY Col1;"?
Maybe I do note get something, Hugo.
IMO no matter how you slice it -- meaning ORDER or other clauses --, Col2 still has only two distinct values, right?
Tes, but which comes first with order by Col1? The Col1 values where Col2 is 1 are 3 and 1, and the Col1 value where Col2 is 2 is 2; so does 1<2 rule or does 2<3 rule? Yes, one could define the effect - several different ways, each of which would probably be unacceptable to the majority of people, so it was best to say "you can't do it if you specify distinct". Without distinct it's easy - you return as many rows as there are in the table, ordered as specified. With distinct, you can't return all the rows because there are fewer distinct Col2 values than there are rows.
Hmm... I am not a computer scientist and even less a programming language designer; however, I see that there could be a new rule introduced that would say that if you are selecting distinct values based on a column which is not in the select list, you present the first value encountered in the (implicit) full SELECT.
Unfortunately, our rather small example does not make my point obvious, but please consider -
Col1 | Col2
-----+-----
1 | 2
2 | 3
3 | 1
4 | 2
That would give you 2,3,1 right?
I could live with this.
Edit: making the sequence more obvious, IMO
Yes, that's one of the many approaches that would work, and in my view it's one of the better ones because it doesn't require any colision resolution mechanism. But would it be accepted?
Some people would suggest computing the average position in the full list for each value, and ordering by that (perhaps falling back on your method to resolve collisions). Others might suggest looking at whether the number of value A - value B pairs in the full list is where the value A element is before the value B element is greater than or less than the number of pairs where the opposite is the case (again resolving collisions somehow). These are just two of the methods of getting a good representation of the "average" relative position. You can complicate them by weighting elements or pairs of elements depending on how distant from the middle of the unpruned list they are, or when working with pairs by how far apart the two components are in the unpruned list.
I remember a discussion about this particular ordering problem - not in the context of databases, something completely different, way back in the 60s, where the concensus ended up being that the best procedure would be to use a random (ideally not pseudo-random, but in practice true random wasn't really going to be possible) selection of which item with each given value to retain (that would be done by looking at the items in true random order and retaining the part specified for the distinct value and the part designated for ordering for the first item with each distinct value for the specified part, sort the retained sub-items including the (unwanted) ordering part, and then throw away the unwanted part. Like your suggested method, that doesn't require any collision resolution; of course it might be difficult to live with in a relational database context, as it's certainly not deterministic.
Tom, as I said, I have a stake in this - I would like to see these issues clarified -, but I am not the one to clarify them. HOWEVER, I think I can get an SIG going on it, with MSFT support, that would make a (seriously considered) amendment to SQL ANSI.
March 4, 2014 at 2:54 pm
Revenant (3/4/2014)
Tom, as I said, I have a stake in this - I would like to see these issues clarified -, but I am not the one to clarify them. HOWEVER, I think I can get an SIG going on it, with MSFT support, that would make a (seriously considered) amendment to SQL ANSI.
Frankly, I don't really see the point. It is already possible to do a distinct on some columns and sort on whatever aggregate you want of another column, by replacing the DISTINCT with the competely equivalent GROUP BY expression. I don't think that providing a slightly shorter way to express this for one specific aggregate is not enough of a benefit to even start on attempting to get the standard amended.
I also expect that the standardization committee will reject the proposed amendment for exactly these reasons.
March 4, 2014 at 3:03 pm
Hugo Kornelis (3/4/2014)
Revenant (3/4/2014)
Tom, as I said, I have a stake in this - I would like to see these issues clarified -, but I am not the one to clarify them. HOWEVER, I think I can get an SIG going on it, with MSFT support, that would make a (seriously considered) amendment to SQL ANSI.Frankly, I don't really see the point. It is already possible to do a distinct on some columns and sort on whatever aggregate you want of another column, by replacing the DISTINCT with the competely equivalent GROUP BY expression. I don't think that providing a slightly shorter way to express this for one specific aggregate is not enough of a benefit to even start on attempting to get the standard amended.
I also expect that the standardization committee will reject the proposed amendment for exactly these reasons.
Hugo, as old and as white haired I am, I would never predict what these committees will or will not accept. <no smiley>
March 5, 2014 at 3:05 am
Revenant (3/4/2014)
Hugo Kornelis (3/4/2014)
Revenant (3/4/2014)
Tom, as I said, I have a stake in this - I would like to see these issues clarified -, but I am not the one to clarify them. HOWEVER, I think I can get an SIG going on it, with MSFT support, that would make a (seriously considered) amendment to SQL ANSI.Frankly, I don't really see the point. It is already possible to do a distinct on some columns and sort on whatever aggregate you want of another column, by replacing the DISTINCT with the competely equivalent GROUP BY expression. I don't think that providing a slightly shorter way to express this for one specific aggregate is not enough of a benefit to even start on attempting to get the standard amended.
I also expect that the standardization committee will reject the proposed amendment for exactly these reasons.
Hugo, as old and as white haired I am, I would never predict what these committees will or will not accept. <no smiley>
I suspect you missed the point that when the column to be sorted on also contains duplicates the result is up to the optimiser, which means that it depends on things like available buffer space, processing power, whether a covering index is available that is already ordered on the desired column, and all sorts of other things that form nopart of the relational definition. It's no more deterministic than the thing I mentioned dating from the 60s (and Hugo's version using group by and min is equally non-deterministic). It will work, but I would be extremely surprised if, in a database context, it is useful. I agree of course that a standardisation committee is unpredictable (also often incompetent, but that's neither here nor there in this discussion) but sincerely hope that they would reject this on the grounds that Ockham's razor (aka the KISS principle) is an extremely useful tool in language design, and if not on those grounds then on the grounds that it doesn't actually do anything useful.
edit: typos
Tom
March 5, 2014 at 5:03 am
It's a BUG in SQL.
If I want to allocate an ALIAS to a table I want the alias to be applied and assumed consistently wherever I refer to that table. I don't want to have to consult my SQL Querying reference book (Itzik Ben-Gan ... available on Amazon) for the precedence of operations in the SQL statement for each line of code I write. Get a life. I desperately wanted to say "Who cares?" to that one, but deep down in my heart I thought ... "This is going to be more Microsoft stupidity!" ... so I was right after all! I could have CTRL-C & CTRL-V'd that into Mgt Studio and got it right ... and I could have been braggin' like chuff down in the canteen right now. What a missed opportunity. Well that's life. Why do I do this job?
March 5, 2014 at 3:58 pm
BTW, you get a different result on SQL Server 2000 - both queries return in the same order.
The only way to sort by the actual table column (in SQL Server 2000) is to make the SELECT column name different 🙁 .
Viewing 7 posts - 31 through 36 (of 36 total)
You must be logged in to reply to this topic. Login to reply