January 7, 2010 at 12:14 am
Hi,
We all know that DISTINCT clause will provide unique resultset (means removes the duplicate rows). I have a problem while using DISTINCT clause. Can anybody help me that how DISTINCT has been treated and executed by SQL Engine internally.
Lets see an example below:
DECLARE @temp TABLE(
id INT,
Name VARCHAR(50)
)
INSERT INTO @temp
SELECT 1, 'Nizam' UNION ALL
SELECT 2, 'Sandeep' UNION ALL
SELECT 3, 'Ravi' UNION ALL
SELECT 4, 'Sandeep'
In the above table, I have inserted data, where I have two rows having the same name (id = 2 and 4).
SELECT DISTINCT name FROM @temp
Once I execute the above SQL, will give me three distinct names, in this case.
Moreover I want these names as a comma delimited. So I have written the standard logic to add comma along with names like below.
Declare @x VARCHAR(50)
SELECT DISTINCT @x = COALESCE(@x + ',', '') + name
FROM @temp
SELECT @x
OUTPUT
--------
Sandeep
But problem is that I am getting only one (id = 4) name in the variable @x. If I remove DISTINCT clause from the above SQL, @x variable get populated by names but with duplicates.
OUTPUT
--------
Nizam,Sandeep,Ravi,Sandeep
Moreover I have a work around of the problem. Which gives me appropriate result using below SQL.
SET @x = NULL
SELECT @x = COALESCE(@x + ',', '') + x.name
FROM (SELECT DISTINCT name FROM @temp) x
SELECT @x
I want to know that when I was using DISTINCT clause along with COALESCE, I was getting single name. Though I have three unique names in the table.
Please explain that what is happening behind the scene and why I am getting only one name.
Thanks in Advance,
Nizam
January 7, 2010 at 1:31 am
*edit removed*
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 7, 2010 at 6:46 am
You could get the results without the overhead of distinct by checking for occurence of name in the string built thus far through a Case statement:
Declare @x VARCHAR(100)
SELECT @x = case
when @x is null then name
when patindex('%'+name+'%', @x) = 0 then COALESCE(@x + ',', '') + name
else @x
end
FROM @temp
SELECT 'Result ==> @x is: ''' + Coalesce(@x, 'Null value')+ ''''
Toni
January 7, 2010 at 6:56 am
Actually the Coalesce does not add anything since the first when clause handles a null '@x' variable already.
Toni
January 7, 2010 at 10:37 am
This works
DECLARE @temp TABLE(
id INT,
Name VARCHAR(50)
)
INSERT INTO @temp
SELECT 1, 'Nizam' UNION ALL
SELECT 2, 'Sandeep' UNION ALL
SELECT 3, 'Ravi' UNION ALL
SELECT 4, 'Sandeep'
Declare @x VARCHAR(50)
set @x = ''
SELECT @x = @x + name + ','
FROM @temp
group by name
SELECT @x
For better, quicker answers, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
January 7, 2010 at 10:42 am
Hi,
That way an extra comma will be placed in the end.
DECLARE @temp TABLE(
id INT,
Name VARCHAR(50)
)
INSERT INTO @temp
SELECT 1, 'Nizam' UNION ALL
SELECT 2, 'Sandeep' UNION ALL
SELECT 3, 'Ravi' UNION ALL
SELECT 4, 'Sandeep'
Declare @x VARCHAR(50)
set @x = null
SELECT @x = COALESCE(@x + ',', '') + name
FROM @temp
group by name
SELECT @x
José Cruz
January 7, 2010 at 11:18 pm
Thanks for your responses.
But my problem is not the getting data (output) efficiently, rather I want to know that when I was using DISTINCT clause along with COALESCE, I was getting single name. Though I have three unique names in the table.
Please explain that what is happening behind the scene and why I am getting only one name.
Something internal of the SQL engine (HOW).
Nizam.
January 7, 2010 at 11:23 pm
Thanks for your responses.
But my problem is not the getting data (output) efficiently, rather I want to know that when I was using DISTINCT clause along with COALESCE, I was getting single name. Though I have three unique names in the table.
Please explain that what is happening behind the scene and why I am getting only one name.
Something internal of the SQL engine (HOW).
Nizam.
January 7, 2010 at 11:24 pm
If i remeber correctly COALESCE returns the first not null value.
"Keep Trying"
January 8, 2010 at 9:47 am
mohd.nizamuddin (1/7/2010)
Thanks for your responses.But my problem is not the getting data (output) efficiently, rather I want to know that when I was using DISTINCT clause along with COALESCE, I was getting single name. Though I have three unique names in the table.
Please explain that what is happening behind the scene and why I am getting only one name.
Something internal of the SQL engine (HOW).
Nizam.
Actually, how it works or not behind the scenes is neither the problem nor the question here. You've simply told it take take a "Distinct" on the wrong thing and it returned the wrong thing. Your code attempts to take a DISTINCT on an answer that does not yet exist. Instead, you must take the DISTINCT on the source of the information used to create the answer... thusly...
DECLARE @temp TABLE
(
ID INT,
Name VARCHAR(50)
)
INSERT INTO @temp
SELECT 1, 'Robert' UNION ALL
SELECT 2, 'Sandy' UNION ALL
SELECT 3, 'Roger' UNION ALL
SELECT 4, 'Sandy'
DECLARE @x VARCHAR(50)
SELECT @x = COALESCE(@x + ',', '') + d.Name
FROM (SELECT DISTINCT Name FROM @temp) d
SELECT @x
--Jeff Moden
Change is inevitable... Change for the better is not.
January 8, 2010 at 11:14 am
The reason you get Sandeep as a result is because DISTINCT is applied to "COALESCE( @x + ',', '') + name" first - which generates an internal virtual table. Then the variable @x is set the the rows of the virtual table.
Basically SQL runs the following queries internally for your query
INSERT __InternalVirtualTable1( __col1 )
SELECT DISTINCT COALESCE(@x + ',', '') + name
FROM @temp
So __InternalVirtualTable1 will have the the values below
Nizam
Ravi
Sandeep
Then SQL runs
SELECT @x = __col1
FROM __InternalVirtualTable1
and @x will contain 'Sandeep' because the last row in __InternalVirtualTable1 contains 'Sandeep'.
Look into virtual tables if you want to know how sql processes a query internally.
January 10, 2010 at 9:33 pm
nadabadan (1/8/2010)
The reason you get Sandeep as a result is because DISTINCT.
.
.
Look into virtual tables if you want to know how sql processes a query internally.
Thanks a lot. I will look further to know more about. Really, it will help me to start with to know more inside SQL. 🙂
January 10, 2010 at 9:57 pm
Jeff Moden (1/8/2010)
Actually, how it works or not behind the scenes is neither the problem nor the question here. You've simply told it take take a "Distinct" on the wrong thing and it returned the wrong thing. Your code attempts to take a DISTINCT on an answer that does not yet exist. Instead, you must take the DISTINCT on the source of the information used to create the answer... thusly...
DECLARE @temp TABLE
(
ID INT,
Name VARCHAR(50)
)
INSERT INTO @temp
SELECT 1, 'Robert' UNION ALL
SELECT 2, 'Sandy' UNION ALL
SELECT 3, 'Roger' UNION ALL
SELECT 4, 'Sandy'
DECLARE @x VARCHAR(50)
SELECT @x = COALESCE(@x + ',', '') + d.Name
FROM (SELECT DISTINCT Name FROM @temp) d
SELECT @x
Thanks Jeff.
January 11, 2010 at 5:06 am
About the 'internals'...
The first thing to realize is that the trick of generating a comma separated string using multiple assignments takes advantage of an undocumented quirk in SQL Server. It is a quirk that has been around for many years, and has pretty much always worked...but if you're going to use it, you probably need to be aware of its internals.
First though, I'm going to mention a better (though still not perfect) alternative:
SELECT @x =
STUFF(
(
SELECT DISTINCT ',' + name
FROM @temp
ORDER BY name
FOR XML PATH (''), TYPE
).value('.[1]', 'varchar(max)')
,1,1, ''); -- Removes the leading comma
(This method is documented, faster, allows for explicit ordering, and does not rely on @x having a NULL value at the start).
Anyway, back to the multiple-assignment trick: It relies on the variable being updated for every row, with the updated value being available for the next row to use.
So what causes the unusual behaviour noted in your first post? SQL Server creates a plan to evaluate the SELECT that is designed to assign @x from itself once per row. The operator within which implements this operation is the Compute Scalar.
The Compute Scalar is responsible for evaluating the expression which is assigned to @x, though there is no guarantee of exactly when the evaluation or the assignment will take place - or even if they are to occur at the same point in the plan.
In the simple case, without the DISTINCT, there is only one Compute Scalar in the plan and it handles the assignment and the evaluation, and chooses to do the assignment to @x on a per-row basis and so all is well. There is still no absolute guarantee that @x will be assigned to once per row - it just happens to work that way in practice.
When we add a DISTINCT to the query, an extra operator appears in the plan (either a Sort or Hash operator to implement the grouping) as we might expect; however we also find an extra Compute Scalar which appears before the grouping!
This operator takes care of the evaluation of the expression involving @x. It evaluates this once per rowbut crucially does not assign the result to @x between rows. In fact, it can't do that yet because that would be logically wrong - we need to do the DISTINCT before assigning to @x. Instead, the per-row result of the expression is stored as an Defined Value result with a name like [Expr1003].
Since we don't assign to @x at this stage, this Compute Scalar finds that @x always has the value it had before the statement started - in this case NULL. So, every row that is processed finds that @x is NULL so [Expr1003] is set to the contents of the name column. No concatenation or adding of commas occurs for any row. This step results in four rows with the [Expr1003] set to Nizam, Sandeep, Ravi, and Sandeep - though not necessarily in any particular order.
Then we do the DISTINCT on these [Expr1003] values - which results in Nizam, Sandeep, and Ravi - again in no guaranteed order. Finally, the second Compute Scalar assigns to @x from [Expr1003]. The result of this is non-deterministic since it depends on the order of rows coming from the prior operator. It may, for example, set @x to Nizam, then set it to Ravi, and then set it to Sandeep. It might do it in some other order, but whatever happens, @x will be set to just one of the values at the end.
In practice, you may find that the order of rows is predictable for small sets so you always get one of the candidate names. This is just another quirk of the code - there are optimizations to deal with small numbers of rows for example which give the impression of a guaranteed order, but that's all it is - an impression of a guarantee.
For example, with a larger number of rows, SQL Server might choose to implement the DISTINCT using a Hash Group operator, or it might choose to use parallelism, either of which would change the order of the rows in unpredictable ways.
There is one last thing about Compute Scalar operators: the execution engine will generally defer their evaluation until the result is actually required by some other operator - so the evaluation may happen later than the operator's position in the plan would suggest.
So, why does SQL Server break the operation into two parts when we introduce the DISTINCT? It is the result of the application of a general rule - the optimizer tries to place operators like Compute Scalar and Filter as early in the plan as it can. This is an optimization that makes sense if you think about it: evaluating early will usually be good for later operators and may avoid extra work. The optimizer applies these sorts of optimizations wherever it can, so long as it results in a provably correct plan.
You may think that the plan isn't correct since it produces unexpected output; but SQL Server does not guarantee that these sorts of multiple assignments should work in any particular way...
Jeff's solution, by the way, works by changing the query so that it is not safe for the optimizer to split the evaluation and assignment operations into two Compute Scalars. I have yet to come across a case where I was unable to rewrite the query to enable this technique to work - but I'll leave it up to the interested reader to decide for themselves which way to go with this 😉
Paul
January 12, 2010 at 3:25 am
Thank you Paul, for your detailed explanation. Really this help me a lot to know something extra about the SQL.
Viewing 15 posts - 1 through 15 (of 15 total)
You must be logged in to reply to this topic. Login to reply