April 11, 2009 at 5:00 pm
Hi I need some help in programing the below scenario, can Some one help me.
Source Table Structure
Resource_Id Skill_Id
101 sqlserver
101 vb.net
101 oracle
102 sqlserver
102 java
102 excel
103 vb.net
103 java
103 oracle
I need the out put in the below format, I tried some methods , but they are taking more than expected time , Performance is the key factor .
Final Output
101 sqlserver vb.net oracle
102 sqlserver ---- ---- java excel
103 ---- vb.net oracle java
April 11, 2009 at 6:24 pm
See if these are quick enough for you. The use of FOR XML is the fastest concatenation technique I've come across, but this sort of string manipulation is often better accomplished by the calling application outside SQL.
The second, CrossTab, technique can be modified to produce the output you wanted; I just thought it was easier to read this way.
declare @skills table (Resource_Id int, Skill_Id varchar(20))
insert into @skills
select 101, 'sqlserver' union all
select 101, 'vb.net' union all
select 101, 'oracle' union all
select 102, 'sqlserver' union all
select 102, 'java' union all
select 102, 'excel' union all
select 103, 'vb.net' union all
select 103, 'java' union all
select 103, 'oracle'
--
select * from @skills s1
-- Concatenated Format
set statistics time on;
SELECT Resource_Id,stuff((SELECT ',' + Skill_Id
FROM @skills s2
WHERE s2.Resource_Id= s1.resource_ID -- must match GROUP BY below
ORDER BY Skill_Id
FOR XML PATH('')
),1,1,'') as [Skills]
FROM @skills s1
GROUP BY s1.Resource_Id -- without GROUP BY multiple rows are returned
ORDER BY s1.Resource_Id
set statistics time off;
-- CrossTab Format
SELECT Resource_Id
,MAX(case when skill_id = 'Excel' then 'Yes' else '' end) as Excel
,MAX(case when skill_id = 'Java' then 'Yes' else '' end) as Java
,MAX(case when skill_id = 'Oracle' then 'Yes' else '' end) as Oracle
,MAX(case when skill_id = 'SQLServer' then 'Yes' else '' end) as SQLServer
,MAX(case when skill_id = 'VB.Net' then 'Yes' else '' end) as [VB.Net]
FROM @skills
Group by Resource_Id
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 11, 2009 at 9:27 pm
Bob Hovious (4/11/2009)
See if these are quick enough for you. The use of FOR XML is the fastest concatenation technique I've come across, but this sort of string manipulation is often better accomplished by the calling application outside SQL.
Like everything else, it depends.
Test setup... (code window doubled up the word "ADD" and will need repair when you copy it)
USE TempDB
GO
--===== Create and populate a 1,000,000 row test table.
-- Column "RowNum" has a range of 1 to 100,000 unique numbers
-- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers
-- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
-- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers
-- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times
-- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
-- for all rows.
-- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
SELECT TOP 1000000
SomeID = IDENTITY(INT,1,1),
SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,
SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),
SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),
SomeHex12 = RIGHT(NEWID(),12)
INTO dbo.JBMTest
FROM Master.dbo.SysColumns t1
CROSS JOIN
Master.dbo.SysColumns t2
GO
--===== Add a Clustered PK
ALTER TABLE dbo.JBMTest
ADD PRIMARY KEY CLUSTERED (SomeID)
GO
--===== Add an index to support the concatenation
CREATE INDEX IX_JBMTest_SomeInt ON dbo.JBMTest (SomeInt,SomeLetters2)
GO
--===== Create a function that uses VARCHAR(8000)
CREATE FUNCTION dbo.Concat8KTest
(@SomeInt INT)
RETURNS VARCHAR(8000)
AS
BEGIN
DECLARE @ReturnString VARCHAR(8000)
SELECT @ReturnString = ISNULL(@ReturnString+',' ,'') + SomeLetters2
FROM dbo.JBMTest
WHERE SomeInt = @SomeInt
ORDER BY SomeLetters2
RETURN @ReturnString
END
GO
--===== Create a function that uses VARCHAR(MAX)
CREATE FUNCTION dbo.ConcatMAXTest
(@SomeInt INT)
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @ReturnString VARCHAR(MAX)
SELECT @ReturnString = ISNULL(@ReturnString+',' ,'') + SomeLetters2
FROM dbo.JBMTest
WHERE SomeInt = @SomeInt
ORDER BY SomeLetters2
RETURN @ReturnString
END
GO
Test 8k, MAX, and XML methods...
--===== Test the 8k solution
SELECT SomeInt,dbo.ConcatTest(SomeInt)
FROM dbo.JBMTest
GROUP BY SomeInt
ORDER BY SomeInt
GO
--===== Test the MAX solution
SELECT SomeInt,dbo.ConcatTest(SomeInt)
FROM dbo.JBMTest
GROUP BY SomeInt
ORDER BY SomeInt
GO
--===== Test the XML solution
SELECT t1.SomeInt,STUFF((SELECT ',' + t2.SomeLetters2
FROM dbo.JBMTest t2
WHERE t2.SomeInt = t1.SomeInt --- must match GROUP BY below
ORDER BY t2.SomeLetters2
FOR XML PATH(''))
,1,1,'') AS SomeLetters2
FROM dbo.JBMTest t1
GROUP BY t1.SomeInt --- without GROUP BY multiple rows are returned
ORDER BY t1.SomeInt
GO
Peformance and resource results...
Passing only 50,000 rows to the app is a heck of a lot cheaper than clogging the pipe with a million rows for the app to do it. And, it looks to me like the XML method is not only slower for total duration, but it's almost twice as hard on the CPU as a good ol' fashioned concatenation function.
Of course, for any of these methods, you need the correct index if you don't want to wait a week for it to finish.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 12, 2009 at 11:25 am
Jeff you shoulda been a lawyer :-D. Wasn't it you asking why people needed to do this on another thread? I'm in total agreement that passing summary data takes a huge load off the system. But I thought I was on the other side of this dispute the last time you schooled me.
Yes it depends and, given your index, the simple concatenation function works faster. But, what if the index doesn't exist in the OPs environment? The XML concatenation still runs in under a minute for your million row table, and I gave up on the concatenation function after it ran for over 5. Also, I couldn't duplicate your cpu numbers. On my machine, the function with Max took over 12,000 ms while the XML ran in the 3300-3900 range, and the 8k function ran around 3100-3300. (The XML was the clear loser on the elapsed time though due to I/O, taking almost twice as long as the concatenation functions, The functions could get the job done with a single scan of the nonclustered index and a fraction of the logical reads.)
I have a question about the nonclustered index. You are depending on it for the sequencing of the someletters combinations, so you avoid having to do an ORDER BY anywhere but on the someint column. But I thought you or Gail said that you couldn't count on anything but a clustered index or an ORDER BY clause to maintain sequence. Did I misunderstand?
Finally, since it depends and given that the OP's question concerned a very specific set of values, couldn't we have also used this approach? It results in CPU and elapsed times under 1850 ms, with the nonclustered index in place.
set statistics time on;
;with CrossTab as (
SELECT someInt
,MAX(case when someLetters2 = 'AA' then ',AA' else '' end) as AA
,MAX(case when someletters2 = 'BB' then ',BB' else '' end) as BB
,MAX(case when someletters2 = 'CC' then ',CC' else '' end) as CC
,MAX(case when someletters2 = 'DD' then ',DD' else '' end) as DD
,MAX(case when someletters2 = 'EE' then ',EE' else '' end) as EE
FROM JBMTest
Group by someInt
)
select isnull(STUFF(AA+BB+CC+DD+EE,1,1,''),'')
from CrossTab
set statistics time off;
Thanks again for furthering my education! 🙂
Bob
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 12, 2009 at 8:46 pm
Bob Hovious (4/12/2009)
Jeff you shoulda been a lawyer :-D. Wasn't it you asking why people needed to do this on another thread? I'm in total agreement that passing summary data takes a huge load off the system. But I thought I was on the other side of this dispute the last time you schooled me.
Heh... I guess you're right about that. The problem with concatenation is that it's not really summary data no matter how you handle it. Bloody concatenation can be justified either way depending on whether you're IO bound or CPU bound. My normal outrage about the subject is that it's very costly either way and I really want to know why any given OP wants to do it to begin with. Either you're gonna send all of the data with a bunch of CrLf's or all of the data with a bunch of commas. Both the server and the client are going to have to work at processing it and the pipe will suffer one way or the other. I pretty much hate the idea of concatenation no matter where it's done.
Yes it depends and, given your index, the simple concatenation function works faster. But, what if the index doesn't exist in the OPs environment? The XML concatenation still runs in under a minute for your million row table, and I gave up on the concatenation function after it ran for over 5. Also, I couldn't duplicate your cpu numbers. On my machine, the function with Max took over 12,000 ms while the XML ran in the 3300-3900 range, and the 8k function ran around 3100-3300. (The XML was the clear loser on the elapsed time though due to I/O, taking almost twice as long as the concatenation functions, The functions could get the job done with a single scan of the nonclustered index and a fraction of the logical reads.)
That's the other reason I don't care for concatenation. I don't understand all of the reasons for it, but the process varies greatly from machine to machine.
So far as the possibility of the missing index goes, if I had a million row table and considering that the fastest method returns in a whopping minute without it, you can be darned sure I'd bust a hump or a DBA's head to ensure the index was placed. 🙂 It's all part of performance tuning, etc.
I have a question about the nonclustered index. You are depending on it for the sequencing of the someletters combinations, so you avoid having to do an ORDER BY anywhere but on the someint column. But I thought you or Gail said that you couldn't count on anything but a clustered index or an ORDER BY clause to maintain sequence. Did I misunderstand?
Nope... I don't see where I left out any ORDER BY's for SomeLetters2... both functions contain it and so does the XML. And, yes, contrary to popular belief, it's not safe to depend on a clustered index as an ORDER BY in Selects... just the "quirky update". That's part of the reason why I'm rewriting the "Running Total" article.
Finally, since it depends and given that the OP's question concerned a very specific set of values, couldn't we have also used this approach? It results in CPU and elapsed times under 1850 ms, with the nonclustered index in place.
Absolutely. But what happens if someone adds another "category" to the table? The code has no warning that it's not picking everything up. And people who say another category will never be added always live long enough to find out they're eventually wrong. 😉
Because of my rants on performance and scalability, many think that speed is my only goal. They don't realize that accuracy and data integrity is always my first goal. I'd gladly have something take 3-5 seconds on a million row table and never have to worry about it producing the wrong answer if a new category is added than to run in half that time with the eventual potential for a wrong answer.
To wit, my mantra is "Make it work, make it fast, make it pretty... and it ain't done 'til it's pretty."
Thanks again for furthering my education! 🙂
The feeling is more than mutual. Whenever someone like yourself asks such good questions, I always have to go back and re-examine the reasons for any convictions I may have. Normally, if a conviction cannot be backed up with code, it's not worth being a conviction.
I sure do hope I answered your questions and concerns on this one. Concatenation is always difficult to defend no matter how you do it.
Thanks for the feedback, Bob.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 13, 2009 at 7:20 am
Sorry about raising the ORDER BY issue. I threw that in while looking at the tests and just had a brain-short not thinking about the functions themselves.
You're absolutely right about the problem with the crossTab being what to do when the set of possible values increases. The only play there is to do it in dynamic SQL. Hopefully, there would be a table somewhere that limits the valid codes that could be entered, or at least an index for a quick SELECT DISTINCT. Possibly a trigger could be written to rebuild a stored procedure or table-valued function whenever the content of that table changes. (I am sometimes amazed that the dba's and security people let me write code that writes code. Haven't they ever heard of Skynet?)
The only justification I can give you for concatenation is that when people are looking at data, it allows a lot more to fit on a page where they can visually scan it. Every time I've had a request for it, it's been in the context of a report of some kind. I am still keeping the FOR XML technique in the toolbox, but it is immediately followed by your code from above, because with the proper index, it's hard to beat.
Jeff, thanks for taking the time to listen to my questions and follow up. I invariably learn something worthwhile from the exchange. It's a pleasure to be able to ask questions and discuss competing considerations without it being all about egos. Now I have to go look at Flo's CLR parsing.
(P.S. Florian is a guy's name.) 😛
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 13, 2009 at 7:36 am
Bob Hovious (4/12/2009)
But I thought you or Gail said that you couldn't count on anything but a clustered index or an ORDER BY clause to maintain sequence.
Wouldn't have been me. I tell people that you can't depend on anything other than an ORDER BY to maintain sequence. The update is a side-effect of how the engine currently works, but there's no guarantee that someone won't find another edge case next week where it breaks.
The only time the optimiser will care about the order rows are returned in is when there's an ORDER BY on the query.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
April 13, 2009 at 7:57 am
Thanks, Gail. I was sure about the ORDER BY, but not the clustered index. I couldn't find any of the threads where either was discussed.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 13, 2009 at 8:08 am
Jeff,
wow..its simply great to see the way you are thinking on the questions and answers...
April 13, 2009 at 8:33 am
I think Nirmal might need something like this
declare @skills table (Resource_Id int, Skill_Id varchar(20))
insert into @skills
select 101, 'sqlserver' union all
select 101, 'vb.net' union all
select 101, 'oracle' union all
select 102, 'sqlserver' union all
select 102, 'java' union all
select 102, 'excel' union all
select 103, 'vb.net' union all
select 103, 'java' union all
select 103, 'oracle'
SELECT Resource_Id, ISNULL(sqlserver, '----'), ISNULL([vb.net], '----'), ISNULL(oracle,'----'), ISNULL(java, '----'), ISNULL(excel, '----')
FROM @skills
PIVOT(MAX(Skill_Id) FOR Skill_Id
IN(sqlserver,[vb.net],oracle,java,excel)) AS P;
If you look carefully at his sample output the data is not just concatenated but sorted as well and when a skill is missing you have four dashes. Of course this solution is going to work with a finite number of skills only. The output generated by the code above is not exactly matching the sample output because there are trailing dashes for skills further down the list.
PS. Bob , thanks for the insert code....
---------------------------------------------
[font="Verdana"]Nothing is impossible.
It is just a matter of time and money.[/font]
April 13, 2009 at 8:43 am
VK... my observation is that there is a world of valuable information to be learned from the people who hang around SSC. I shamelessly take every opportunity to pick their brains.
Jacek0... you're welcome, I think.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 13, 2009 at 9:36 am
Jeff, in your posted test harness, you need to rename the functions in the final tests. It has "SELECT SomeInt, dbo.ConcatTest...", but it needs to be Concat8kTest and ConcatMaxTest, per the create scripts. No big deal, but you might want to edit the post accordingly.
Also, I got very different CPU and total time readings than you on those. Average for the 8k UDF was 5300 milliseconds CPU, 9200 total; Max UDF was 8500 CPU, 13000 total; XML inline was 2000 CPU, 2000 total.
The XML version's IO stats are out the roof, with 154k logical reads, compared to 1985 for each of the other two, but that didn't slow it down apparently. On my system, anyway.
Of course, since the test data is pretty much random, the actual number of reads and scans will vary, as it depends on the number of unique SomeInt values and how many rows each of those has in the table.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 13, 2009 at 9:52 am
Just to test it a bit further:
CREATE FUNCTION dbo.ConcatXMLTest (@SomeInt_in INT)
RETURNS VARCHAR(8000)
AS BEGIN
RETURN (SELECT
STUFF((SELECT
',' + t2.SomeLetters2
FROM
dbo.JBMTest t2
WHERE
t2.SomeInt = @SomeInt_in
ORDER BY
t2.SomeLetters2
FOR
XML PATH(''),type).value('.[1]', 'varchar(MAX)'), 1, 1, ''))
END ;
SELECT
SomeInt,
dbo.ConcatXMLTest(SomeInt)
FROM
dbo.JBMTest
GROUP BY
SomeInt
ORDER BY
SomeInt ;
That got the scan count down to 1 and the reads to 1985, but took the CPU up to over 10 seconds and total to nearly 14 seconds (up from 2 for each for the inline version).
Adding "type" and the value function to the inline XML query increased its run-time to 3800 milliseconds, but allows the string values to contain characters that are XML-specific, like "&", which will otherwise cause it to error out.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 13, 2009 at 9:55 am
Just for one more test, I took the type and value functions out of the XML version of the UDF, and that took it down to 6.5 seconds CPU and 9.5 seconds total. Same scan and read stats either way, of course.
That's slightly worse than the 8k string function, and better than the Max string function, so I modified the XML UDF one more time to make it return varchar(max), and that took it up to 7 seconds CPU, 10 seconds total. No effect on IO. That's better than the Max string function still.
With both the type and value functions, and returning varchar(max), the XML UDF went up to nearly 11 seconds CPU and nearly 14 seconds total, which makes it the worst version of them all!
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 13, 2009 at 11:19 am
Gus, I rewrote your function as an inline table valued function and it gets the job done in half the time of the ConcatMaxTest function, but still three times longer than the Concat8KTest function.
There is just no beating the IO advantage of scanning the nonclustered index a single time.
ALTER FUNCTION [dbo].[itvfConcatXML]
(
@someInt int
)
RETURNS TABLE
AS
RETURN
(
SELECT STUFF((SELECT ',' + t2.SomeLetters2
FROM dbo.JBMTest t2
WHERE t2.SomeInt = @SomeInt
ORDER BY t2.SomeLetters2
FOR XML PATH(''),type).value('.[1]', 'varchar(MAX)'), 1, 1, '') as conCatString
)
I used a CTE to feed it a distinct list.
;with someints as (select distinct someint from jbmtest)
SELECT t1.SomeInt,f.ConCatString
FROM someints t1
cross apply dbo.itvfConcatXML(t1.someint) f
ORDER BY t1.SomeInt
Table 'JBMTest'. Scan count 50001, logical reads 154059, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 6864 ms, elapsed time = 6901 ms.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
Viewing 15 posts - 1 through 15 (of 20 total)
You must be logged in to reply to this topic. Login to reply