April 18, 2013 at 1:15 am
Divya Agrawal (4/18/2013)
Great post!! Thanks for sharing, it is really very useful.
Thank you very much, Divya!
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
April 18, 2013 at 1:19 am
Dwain, Jeff - you guys crack me up! As if I didn't have enough pressure from the escaped dust bunnies 😀
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
April 18, 2013 at 4:48 am
If a CSV column has a varying number of commas in it, how would cCA help? In the examples provided we can split a CSV column with a fixed length or a known number of commas, but how can we use cCA on a variable CSV string? Let me know your views.
--Divya
April 18, 2013 at 5:15 am
Divya Agrawal (4/18/2013)
If a CSV column has a varying number of commas in it, how would cCA help? In the examples provided we can split a CSV column with a fixed length or a known number of commas, but how can we use cCA on a variable CSV string? Let me know your views.
Hi Divya, thank you for reading and taking an interest in my article.
cCA is of limited use for string splitting because of its (lack of) efficiency in resolving more than 5 or 6 elements. A varying number of elements isn't a problem for either DS8K or cCA. If you have a specific test case, how about posting it up and we'll work through it?
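For anyone following along, here's a minimal sketch of the cascading CROSS APPLY pattern for a three-element string (the article's own examples may differ slightly); each extra element needs another APPLY block and another SUBSTRING, which is why the pattern gets unwieldy past five or six elements:
-- Cascading CROSS APPLY split, three elements (sketch).
-- Each APPLY finds the next delimiter with CHARINDEX.
SELECT
    Element1 = SUBSTRING(d.MyString, 1, p1.Pos - 1),
    Element2 = SUBSTRING(d.MyString, p1.Pos + 1, p2.Pos - p1.Pos - 1),
    Element3 = SUBSTRING(d.MyString, p2.Pos + 1, 8000)
FROM (SELECT MyString = 'abc,def,ghi') d
CROSS APPLY (SELECT Pos = CHARINDEX(',', d.MyString)) p1
CROSS APPLY (SELECT Pos = CHARINDEX(',', d.MyString, p1.Pos + 1)) p2;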
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
April 19, 2013 at 6:12 am
DECLARE @T TABLE (auditlogid int, userid int, auditlog varchar(max))
INSERT into @T
select 1,1,'ProgramCompletionStatus - ProgramNotCompleted, '
union all
select 1,1,'InputValidationStatus - Success, IsValidUserID - Yes, IsValidProgramID - Yes, IsProgramEnabled - Yes, IsProgramAssignedToUser - Yes, IsUserProgramEnabled - Yes, PPC_ProgramPlatformCompatibilityStatus - Valid, PPC_IPLMappingFound - No, PPC_AccessLevel - FullAccess, '
union all
select 1,1,'ProcessStatus - Begin, TimeStamp - 2/28/2013 1:39:47 AM, '
union all
select 1,1,'ProcessStatus - Begin, TimeStamp - 2/28/2013 1:40:11 AM, GUDFS_ModuleName - GetUserDataFromSession, GUDFS_HasSession - True, PreviousPageNo - 16, '
I have an audit table storing auditlog in csv string. I want to split that.
--Divya
April 22, 2013 at 1:25 am
Divya Agrawal (4/19/2013)
DECLARE @T TABLE (auditlogid int, userid int, auditlog varchar(max))
INSERT into @T
select 1,1,'ProgramCompletionStatus - ProgramNotCompleted, '
union all
select 1,1,'InputValidationStatus - Success, IsValidUserID - Yes, IsValidProgramID - Yes, IsProgramEnabled - Yes, IsProgramAssignedToUser - Yes, IsUserProgramEnabled - Yes, PPC_ProgramPlatformCompatibilityStatus - Valid, PPC_IPLMappingFound - No, PPC_AccessLevel - FullAccess, '
union all
select 1,1,'ProcessStatus - Begin, TimeStamp - 2/28/2013 1:39:47 AM, '
union all
select 1,1,'ProcessStatus - Begin, TimeStamp - 2/28/2013 1:40:11 AM, GUDFS_ModuleName - GetUserDataFromSession, GUDFS_HasSession - True, PreviousPageNo - 16, '
I have an audit table storing auditlog in csv string. I want to split that.
With nine (or more) elements in your strings:
SELECT
d.RowID,
d.Something,
s.*
FROM (
SELECT 1,1,'
ProgramCompletionStatus - ProgramNotCompleted, ' UNION ALL
SELECT 2,1,'
InputValidationStatus - Success,
IsValidUserID - Yes,
IsValidProgramID - Yes,
IsProgramEnabled - Yes,
IsProgramAssignedToUser - Yes,
IsUserProgramEnabled - Yes,
PPC_ProgramPlatformCompatibilityStatus - Valid,
PPC_IPLMappingFound - No,
PPC_AccessLevel - FullAccess, ' UNION ALL
SELECT 3,1,'
ProcessStatus - Begin,
TimeStamp - 2/28/2013 1:39:47 AM, ' UNION ALL
SELECT 4,1,'
ProcessStatus - Begin,
TimeStamp - 2/28/2013 1:40:11 AM,
GUDFS_ModuleName - GetUserDataFromSession,
GUDFS_HasSession - True, PreviousPageNo - 16, '
) d (RowID, Something, MyString)
CROSS APPLY dbo.DelimitedSplit8K(d.MyString,',') s
I'd go for DelimitedSplit8K. cCA is quicker only if there are fewer than five or six elements. If you don't already have the DelimitedSplit8K function, there's a link to it in the article.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
April 22, 2013 at 6:53 am
Jeff Moden (4/16/2013)
ChrisM@Work (4/16/2013)
Jeff Moden (4/16/2013)
Awesome job, Chris. Well done!
Thanks Jeff! And thank you for the encouragement to put pen to paper. DC's been nagging me for ages. Once you stepped in there was no getting out of it 😀
Heh... I just knew what you were capable of but, ultimately, my opinion doesn't matter. The proof of what you're capable of is now officially documented. More than 7K reads in the first 2 days and 5 stars on the very first try. I say again, well done, ol' friend!
And now over 10,000 views! Congratulations again Chris!
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
April 22, 2013 at 11:42 pm
ChrisM@Work (4/22/2013)
Divya Agrawal (4/19/2013)
DECLARE @T TABLE (auditlogid int, userid int, auditlog varchar(max))
INSERT into @T
select 1,1,'ProgramCompletionStatus - ProgramNotCompleted, '
union all
select 1,1,'InputValidationStatus - Success, IsValidUserID - Yes, IsValidProgramID - Yes, IsProgramEnabled - Yes, IsProgramAssignedToUser - Yes, IsUserProgramEnabled - Yes, PPC_ProgramPlatformCompatibilityStatus - Valid, PPC_IPLMappingFound - No, PPC_AccessLevel - FullAccess, '
union all
select 1,1,'ProcessStatus - Begin, TimeStamp - 2/28/2013 1:39:47 AM, '
union all
select 1,1,'ProcessStatus - Begin, TimeStamp - 2/28/2013 1:40:11 AM, GUDFS_ModuleName - GetUserDataFromSession, GUDFS_HasSession - True, PreviousPageNo - 16, '
I have an audit table storing auditlog in csv string. I want to split that.
With nine (or more) elements in your strings:
SELECT
d.RowID,
d.Something,
s.*
FROM (
SELECT 1,1,'
ProgramCompletionStatus - ProgramNotCompleted, ' UNION ALL
SELECT 2,1,'
InputValidationStatus - Success,
IsValidUserID - Yes,
IsValidProgramID - Yes,
IsProgramEnabled - Yes,
IsProgramAssignedToUser - Yes,
IsUserProgramEnabled - Yes,
PPC_ProgramPlatformCompatibilityStatus - Valid,
PPC_IPLMappingFound - No,
PPC_AccessLevel - FullAccess, ' UNION ALL
SELECT 3,1,'
ProcessStatus - Begin,
TimeStamp - 2/28/2013 1:39:47 AM, ' UNION ALL
SELECT 4,1,'
ProcessStatus - Begin,
TimeStamp - 2/28/2013 1:40:11 AM,
GUDFS_ModuleName - GetUserDataFromSession,
GUDFS_HasSession - True, PreviousPageNo - 16, '
) d (RowID, Something, MyString)
CROSS APPLY dbo.DelimitedSplit8K(d.MyString,',') s
I'd go for DelimitedSplit8K. cCA is quicker only if there are fewer than five or six elements. If you don't already have the DelimitedSplit8K function, there's a link to it in the article.
Chris,
I think it's better to use XML instead of the DelimitedSplit8K function. Here is my post on it: http://www.sqlservercentral.com/articles/XML/66932/
Let me know your views.
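For readers who haven't seen the article, the usual shape of the XML method is something like this sketch (the exact code in the linked article may differ): swap each delimiter for closing and opening tags, CAST to xml, then shred with nodes():
-- XML-based split (sketch): delimiters become tags, nodes() shreds them.
DECLARE @s varchar(8000);
SET @s = 'abc,def,ghi';

SELECT Item = x.i.value('.', 'varchar(8000)')
FROM (SELECT CAST('<r>' + REPLACE(@s, ',', '</r><r>') + '</r>' AS xml) AS doc) d
CROSS APPLY d.doc.nodes('/r') AS x(i);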
--Divya
April 23, 2013 at 1:21 am
Divya Agrawal (4/22/2013)
ChrisM@Work (4/22/2013)
Divya Agrawal (4/19/2013)
DECLARE @T TABLE (auditlogid int, userid int, auditlog varchar(max))
INSERT into @T
select 1,1,'ProgramCompletionStatus - ProgramNotCompleted, '
union all
select 1,1,'InputValidationStatus - Success, IsValidUserID - Yes, IsValidProgramID - Yes, IsProgramEnabled - Yes, IsProgramAssignedToUser - Yes, IsUserProgramEnabled - Yes, PPC_ProgramPlatformCompatibilityStatus - Valid, PPC_IPLMappingFound - No, PPC_AccessLevel - FullAccess, '
union all
select 1,1,'ProcessStatus - Begin, TimeStamp - 2/28/2013 1:39:47 AM, '
union all
select 1,1,'ProcessStatus - Begin, TimeStamp - 2/28/2013 1:40:11 AM, GUDFS_ModuleName - GetUserDataFromSession, GUDFS_HasSession - True, PreviousPageNo - 16, '
I have an audit table storing auditlog in csv string. I want to split that.
With nine (or more) elements in your strings:
SELECT
d.RowID,
d.Something,
s.*
FROM (
SELECT 1,1,'
ProgramCompletionStatus - ProgramNotCompleted, ' UNION ALL
SELECT 2,1,'
InputValidationStatus - Success,
IsValidUserID - Yes,
IsValidProgramID - Yes,
IsProgramEnabled - Yes,
IsProgramAssignedToUser - Yes,
IsUserProgramEnabled - Yes,
PPC_ProgramPlatformCompatibilityStatus - Valid,
PPC_IPLMappingFound - No,
PPC_AccessLevel - FullAccess, ' UNION ALL
SELECT 3,1,'
ProcessStatus - Begin,
TimeStamp - 2/28/2013 1:39:47 AM, ' UNION ALL
SELECT 4,1,'
ProcessStatus - Begin,
TimeStamp - 2/28/2013 1:40:11 AM,
GUDFS_ModuleName - GetUserDataFromSession,
GUDFS_HasSession - True, PreviousPageNo - 16, '
) d (RowID, Something, MyString)
CROSS APPLY dbo.DelimitedSplit8K(d.MyString,',') s
I'd go for DelimitedSplit8K. cCA is quicker only if there are fewer than five or six elements. If you don't already have the DelimitedSplit8K function, there's a link to it in the article.
Chris,
I think it's better to use XML instead of the DelimitedSplit8K function. Here is my post on it: http://www.sqlservercentral.com/articles/XML/66932/
Let me know your views.
I like your article Divya; it's tightly written with concise examples, exactly what folks are looking for. The method looks promising too, very simple to use. However, XML string splitting suffers from two major flaws: poor performance, and sensitivity to embedded markup characters. For those two reasons I don't use it, nor do I advocate that others should, because those flaws impose severe restrictions on the method which don't apply to DelimitedSplit8K.
If you still feel that "it's better to use XML", I'd be interested to see your comparison code.
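To illustrate the markup-sensitivity flaw, here's a quick sketch using the usual REPLACE-and-CAST approach (your article's exact code may behave differently):
-- A perfectly legal CSV value kills the naive XML conversion,
-- because '&' starts an entity reference in XML.
DECLARE @s varchar(8000);
SET @s = 'AT&T,Proctor & Gamble';

SELECT CAST('<r>' + REPLACE(@s, ',', '</r><r>') + '</r>' AS xml);
-- Fails with an XML parsing error; DelimitedSplit8K handles it fine.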
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
April 23, 2013 at 11:38 am
Chris, thank you for this article. I have been trying to figure out a solution to a problem I've had for a while, and could only come up with workarounds. I have a huge table full of wage data, and asked for help several times on forums, but didn't get what I was looking for. Anyway, here is a mock-up of the problem:
DECLARE @WageData TABLE (ID char(9), Wage money, Sector varchar(25))
INSERT INTO @WageData
SELECT '000000001', 25.46, 'Plumbing'
UNION ALL
SELECT '000000001', 42.67, 'Electric'
UNION ALL
SELECT '000000002', 58.25, 'Food Service'
UNION ALL
SELECT '000000002', 26.58, 'Legal'
UNION ALL
SELECT '000000003', 0.00, 'Mining';
To get a sum of wages for a given time, SUM and GROUP BY will do, but if we want to know the sector of primary employment (most wages), I could only come up with this:
WITH cte AS
(
SELECT ID, SUM(Wage) over (partition by ID) TotalWage, Sector
, ROW_NUMBER() over (partition by ID order by Wage DESC) rn
FROM @WageData
)
SELECT ID, TotalWage, Sector
FROM cte
WHERE rn = 1
This is slow, and when it's used in a view, an index can't be added to it. Then, after reading your article, I came up with this:
SELECT w.ID, Wage, Sector, x.tw
FROM @WageData w
cross apply (select ID, MAX(wage) mw, SUM(wage) tw from @WageData group by ID) x
WHERE w.ID = x.ID AND w.Wage = x.mw
Well, I put it in a function and compared it to the CTE version. The new function took 1 second to run; the old method clocked in at 7 minutes, 29 seconds. No contest.
I'm telling you this to say thank you. One disadvantage, though: how do you approximate a LEFT JOIN when using a table function? Also, I'm willing to bet you can improve on my code. And finally, do people really need to split delimited strings that much?
April 23, 2013 at 5:14 pm
Divya Agrawal (4/22/2013)
Chris, I think it's better to use XML instead of the DelimitedSplit8K function. Here is my post on it: http://www.sqlservercentral.com/articles/XML/66932/
Let me know your views.
Divya,
First, that's a nice clear article with some good graphics put in just the right places.
You do, however, claim "I have found a much optimized and shorter way of splitting any string based on the delimiter passed." While I agree that it's short code, what about it is optimized?
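A minimal way to back up such a claim might look like this sketch; the dbo.CsvTest table and its Csv column are hypothetical stand-ins, and you'd run each splitter against the same realistic data and compare CPU time, elapsed time, and reads:
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Method under test; dbo.CsvTest and its Csv column are hypothetical.
SELECT s.ItemNumber, s.Item
FROM dbo.CsvTest t
CROSS APPLY dbo.DelimitedSplit8K(t.Csv, ',') s;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;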
--Jeff Moden
Change is inevitable... Change for the better is not.
April 24, 2013 at 3:58 am
Amy.G (4/23/2013)
Chris, thank you for this article. I have been trying to figure out a solution to a problem I've had for a while, and could only come up with workarounds. I have a huge table full of wage data, and asked for help several times on forums, but didn't get what I was looking for. Anyway, here is a mock-up of the problem:
DECLARE @WageData TABLE (ID char(9), Wage money, Sector varchar(25))
INSERT INTO @WageData
SELECT '000000001', 25.46, 'Plumbing'
UNION ALL
SELECT '000000001', 42.67, 'Electric'
UNION ALL
SELECT '000000002', 58.25, 'Food Service'
UNION ALL
SELECT '000000002', 26.58, 'Legal'
UNION ALL
SELECT '000000003', 0.00, 'Mining';
To get a sum of wages for a given time, SUM and GROUP BY will do, but if we want to know the sector of primary employment (most wages), I could only come up with this:
WITH cte AS
(
SELECT ID, SUM(Wage) over (partition by ID) TotalWage, Sector
, ROW_NUMBER() over (partition by ID order by Wage DESC) rn
FROM @WageData
)
SELECT ID, TotalWage, Sector
FROM cte
WHERE rn = 1
This is slow, and when it's used in a view, an index can't be added to it. Then, after reading your article, I came up with this:
SELECT w.ID, Wage, Sector, x.tw
FROM @WageData w
cross apply (select ID, MAX(wage) mw, SUM(wage) tw from @WageData group by ID) x
WHERE w.ID = x.ID AND w.Wage = x.mw
Well, I put it in a function and compared it to the CTE version. The new function took 1 second to run; the old method clocked in at 7 minutes, 29 seconds. No contest.
I'm telling you this to say thank you. One disadvantage, though: how do you approximate a LEFT JOIN when using a table function? Also, I'm willing to bet you can improve on my code. And finally, do people really need to split delimited strings that much?
Hi Amy, thank you for taking the time to read my article and for your kind comments too.
Let's have a look at those queries. I've changed the sample data to a temp table because table variables have limitations which I don't want to impose on the test.
I've added a non-unique clustered index on ID which seems reasonable, and a non-unique ordinary index on Wage to support aggregating wage by ID (the ordinary index will contain ID because it's the clustering key). With these indexes in place, I jigged around a little with the original CTE and wrote two slight modifications of your new query:
DROP TABLE #WageData
CREATE TABLE #WageData (ID char(9), Wage money, Sector varchar(25))
INSERT INTO #WageData
SELECT '000000001', 25.46, 'Plumbing' UNION ALL
SELECT '000000001', 42.67, 'Electric' UNION ALL
SELECT '000000002', 58.25, 'Food Service' UNION ALL
SELECT '000000002', 26.58, 'Legal' UNION ALL
SELECT '000000003', 0.00, 'Mining';
CREATE CLUSTERED INDEX cx_ID ON #WageData (ID);
CREATE INDEX ix_Wage ON #WageData (Wage ASC);
GO
-- Q1 CTE
;WITH cte AS
(
SELECT
ID,
Wage,
Sector,
tw = SUM(Wage) over (partition by ID),
rn = ROW_NUMBER() over (partition by ID order by Wage DESC) -- expensive sort
FROM #WageData
)
SELECT ID, Wage, Sector, tw
FROM cte
WHERE rn = 1;
GO
-- Q2 inner join:
SELECT w.ID, w.Wage, w.Sector, x.tw
FROM #WageData w
INNER JOIN (
SELECT ID, MAX(wage) mw, SUM(wage) tw
FROM #WageData
GROUP BY ID
) x
ON w.ID = x.ID AND w.Wage = x.mw;
GO
-- Q3 amy's original:
SELECT w.ID, w.Wage, w.Sector, x.tw
FROM #WageData w
CROSS APPLY (
SELECT ID, MAX(wage) mw, SUM(wage) tw
FROM #WageData
GROUP BY ID
) x
WHERE w.ID = x.ID AND w.Wage = x.mw;
GO
-- Q4 correlated:
SELECT w.ID, w.Wage, w.Sector, x.tw
FROM #WageData w
CROSS APPLY (
SELECT MAX(wage) mw, SUM(wage) tw
FROM #WageData wi
WHERE wi.ID = w.ID
GROUP BY wi.ID
) x
WHERE w.Wage = x.mw;
GO
-- Q5 without redundant GROUP BY:
SELECT w.ID, w.Wage, w.Sector, x.tw
FROM #WageData w
CROSS APPLY (
SELECT MAX(wage) mw, SUM(wage) tw
FROM #WageData wi
WHERE wi.ID = w.ID) x
WHERE w.Wage = x.mw;
GO
Looking at the execution plans, Q1 uses an expensive sort operator (77% of total cost) for the ROW_NUMBER. The Q2, Q3 and Q4 plans are identical to each other: they don't require the expensive sort operator of Q1, and the filtering of rows to leave those with MAX(wage) occurs earlier in the plan. Data is read from the table only once; it's spooled to feed the aggregation. The Q5 plan is much simpler graphically: rather than spooling the data and replaying it for the aggregate, it's seeked in the table by ID. The "query cost relative to the batch" of Q5 is oddly reported as twice that of Q2, Q3 and Q4, so I had a quick peek at the read stats using Profiler, which reported that Q5 requires about a third of the number of reads of Q2, Q3 and Q4. Q1 was similar to the others, indicating that your original observed run time of seven minutes or so might be improved by tweaking your indexes.
Q4 is a slight modification of your CROSS APPLY query. Rather than writing the filter in the WHERE clause, I've correlated the inner query to the outer query (the left hand side) on ID. Aggregating by ID, and correlating on ID, ensures that rows on the left hand side will be matched to the correct aggregated values, and also that only one row will be generated on the right for each row on the left.
Q5 takes this one step further. If your set (a nested loop iteration of the inner query) contains only one ID, then the GROUP BY ID is redundant - take it out. I'm stuck with SQL Server 2005 here and it can't tell that Q4 and Q5 are logically the same :hehe:
It would be interesting to see the actual execution plans of all five queries run against your actual data.
How do you approximate a LEFT JOIN when using a table function?
Use OUTER APPLY. Paul White's two excellent papers on APPLY are linked in my sig.
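Here's a quick sketch against the #WageData table from earlier (the > 50 filter is arbitrary, just so some IDs come back empty). CROSS APPLY drops left-hand rows when the applied query returns nothing; OUTER APPLY keeps them with NULLs, just as a LEFT JOIN would:
SELECT w.ID, w.Wage, w.Sector, x.Wage AS HighWage   -- NULL where nothing qualifies
FROM #WageData w
OUTER APPLY (
    SELECT TOP 1 wi.Wage
    FROM #WageData wi
    WHERE wi.ID = w.ID AND wi.Wage > 50
    ORDER BY wi.Wage DESC
) x;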
do people really need to split delimited strings that much?
In the real world, no, not really. It comes up often on the forums because it's tricky if you don't know how. In 6 months I've used DS8K in a couple of sprocs for address splitting, nothing more.
I'm telling you this to say thank you.
You're very welcome, as JBM would say "Thanks for stopping by". If there's anything you are still unsure about, just ask.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
June 12, 2013 at 3:03 pm
Chris, you've probably forgotten all about this, but after some other projects finished, I finally got back to this. I ran into a problem I hadn't thought of: there are people with two different jobs who make the same amount of money in each. I thought these might be duplicates at first, but there are people who work two full-time jobs (=480 hours a quarter) and earn the same wage at two different places. Not a lot, but the CTE with ROW_NUMBER() still pulls just one record per ID, whereas the CROSS APPLY versions will return two records if the wage is the same for both. I'm not sure if continuing to work on this will yield results or if I'm wasting my time and should stick with the CTE.
Any suggestions?
Thank you, Amy
June 13, 2013 at 1:05 am
Amy.G (6/12/2013)
Chris, you've probably forgotten all about this, but after some other projects finished, I finally got back to this. I ran into a problem I hadn't thought of: there are people with two different jobs who make the same amount of money in each. I thought these might be duplicates at first, but there are people who work two full-time jobs (=480 hours a quarter) and earn the same wage at two different places. Not a lot, but the CTE with ROW_NUMBER() still pulls just one record per ID, whereas the CROSS APPLY versions will return two records if the wage is the same for both. I'm not sure if continuing to work on this will yield results or if I'm wasting my time and should stick with the CTE.
Any suggestions?
Thank you, Amy
Hi Amy
If you can knock up some sample data I'll look into it.
Cheers
ChrisM
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
June 13, 2013 at 10:21 am
Thank you Chris. I've just added another person to the temp table. Person #4 makes exactly $45.00 working at two different jobs. Using ROW_NUMBER, we get one row per person, but in the four other solutions that match based on MAX(wage), #4 shows up twice, since both jobs are the max wage. The total wages for this quarter for all employees will now be $90.00 too high.
In case I haven't mentioned it, the goal of this is to create a view that can be indexed and run faster. That is why I'm not satisfied with the CTE using ROW_NUMBER. Thank you for your help; it's greatly appreciated.
DROP TABLE #WageData
CREATE TABLE #WageData (ID char(9), Wage money, Sector varchar(25))
INSERT INTO #WageData
SELECT '000000001', 25.46, 'Plumbing' UNION ALL
SELECT '000000001', 42.67, 'Electric' UNION ALL
SELECT '000000002', 58.25, 'Food Service' UNION ALL
SELECT '000000002', 26.58, 'Legal' UNION ALL
SELECT '000000003', 0.00, 'Mining' UNION ALL
SELECT '000000004', 45.00, 'Retail' UNION ALL
SELECT '000000004', 45.00, 'Lodging';
CREATE CLUSTERED INDEX cx_ID ON #WageData (ID);
CREATE INDEX ix_Wage ON #WageData (Wage ASC);
GO
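For what it's worth, one possible direction (a sketch only, untested beyond this sample data) is to force a single row per ID with TOP (1) and a deterministic tie-breaker inside the APPLY; Sector is chosen arbitrarily here as the tie-breaker, so person #4 can no longer appear twice:
SELECT w.ID, x.Wage, x.Sector, t.tw
FROM (SELECT DISTINCT ID FROM #WageData) w
CROSS APPLY (
    SELECT TOP (1) wi.Wage, wi.Sector
    FROM #WageData wi
    WHERE wi.ID = w.ID
    ORDER BY wi.Wage DESC, wi.Sector   -- arbitrary tie-breaker
) x
CROSS APPLY (
    SELECT tw = SUM(wj.Wage)
    FROM #WageData wj
    WHERE wj.ID = w.ID
) t;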