Skipping group with FOR XML Path('')

  • I have a table with a text field and a grouping number.

    1, A

    1, B

    1, C

    1, D

    1, E

    2, A

    2, C

    2, D

    2, E

    3, A

    3, B

    3, C

    3, F

    4, A

    4, B

    4, D

    4, F

    4, G

    I assemble the text fields into strings by group, using the FOR XML PATH('') function, so that I end up with

    1, A-B-C-D-E

    2, A-C-D-E

    3, A-B-C-F

    4, A-B-D-F-G

    This all works fine. I now have a requirement to leave out some groups, based on whether the GROUP does or does not contain a certain text. For instance, if I want only GROUPS with 'E', my result set should be

    1, A-B-C-D-E

    2, A-C-D-E

    If I want only those GROUPS that DO NOT have 'E', my result set should be

    3, A-B-C-F

    4, A-B-D-F-G

    I can (and have) put a condition on the assembled string, but it seems to me that it should be possible to make the cut earlier. That is, rather than assembling the string, scanning it with a Like '%E%' condition and discarding what doesn't meet my needs, I would like to stop scanning the incoming GROUP as soon as a mismatch is detected. If I want all groups that will not contain an 'A', for instance, it is pointless for the query engine to read in 'A', 'B', 'C', 'D', 'E', put them all together into a string, compare it with a wildcard and throw it away. As soon as it sees the 'A', it has enough information to completely skip over the rest of that group. The abbreviated examples here are trivial, but the real code is not.

    All I've been able to dream up so far results in skipping the RECORD that contains the 'A', but not the entire group.

  • Something like this?

    SELECT *

    INTO #Test

    FROM (Values

    (1, 'A'),

    (1, 'B'),

    (1, 'C'),

    (1, 'D'),

    (1, 'E'),

    (2, 'A'),

    (2, 'C'),

    (2, 'D'),

    (2, 'E'),

    (3, 'A'),

    (3, 'B'),

    (3, 'C'),

    (3, 'F'),

    (4, 'A'),

    (4, 'B'),

    (4, 'D'),

    (4, 'F'),

    (4, 'G'))x(id, value);

    SELECT id,

    STUFF((SELECT '-' + value FROM #Test it WHERE t.id = it.id

    ORDER BY value

    FOR XML PATH('')), 1, 1, '')

    FROM #Test t

    WHERE id IN (SELECT id FROM #Test WHERE value = 'E')

    GROUP BY id;

    DROP TABLE #Test;

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Or like this:

    declare @TestTable table(

    Id int,

    GroupCode char(1)

    );

    insert into @TestTable

    values (1,'A'),

    (1,'B'),

    (1,'C'),

    (1,'D'),

    (1,'E'),

    (2,'A'),

    (2,'C'),

    (2,'D'),

    (2,'E'),

    (3,'A'),

    (3,'B'),

    (3,'C'),

    (3,'F'),

    (4,'A'),

    (4,'B'),

    (4,'D'),

    (4,'F'),

    (4,'G');

    with BaseIds as (

    select distinct tt.Id from @TestTable tt where not exists(select 1 from @TestTable tt1 where tt.Id = tt1.Id and tt1.GroupCode in ('E'))

    )

    select bi.Id, stuff((select '-' + tt.GroupCode from @TestTable tt where bi.Id = tt.Id order by tt.GroupCode for xml path(''),TYPE).value('.','varchar(max)'),1,1,'')

    from BaseIds bi;

  • That looks promising, but I'm getting an error:

    Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.

    Sorry, I meant your code works, but I'm getting an error when I try to apply the method to my query. Working on it...

  • pdanes (5/18/2015)


    That looks promising, but I'm getting an error:

    Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.

    Sorry, I meant your code works, but I'm getting an error when I try to apply the method to my query. Working on it...

    Which query and how are you running it?

  • Both solutions run fine. If you're getting an error, it would be because you changed something and we don't know what it is.

    You need to review your code for any subqueries that might return more than one value when they shouldn't.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Well, here's the actual query, if you want to take a stab at it, but I haven't assembled all the necessary TSQL to give you the entire data environment on which I'm running this thing. I figured since your code works as is, you correctly grasped my problem and offered me a functional solution - the least I can do is put some effort into making it work before I start whining.

    SELECT VLTO.SkupinaAutoID, cast(Replace(Replace(Replace(Replace(Replace(Replace(Replace(STUFF

    ((SELECT ',' + TC.BPAuthorAbbreviation

    FROM dbo.SouhrnyLidi VLTI INNER JOIN

    dbo.TableOfCloveks TC ON VLTI.ClovekAutoID = TC.ClovekAutoID

    WHERE VLTI.SkupinaAutoID = VLTO.SkupinaAutoID), 1, 1, ''), '&', '&'), '(,', '('), ',&,', ' & '), ',),', ') '), ',ex,', ' ex '), ',in,', ' in '), ',,,',

    ', ')as varchar(200)) AuthorString

    FROM dbo.SouhrnyLidi VLTO

    where VLTO.SkupinaAutoID in (select SkupinaAutoID from dbo.SouhrnyLidi where SkupinaAutoID = 1)

    GROUP BY VLTO.SkupinaAutoID

  • Got it - fat-fingered an edit when I copied out some parts of the query for my original post. Works great - many thanks, Luis. Complex execution plan, that's going to take bit of studying, but it works, and seems to be addressing tables in the order I expect.

    Appreciate the help.

  • Another way:

    SELECT *

    INTO #Test

    FROM (Values

    (1, 'A'),

    (1, 'B'),

    (1, 'C'),

    (1, 'D'),

    (1, 'E'),

    (2, 'A'),

    (2, 'C'),

    (2, 'D'),

    (2, 'E'),

    (3, 'A'),

    (3, 'B'),

    (3, 'C'),

    (3, 'F'),

    (4, 'A'),

    (4, 'B'),

    (4, 'D'),

    (4, 'F'),

    (4, 'G'))x(id, value);

    SELECT id

    ,value=STUFF(

    (

    SELECT '-' + b.value

    FROM #Test b

    WHERE a.id = b.ID

    ORDER BY b.value

    FOR XML PATH('')

    ), 1, 1, '')

    FROM #Test a

    GROUP BY id

    HAVING COUNT(CASE WHEN value = 'E' THEN 1 END) > 0;

    GO

    DROP TABLE #Test;

    To include only IDs that do not contain E then make the equality test in the HAVING an inequality.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (5/18/2015)


    Another way:

    SELECT *

    INTO #Test

    FROM (Values

    (1, 'A'),

    (1, 'B'),

    (1, 'C'),

    (1, 'D'),

    (1, 'E'),

    (2, 'A'),

    (2, 'C'),

    (2, 'D'),

    (2, 'E'),

    (3, 'A'),

    (3, 'B'),

    (3, 'C'),

    (3, 'F'),

    (4, 'A'),

    (4, 'B'),

    (4, 'D'),

    (4, 'F'),

    (4, 'G'))x(id, value);

    SELECT id

    ,value=STUFF(

    (

    SELECT '-' + b.value

    FROM #Test b

    WHERE a.id = b.ID

    ORDER BY b.value

    FOR XML PATH('')

    ), 1, 1, '')

    FROM #Test a

    GROUP BY id

    HAVING COUNT(CASE WHEN value = 'E' THEN 1 END) > 0;

    GO

    DROP TABLE #Test;

    To include only IDs that do not contain E then make the equality test in the HAVING an inequality.

    That would work, but using the HAVING clause performs the test AFTER the grouping is done, doesn't it? I'm trying to get the exclusion done as early as possible, to avoid shoveling around records that I already know I don't want.

  • pdanes (5/19/2015)


    dwain.c (5/18/2015)


    Another way:

    SELECT *

    INTO #Test

    FROM (Values

    (1, 'A'),

    (1, 'B'),

    (1, 'C'),

    (1, 'D'),

    (1, 'E'),

    (2, 'A'),

    (2, 'C'),

    (2, 'D'),

    (2, 'E'),

    (3, 'A'),

    (3, 'B'),

    (3, 'C'),

    (3, 'F'),

    (4, 'A'),

    (4, 'B'),

    (4, 'D'),

    (4, 'F'),

    (4, 'G'))x(id, value);

    SELECT id

    ,value=STUFF(

    (

    SELECT '-' + b.value

    FROM #Test b

    WHERE a.id = b.ID

    ORDER BY b.value

    FOR XML PATH('')

    ), 1, 1, '')

    FROM #Test a

    GROUP BY id

    HAVING COUNT(CASE WHEN value = 'E' THEN 1 END) > 0;

    GO

    DROP TABLE #Test;

    To include only IDs that do not contain E then make the equality test in the HAVING an inequality.

    That would work, but using the HAVING clause performs the test AFTER the grouping is done, doesn't it? I'm trying to get the exclusion done as early as possible, to avoid shoveling around records that I already know I don't want.

    It would probably be best for you to test the performance of each of the submitted queries for your case before drawing conclusions.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • That was a question, not a conclusion. Notice the question mark at the end of the first sentence. The second was simply a restatement of what I'm trying to do, which is make the query as efficient as possible. To that end, I would expect that discarding rows in a WHERE clause should be more efficient than first grouping them, then discarding them.

    In any case, I did try it. Record counts are only in the tens of thousands, and the tables are indexed on the appropriate fields, so measuring performance is a lost cause - the response is instant in both cases. However, I ran both versions in SSMS with execution plans enabled, and it generates the EXACT same plan in both versions, which beats me how that works out. The ESTIMATED execution plans differ somewhat, but the actuals are identical, down to the estimated number of rows in every intermediate step.

    Another thing has occurred to me to try, and that is a filtered view with an index and included fields for the source, rather than straight from the table. The criteria on which I select (the 'E' in my example) do not change for the various queries that are doing this, so that might be yet another effective tweak.

    I'll give it a try sometime in the next few days, and post back here if I discover anything interesting.

  • pdanes (5/19/2015)


    That was a question, not a conclusion. Notice the question mark at the end of the first sentence.

    Yes I did, but I'm not in the habit of answering directly questions for which I am not 100% sure of the answer.

    pdanes (5/19/2015)


    In any case, I did try it. Record counts are only in the tens of thousands, and the tables are indexed on the appropriate fields, so measuring performance is a lost cause - the response is instant in both cases. However, I ran both versions in SSMS with execution plans enabled, and it generates the EXACT same plan in both versions, which beats me how that works out. The ESTIMATED execution plans differ somewhat, but the actuals are identical, down to the estimated number of rows in every intermediate step.

    Part of the issue I had with your question is in this paragraph. You say "indexed on the appropriate fields" but unfortunately that is a rather subjective statement. An example of DDL with your Indexing might have drawn a more direct response.

    As to the execution plans being the same, that strikes me as a bit odd. I am not saying they cannot be the same, but there are various things happening in each of the queries that suggests to me that they shouldn't be.

    pdanes (5/19/2015)


    Another thing has occurred to me to try, and that is a filtered view with an index and included fields for the source, rather than straight from the table. The criteria on which I select (the 'E' in my example) do not change for the various queries that are doing this, so that might be yet another effective tweak.

    You could try that. I have to wonder why if you've already got a query that is running "instantaneously."

    Be that as it may, if I have time in the next couple of hours, I may take a page out of this article and run each of the queries through their paces and post back results.

    The One Million Row T-SQL Test Harness[/url]


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • For example, here's a proper test harness that covers 3 indexing scenarios (there may be others):

    CREATE TABLE #Test

    (

    id INT NOT NULL

    ,value VARCHAR(5) NOT NULL

    );

    SET NOCOUNT ON;

    WITH Tally (n) AS

    (

    -- 10,000 rows

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)

    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)

    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)

    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d(n)

    )

    INSERT INTO #Test

    SELECT n, c

    -- 10,000 IDs

    FROM Tally a

    -- 25 rows to each ID

    CROSS APPLY

    (

    SELECT SUBSTRING('ABCDFGHIJKLMNOPQRSTUVWXYZ', n, 1)

    FROM Tally

    WHERE n <= 25

    ) b (c)

    UNION ALL

    SELECT n, 'E'

    -- E in only the even IDs

    FROM Tally a

    WHERE n % 2 = 0;

    DECLARE @ID INT, @Value VARCHAR(5);

    PRINT '+++ Dwain''s solution (heap):';

    SET STATISTICS TIME ON;

    SELECT @ID=id

    ,@Value=STUFF(

    (

    SELECT '-' + b.value

    FROM #Test b

    WHERE a.id = b.ID

    ORDER BY b.value

    FOR XML PATH('')

    ), 1, 1, '')

    FROM #Test a

    GROUP BY id

    HAVING COUNT(CASE WHEN value = 'E' THEN 1 END) > 0;

    SET STATISTICS TIME OFF;

    PRINT '+++ Luis''s solution (heap):';

    SET STATISTICS TIME ON;

    SELECT @ID=id,

    @Value=STUFF((SELECT '-' + value FROM #Test it WHERE t.id = it.id

    ORDER BY value

    FOR XML PATH('')), 1, 1, '')

    FROM #Test t

    WHERE id IN (SELECT id FROM #Test WHERE value = 'E')

    GROUP BY id;

    SET STATISTICS TIME OFF;

    PRINT '+++ Lynn''s solution (heap):';

    SET STATISTICS TIME ON;

    with BaseIds as (

    select distinct tt.Id from #Test tt where not exists(select 1 from #Test tt1 where tt.Id = tt1.Id and tt1.value in ('E'))

    )

    select @ID=bi.Id, @Value=stuff((select '-' + tt.value from #Test tt where bi.Id = tt.Id order by tt.value for xml path(''),TYPE).value('.','varchar(max)'),1,1,'')

    from BaseIds bi;

    SET STATISTICS TIME OFF;

    ALTER TABLE #Test

    ADD CONSTRAINT t_pk PRIMARY KEY (id, value);

    PRINT '+++ Dwain''s solution (PRIMARY KEY):';

    SET STATISTICS TIME ON;

    SELECT @ID=id

    ,@Value=STUFF(

    (

    SELECT '-' + b.value

    FROM #Test b

    WHERE a.id = b.ID

    ORDER BY b.value

    FOR XML PATH('')

    ), 1, 1, '')

    FROM #Test a

    GROUP BY id

    HAVING COUNT(CASE WHEN value = 'E' THEN 1 END) > 0;

    SET STATISTICS TIME OFF;

    PRINT '+++ Luis''s solution (PRIMARY KEY):';

    SET STATISTICS TIME ON;

    SELECT @ID=id,

    @Value=STUFF((SELECT '-' + value FROM #Test it WHERE t.id = it.id

    ORDER BY value

    FOR XML PATH('')), 1, 1, '')

    FROM #Test t

    WHERE id IN (SELECT id FROM #Test WHERE value = 'E')

    GROUP BY id;

    SET STATISTICS TIME OFF;

    PRINT '+++ Lynn''s solution (PRIMARY KEY):';

    SET STATISTICS TIME ON;

    with BaseIds as (

    select distinct tt.Id from #Test tt where not exists(select 1 from #Test tt1 where tt.Id = tt1.Id and tt1.value in ('E'))

    )

    select @ID=bi.Id, @Value=stuff((select '-' + tt.value from #Test tt where bi.Id = tt.Id order by tt.value for xml path(''),TYPE).value('.','varchar(max)'),1,1,'')

    from BaseIds bi;

    SET STATISTICS TIME OFF;

    ALTER TABLE #Test

    DROP CONSTRAINT t_pk;

    CREATE UNIQUE NONCLUSTERED INDEX t_ix1 ON #Test (id, value);

    PRINT '+++ Dwain''s solution (UNIQUE NONCLUSTERED INDEX):';

    SET STATISTICS TIME ON;

    SELECT @ID=id

    ,@Value=STUFF(

    (

    SELECT '-' + b.value

    FROM #Test b

    WHERE a.id = b.ID

    ORDER BY b.value

    FOR XML PATH('')

    ), 1, 1, '')

    FROM #Test a

    GROUP BY id

    HAVING COUNT(CASE WHEN value = 'E' THEN 1 END) > 0;

    SET STATISTICS TIME OFF;

    PRINT '+++ Luis''s solution (UNIQUE NONCLUSTERED INDEX):';

    SET STATISTICS TIME ON;

    SELECT @ID=id,

    @Value=STUFF((SELECT '-' + value FROM #Test it WHERE t.id = it.id

    ORDER BY value

    FOR XML PATH('')), 1, 1, '')

    FROM #Test t

    WHERE id IN (SELECT id FROM #Test WHERE value = 'E')

    GROUP BY id;

    SET STATISTICS TIME OFF;

    PRINT '+++ Lynn''s solution (UNIQUE NONCLUSTERED INDEX):';

    SET STATISTICS TIME ON;

    with BaseIds as (

    select distinct tt.Id from #Test tt where not exists(select 1 from #Test tt1 where tt.Id = tt1.Id and tt1.value in ('E'))

    )

    select @ID=bi.Id, @Value=stuff((select '-' + tt.value from #Test tt where bi.Id = tt.Id order by tt.value for xml path(''),TYPE).value('.','varchar(max)'),1,1,'')

    from BaseIds bi;

    SET STATISTICS TIME OFF;

    GO

    DROP TABLE #Test;

    And here are the timing results:

    +++ Dwain's solution (heap):

    SQL Server Execution Times:

    CPU time = 1092 ms, elapsed time = 1112 ms.

    +++ Luis's solution (heap):

    SQL Server Execution Times:

    CPU time = 1201 ms, elapsed time = 970 ms.

    +++ Lynn's solution (heap):

    SQL Server Execution Times:

    CPU time = 1497 ms, elapsed time = 1032 ms.

    +++ Dwain's solution (PRIMARY KEY):

    SQL Server Execution Times:

    CPU time = 140 ms, elapsed time = 141 ms.

    +++ Luis's solution (PRIMARY KEY):

    SQL Server Execution Times:

    CPU time = 125 ms, elapsed time = 130 ms.

    +++ Lynn's solution (PRIMARY KEY):

    SQL Server Execution Times:

    CPU time = 234 ms, elapsed time = 232 ms.

    +++ Dwain's solution (UNIQUE NONCLUSTERED INDEX):

    SQL Server Execution Times:

    CPU time = 156 ms, elapsed time = 143 ms.

    +++ Luis's solution (UNIQUE NONCLUSTERED INDEX):

    SQL Server Execution Times:

    CPU time = 141 ms, elapsed time = 135 ms.

    +++ Lynn's solution (UNIQUE NONCLUSTERED INDEX):

    SQL Server Execution Times:

    CPU time = 234 ms, elapsed time = 231 ms.

    Note also that Lynn's solution uses the more generalized/safer (but slower) method to concatenate rows, so this is not exactly an apples to apples comparison.

    Edit: Note that things could change radically depending on many factors including different indexing, overlapping indexing, more rows (the optimizer may choose to parallelize some but not all of the plans), etc. Hence my suggestion you try it within your data model specifics.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (5/19/2015)


    Part of the issue I had with your question is in this paragraph. You say "indexed on the appropriate fields" but unfortunately that is a rather subjective statement. An example of DDL with your Indexing might have drawn a more direct response.

    That means indexed on the fields used in the query. The actual query involves some joins that are not pertinent to the matter I was asking about, so I left them out of the original post to keep my question as simple and on-topic as possible.

    dwain.c (5/19/2015)


    As to the execution plans being the same, that strikes me as a bit odd. I am not saying they cannot be the same, but there are various things happening in each of the queries that suggests to me that they shouldn't be.

    I also think it's odd that a HAVING and WHERE clause end up with identical plans, but that's what happened when I ran it on my actual data. I couldn't believe it at first, so I opened two panels side by side in SSMS and put one version in each, with display of execution plans enabled. Clicked through every single execution point and data path in each panel, and not one difference anywhere. I'm not that great at reading execution plans, so I don't completely understand what every step does, but I can certainly tell if it's the same in both panels.

    pdanes (5/19/2015)


    Another thing has occurred to me to try, and that is a filtered view with an index and included fields for the source, rather than straight from the table. The criteria on which I select (the 'E' in my example) do not change for the various queries that are doing this, so that might be yet another effective tweak.

    dwain.c (5/19/2015)


    You could try that. I have to wonder why if you've already got a query that is running "instantaneously."

    Hm - if you're capable of asking that, I don't know that I can formulate an answer that will make any sense to you. All I can say is that I do my best to constantly learn newer and better ways of doing things, even if the impact may not be noticeable immediately.

    dwain.c (5/19/2015)


    Be that as it may, if I have time in the next couple of hours, I may take a page out of this article and run each of the queries through their paces and post back results.

    Thanks, I've looked at them. Interesting results - an order of magnitude difference from the slowest to fastest. Again, my current dataset responds instantly, but datasets and complexity do grow, you know. At some point in the future, I may well be happy that I took the time now to make this as fast as possible. It's happened before, many times.

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply