I suspect that there's a better way to minimize IO, but ...

  • Hi gang,

    I'm currently running the following commands, and they work. But I suspect there's a better way, one that avoids the DISTINCT and doesn't run the subquery for each record where it's not necessary. As it stands, the subquery runs for every record, and then the DISTINCT extracts the unique occurrences. I think this can be accomplished by avoiding the DISTINCT, but I just haven't been able to get a grip on how to do it.

    The desired output is as this code produces - a distinct list of Values, with a comma-separated list of all Codes per Value.

    On a related note... is there a way to get the CSV list without the leading comma as part of the subquery? Right now, I have to come back after it's run and remove it. (A sketch of one option follows the code below.)

    declare @temp TABLE (
        Value int,
        Code numeric(5,3))

    insert into @temp
    select 1, 1.725 UNION
    select 1, 1.818 UNION
    select 2, 2.110 UNION
    select 2, 2.575 UNION
    select 3, 8.257 UNION
    select 3, 9.258

    declare @t2 table (Value int, Code varchar(50))

    insert into @t2
    select distinct
        Value,
        Code = (select ', ' + convert(varchar(6), Code) from @temp where Value = t.Value FOR XML PATH(''))
    from @temp t

    update @t2 set Code = stuff(Code, 1, 2, '')

    select * from @t2
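
    For the leading comma: one option (the replies below arrive at the same pattern) is to wrap the parenthesised subquery in STUFF, so the ', ' is removed in the same statement and the follow-up UPDATE isn't needed. A minimal sketch against the same @temp data:

    -- Sketch: same output as above, but STUFF strips the leading ', '
    -- inside the SELECT itself, so no separate UPDATE is required
    select distinct
        Value,
        Code = stuff((select ', ' + convert(varchar(6), Code)
                      from @temp
                      where Value = t.Value
                      FOR XML PATH('')), 1, 2, '')
    from @temp t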

    Thanks,

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Hi Wayne,

    Does your table have to be a variable? I had a go with a bigger data set and got really bad results from it; it got a lot quicker once I added an index and a GROUP BY.

    --Jeff Moden's test table
    SELECT TOP 1000000
        RowNum = IDENTITY(INT,1,1),
        SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,
        SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                     + CHAR(ABS(CHECKSUM(NEWID()))%26+65),
        SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
        SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
        SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),
        SomeHex12 = RIGHT(NEWID(),12)
    INTO dbo.JBMTest
    FROM Master.dbo.SysColumns t1
    CROSS JOIN Master.dbo.SysColumns t2

    --===== Add the Primary Key
    ALTER TABLE dbo.JBMTest
        ADD PRIMARY KEY CLUSTERED (RowNum)

    -- equivalent of your query on the large test data – takes _forever_ to run! 16 minutes and counting!
    select distinct
        SomeLetters2,
        Code = stuff((select ', ' + convert(varchar(6), SomeMoney) from JBMTest where SomeLetters2 = t.SomeLetters2 FOR XML PATH('')), 1, 2, '')
    from JBMTest t

    -- add an index to the SomeLetters2 column
    CREATE NONCLUSTERED INDEX [IX_SomeLetters] ON [dbo].[JBMTest]
    (
        [SomeLetters2] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    -- GROUP BY on the indexed column runs pretty quick: c. 10 seconds
    SELECT
        SomeLetters2,
        Code = stuff((select ', ' + convert(varchar(6), SomeMoney) from JBMTest WHERE SomeLetters2 = t.SomeLetters2 FOR XML PATH('')), 1, 2, '')
    FROM JBMTest t
    GROUP BY SomeLetters2

    Heh... after about half an hour of testing I figured it out. The answer is so simple that I just didn't think of it.

    @Allister: Thanks for the test data!

    I took Allister's test data and executed the query (which is just like yours, Wayne). I stopped the execution after about 5 minutes.

    I tried a quirky update solution, which doesn't work, and a CLR version, which took about 5 seconds (great!).

    After that I took a second look at the initial query. The major problem was just the DISTINCT, which operates on the final result and causes the sub-query to be executed 1,000,000 times. Just remove the DISTINCT and use a GROUP BY, and it takes about 7-8 seconds on my system.

    [font="Courier New"]DBCC FREEPROCCACHE WITH NO_INFOMSGS

    DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS

    GO

    ---============= FOR XML =======================

    IF (OBJECT_ID('tempdb..#ResultsXML') IS NOT NULL)

       DROP TABLE #ResultsXML

    CREATE TABLE #ResultsXML (Id INT NOT NULL IDENTITY PRIMARY KEY CLUSTERED, SomeLetters VARCHAR(100), Code VARCHAR(MAX))

    -- equivalent of your query on the large test data – takes _forever_ to run! 16 minutes and counting!

    INSERT INTO #ResultsXML

       SELECT

            SomeLetters2,

            Code = STUFF(

                      (

                         SELECT ', ' + CONVERT(VARCHAR(6), SomeMoney)

                         FROM JBMTest

                         WHERE SomeLetters2 = t.SomeLetters2

                         FOR XML PATH('')

                      ),

                      1, 2, '')

         FROM JBMTest t

         GROUP BY SomeLetters2

    [/font]

    Greets

    Flo

  • Sorry Flo,

    shouldn't have left the DISTINCT query in the middle of my solution!

    For reference, it finally finished running after 23:53... the indexed GROUP BY took 00:10.

    The CLR route sounds promising; I really must make some time to research this area, as a lot of my time is spent on C# development.

    Allister

  • Allister Reid (5/24/2009)


    Sorry Flo,

    shouldn't have left the DISTINCT query in the middle of my solution!

    For reference, it finally finished running after 23:53... the indexed GROUP BY took 00:10.

    The CLR route sounds promising; I really must make some time to research this area, as a lot of my time is spent on C# development.

    Allister

    Hi Allister

    Nothing to apologize for! The DISTINCT query reflected Wayne's approach, and I didn't have to write it since you did 😛

    Nice to hear that you're interested in SQL CLR. CLR is cool for some things in SQL Server, but most of the time T-SQL is the better way for the usual data tasks.
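
    For what it's worth, the T-SQL side of wiring up such a CLR concatenation aggregate looks roughly like this. It's only a sketch: the assembly name, class name, and file path are made up for illustration, and the C# aggregate body itself isn't shown here.

    -- Sketch: registering a hypothetical CLR string-concat aggregate
    -- (assembly name, class name, and path are illustrative only)
    CREATE ASSEMBLY ConcatAggregates
    FROM 'C:\clr\ConcatAggregates.dll'
    WITH PERMISSION_SET = SAFE
    GO

    CREATE AGGREGATE dbo.ConcatCsv (@value VARCHAR(100))
    RETURNS VARCHAR(MAX)
    EXTERNAL NAME ConcatAggregates.[ConcatCsv]
    GO

    -- Usage against the test table: no correlated subquery at all
    SELECT SomeLetters2, dbo.ConcatCsv(CONVERT(VARCHAR(6), SomeMoney)) AS Code
    FROM dbo.JBMTest
    GROUP BY SomeLetters2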

    Greets

    Flo

    Heh-heh... and I wrote the GROUP BY solution in my first post as well as the DISTINCT 😉

    I've been thinking about dynamic custom data types via CLR for user-defined forms; I take it you have to implement access functions for each type... maybe you can point me in the direction of some decent articles/walkthroughs?

  • Allister Reid (5/24/2009)


    Heh-heh... and I wrote the GROUP BY solution in my first post as well as the DISTINCT 😉

    :w00t:

    I didn't see the GROUP BY. Really. So I have to apologize!

    I've been thinking about dynamic custom data types via CLR for user-defined forms; I take it you have to implement access functions for each type... maybe you can point me in the direction of some decent articles/walkthroughs?

    I'm not sure if I understood you here. (Er.. I'm sure I didn't :-D)

    I don't have any good articles/walkthroughs for SQL CLR. I'm primarily a C# developer, so I just had a look at BOL and its (not very good) samples. I've never read a book about SQL CLR.

    Flo

  • Florian Reischl (5/24/2009)


    Allister Reid (5/24/2009)


    Heh-heh... and I wrote the GROUP BY solution in my first post as well as the DISTINCT 😉

    :w00t:

    I didn't see the GROUP BY. Really. So I have to apologize!

    No worries... easy to miss when it takes forever for the script to actually reach that point! I can't take any credit for it anyway, because I got it from another post (actually about 4 or 5 other posts) hereabouts.

    I've been thinking about dynamic custom data types via CLR for user-defined forms; I take it you have to implement access functions for each type... maybe you can point me in the direction of some decent articles/walkthroughs?

    I'm not sure if I understood you here. (Er.. I'm sure I didn't :-D)

    Seriously, don't worry about it, I don't entirely understand it myself yet, and it's pretty off topic! But I've kinda been going in the direction of a system where application administrators can define forms, which are serialised and saved to the database; application users then fill out specific instances of these forms with specific details.

    You could think of the system as a configurable questionnaire, where application admins define the structure of the forms and application users create specific instances of the forms by filling them out; obviously these instances would be saved to the database for future collation/processing/querying. It is these custom objects that I was considering as candidates for custom types, the reason being that I thought this might get round the problem of having un-queryable serialised data in the database.

    Allister

  • Allister Reid (5/24/2009)


    I've been thinking about dynamic custom data types via CLR for user-defined forms; I take it you have to implement access functions for each type... maybe you can point me in the direction of some decent articles/walkthroughs?

    I'm not sure if I understood you here. (Er.. I'm sure I didn't :-D)

    Seriously, don't worry about it, I don't entirely understand it myself yet, and it's pretty off topic! But I've kinda been going in the direction of a system where application administrators can define forms, which are serialised and saved to the database; application users then fill out specific instances of these forms with specific details.

    You could think of the system as a configurable questionnaire, where application admins define the structure of the forms and application users create specific instances of the forms by filling them out; obviously these instances would be saved to the database for future collation/processing/querying. It is these custom objects that I was considering as candidates for custom types, the reason being that I thought this might get round the problem of having un-queryable serialised data in the database.

    Sounds like MS InfoPath ;-).

    It depends on the type of serialization you use. Avoid binary, but you can use the XML data type. You can use one table that contains the plain XML to ensure the availability of the original data. In addition, use an EAV design (if you have to support any kind of user data) and pull the important information into regular database columns, so you can use plain SQL to query the main information. For the really special things, you can query the XML data. A sketch of what I mean follows.
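
    To make that concrete, here is a minimal sketch of the XML + EAV idea. All table and column names (FormInstance, FormAttribute, and so on) are made up for illustration:

    -- Sketch: keep the original document as XML, promote key answers to an EAV table
    CREATE TABLE dbo.FormInstance (
        InstanceId INT IDENTITY PRIMARY KEY,
        FormName   VARCHAR(100) NOT NULL,
        FormData   XML NOT NULL            -- the full, original document
    )

    CREATE TABLE dbo.FormAttribute (
        InstanceId INT NOT NULL REFERENCES dbo.FormInstance (InstanceId),
        Attribute  VARCHAR(100) NOT NULL,
        Value      VARCHAR(500) NULL,
        PRIMARY KEY (InstanceId, Attribute)
    )

    INSERT INTO dbo.FormInstance (FormName, FormData)
    VALUES ('Questionnaire1', '<form><q id="Age">42</q><q id="City">Graz</q></form>')

    -- Promote the important answers into regular columns
    INSERT INTO dbo.FormAttribute (InstanceId, Attribute, Value)
    SELECT InstanceId, 'Age', '42'
    FROM dbo.FormInstance
    WHERE FormName = 'Questionnaire1'

    -- Main information via plain SQL against the EAV columns ...
    SELECT i.InstanceId, a.Value AS Age
    FROM dbo.FormInstance i
    JOIN dbo.FormAttribute a ON a.InstanceId = i.InstanceId AND a.Attribute = 'Age'

    -- ... and the really special things straight out of the XML
    SELECT InstanceId,
           FormData.value('(/form/q[@id="City"])[1]', 'VARCHAR(100)') AS City
    FROM dbo.FormInstance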

    For the front-end, you can either use a 3rd-party tool like Altova Authentic or use XSLT to dynamically transform your templates into an HTML layout. A submit button can call some JavaScript to gather all the entered user data into an XML format and call a web service to store the data.

    Maybe consider opening a new thread for this discussion and posting a link here 😉

    Flo

  • Allister Reid (5/24/2009)


    Hi Wayne,

    Does your table have to be a variable? I had a go with a bigger data set and got really bad results from it; it got a lot quicker once I added an index and a GROUP BY.

    No, and in fact it's only a table variable for this post, to make it easy for y'all to test against. I'm actually using a temp table with indexes myself.
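
    For reference, a minimal sketch of that temp-table version against the sample data from my first post (the table and index names here are just illustrative):

    -- Sketch: temp table instead of the table variable, with an index on Value
    CREATE TABLE #temp (Value int, Code numeric(5,3))

    INSERT INTO #temp
    SELECT 1, 1.725 UNION
    SELECT 1, 1.818 UNION
    SELECT 2, 2.110 UNION
    SELECT 2, 2.575 UNION
    SELECT 3, 8.257 UNION
    SELECT 3, 9.258

    CREATE NONCLUSTERED INDEX IX_temp_Value ON #temp (Value)

    -- GROUP BY + STUFF, as worked out above
    SELECT Value,
           Code = STUFF((SELECT ', ' + CONVERT(varchar(6), Code)
                         FROM #temp
                         WHERE Value = t.Value
                         FOR XML PATH('')), 1, 2, '')
    FROM #temp t
    GROUP BY Value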

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes



  • Florian Reischl (5/24/2009)


    Heh... after about half an hour of testing I figured it out. The answer is so simple that I just didn't think of it.

    @Allister: Thanks for the test data!

    I took Allister's test data and executed the query (which is just like yours, Wayne). I stopped the execution after about 5 minutes.

    I tried a quirky update solution, which doesn't work, and a CLR version, which took about 5 seconds (great!).

    After that I took a second look at the initial query. The major problem was just the DISTINCT, which operates on the final result and causes the sub-query to be executed 1,000,000 times. Just remove the DISTINCT and use a GROUP BY, and it takes about 7-8 seconds on my system.

    Flo & Allister - I want to thank you for helping out with this. I could swear that I had tried the group by... I know that I had at least thought about it. And I had tried out the stuff function in the first select without the additional set of () around it. Geez, it really is this "simple" stuff that trips one up sometimes. And your testing pretty much confirmed what I thought about using the distinct in the first place.. there would be some performance problems when scaling up (though I was pretty happy with what I was getting). Plus, I was kinda concerned about being the victim of Gail's double-edged ninja sword.:-)

    Again, thanks!!!!

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes



  • Hi Wayne

    Glad that we could help!

    Flo
