Need help on performance and integrity

  • The A's only is a very easy fix. The second part should have been handled by the variable check, but it's not reacting the way I thought it would. Looking into that now.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • Heh. You know when I saw the title of this thread, I thought that it was another editorial about the Financial Crisis. 🙂

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • rbarryyoung (10/20/2008)


    Heh. You know when I saw the title of this thread, I thought that it was another editorial about the Financial Crisis. 🙂

    heh heh..sorry bout that, couldn't think of a good title for this one.

  • Ok, I just discovered something interesting. The reason it wasn't working before is because the @LastDupe variable inside that CASE statement does not get properly evaluated for the subselect. It will correctly insert into the table if you try to add it later in the update (or earlier), but inside of that subquery, it never gets evaluated. Hard coding a value works fine, but a variable won't work at all. I tried it 10 different ways, and every time it was ignored. That might help explain why it runs as well as it does.

    [EDIT]

    I had another solution posted here, but I tested it against some of your alternative test data(I did in fact just get lucky with the small data set 😉 ), and it is wrong as well, so at this point I'm scrapping that one too.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • Garadin (10/20/2008)


    Ok, I just discovered something interesting. The reason it wasn't working before is because the @LastDupe variable inside that CASE statement does not get properly evaluated for the subselect. It will correctly insert into the table if you try to add it later in the update (or earlier), but inside of that subquery, it never gets evaluated. Hard coding a value works fine, but a variable won't work at all. I tried it 10 different ways, and every time it was ignored. That might help explain why it runs as well as it does.

    With that in mind, this gets a bit more complicated, as I have to now do it in multiple steps. I'm not quite as happy with this solution as I was before, but it should still be a *huge* improvement over your current method (assuming the new part actually works, and I didn't just get lucky with this small result set)....

    It almost got the result that I'm expecting but when I tried it on this kind of pattern, the result was lacking..

    INSERT #TempTable(RowID,RowKey,Locale,Code)

    SELECT 15,'1','zh','A' UNION ALL

    SELECT 16,'1','zh','A' UNION ALL

    SELECT 20,'2','en','C' UNION ALL

    SELECT 21,'3','de','C' UNION ALL

    SELECT 22,'4','de','A' UNION ALL

    SELECT 23,'2','en','A' UNION ALL

    SELECT 24,'4','de','A' UNION ALL

    SELECT 25,'4','de','A' UNION ALL

    SELECT 30,'1','en','C' UNION ALL

    SELECT 31,'1','en','C'

    The code you supplied outputted this (which is lacking):

    RowIDRowKeyLocaleCodeGroupIDDupe

    161zhA11

    202enC20

    224deA30

    244deA31

    301enC40

    The correct output should be:

    RowIDRowKeyLocaleCode

    161zhA

    202enC

    224deA

    244deA

    254deA

    301enC

    25 4deA - This should be included since in the second range/batch of A's it is a duplicate on 244deA

    Thanks for your time and help.

    P.S. The logic on how to get the result is on the attached file (demo.gif).

  • My apologies for all the semi correct attempts, had I taken the time to read all the posts I glossed over from a couple days ago, I'd have seen most of the errors I made in this. I will take a crack at fixing the last piece of this later tonight if you haven't gotten an answer by then. (Got a couple projects I need to finish before the end of the day)

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • Garadin (10/20/2008)


    My apologies for all the semi correct attempts, had I taken the time to read all the posts I glossed over from a couple days ago, I'd have seen most of the errors I made in this. I will take a crack at fixing the last piece of this later tonight if you haven't gotten an answer by then. (Got a couple projects I need to finish before the end of the day)

    Thank you very much for your time Garadin, I really appreciate your help.

  • /Quote

    SELECT 14,'1','en','A' UNION ALL --/second(2) batch of A

    SELECT 15,'3','en','A' UNION ALL --/second(2) batch of A

    SELECT 16,'3','de','A' UNION ALL --/second(2) batch of A

    SELECT 18,'1','en','A' UNION ALL --/second(2) batch of A <- dup on RowID 14 (checked)

    SELECT 19,'3','de','A' UNION ALL --/second(2) batch of A <- I don't need to include this even if there is dup on RowID 16 since it is already covered in the range from RowID 14-18

    SELECT 20,'1','en','C' UNION ALL

    SELECT 21,'1','en','C'

    Quote/

    I am a bit confused why you would not want to identify RowId = 19 as a duplicate of RowId =16. Is this simply because you want to run this process iteratively i.e

    first identify RowID 18 as a Duplicate and delete

    thenn in next cycle identify 19 as dupe and delete??

  • AnzioBake (10/21/2008)


    /Quote

    SELECT 14,'1','en','A' UNION ALL --/second(2) batch of A

    SELECT 15,'3','en','A' UNION ALL --/second(2) batch of A

    SELECT 16,'3','de','A' UNION ALL --/second(2) batch of A

    SELECT 18,'1','en','A' UNION ALL --/second(2) batch of A <- dup on RowID 14 (checked)

    SELECT 19,'3','de','A' UNION ALL --/second(2) batch of A <- I don't need to include this even if there is dup on RowID 16 since it is already covered in the range from RowID 14-18

    SELECT 20,'1','en','C' UNION ALL

    SELECT 21,'1','en','C'

    Quote/

    I am a bit confused why you would not want to identify RowId = 19 as a duplicate of RowId =16. Is this simply because you want to run this process iteratively i.e

    first identify RowID 18 as a Duplicate and delete

    thenn in next cycle identify 19 as dupe and delete??

    Yes that would be like that of some sort...since RowID 18 is a first occurrence found as duplicate (dupe from RowID 14), the next time I search for another duplicate will start on RowID 18 onwards..

  • Dan,

    Not to rain on your quest here, but I don't think you've actually provided a complete statement of the problem. I know that it's very clear to YOU what result you want, but all we have is a picture, and some limited descriptions. What we do NOT have is a step by step algorithm for deciding that any given row is to be included. If you can write out a step by step completely specific procedure, that for ANY given row of input, decides whether or not that row is included, you'll be 99.999% of the way to a solution. I've read your diagram several times, and am still not clear on the exact set of rules to follow. I'm pretty sure we understand about grouping the sets of A's together, and that's fine, but once one is inside a group of A's, what's next? The algorithm needs to be able to take into account ALL possible input combinations that the rest of a given group of A's can have.

    Steve

    (aka smunson)

    :):):)

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • smunson (10/21/2008)


    Dan,

    Not to rain on your quest here, but I don't think you've actually provided a complete statement of the problem. I know that it's very clear to YOU what result you want, but all we have is a picture, and some limited descriptions. What we do NOT have is a step by step algorithm for deciding that any given row is to be included. If you can write out a step by step completely specific procedure, that for ANY given row of input, decides whether or not that row is included, you'll be 99.999% of the way to a solution. I've read your diagram several times, and am still not clear on the exact set of rules to follow. I'm pretty sure we understand about grouping the sets of A's together, and that's fine, but once one is inside a group of A's, what's next? The algorithm needs to be able to take into account ALL possible input combinations that the rest of a given group of A's can have.

    Steve

    (aka smunson)

    :):):)

    Thank you for pointing this out Steve and sorry for that one. Here is another attachment (Demo2.JPG) to explain how to get the result inside group of A. I hope this will make everything clear.

    Thank you for your help.

  • I swear I've done this problem on this forum before... :hehe:

    Dan, in the GIF file example, please tell me why RowID 10 is NOT considered to be a dupe of RowID 3, but you include RowID 23 as a dupe of RowID 21?

    Is it just because the rowkey changed to 2 and then went back to 1 between 10 and 3 but did not between 23 and 21?

    (and, yeah, you have an error in the instructions on that nice graphic you made. There is no RowID 13... you meant RowID 23.). 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • On my Desktop, because you mentioned performance, I created a table with 1 Million random records that a tweaked to ensure that I got Duplicates and series of data

    I have series ranging from 1- 35 records and multiple Duplicates.

    The result required is:

    the first Row of a series

    the first duplicate row

    followed by the duplicate of a row after the previous duplicate row.

    My solution is a bit long but executes in under three (3) minutes.

    Here is s sample of the longer series and the results set. If Correct I will post my Solution

    Data:

    RowIDRowKeyLocaleCode

    45691deA

    45701enA

    45711enA

    45721enA

    45731zhA

    45741enA

    45751enA

    45761enA

    45771enA

    45781deA

    45791enA

    45801zhA

    45811deA

    45821enA

    45831enA

    45841enA

    45851enA

    45861deA

    45871enA

    45881zhA

    45891enA

    45901enA

    45911zhA

    45921enA

    45931enA

    45941deA

    45951deA

    45961zhA

    45971enA

    45981deA

    45991zhA

    46001zhA

    46011deC

    Results:

    RowIDRowKeyLocaleCode

    45691deA

    45711enA

    45741enA

    45761enA

    45791enA

    45831enA

    45851enA

    45891enA

    45921enA

    45951deA

    45991zhA

    46011deC

  • I had some trouble logging on so read the other posts after my reply

    I had to make a change but now I am not sure if the result set is correct (see below)

    4574 appears as a duplicate of 4572 which is a duplicate of 4571 -4570

    RowIDRowKeyLocaleCode

    45691deA

    45711enA

    45721enA

    45741enA

    45751enA

    45761enA

    45771enA

    45791enA

    45821enA

    45831enA

    45841enA

    45851enA

    45871enA

    45891enA

    45901enA

    45921enA

    45931enA

    45951deA

    45981deA

    46001zhA

    46011deC

  • Jeff Moden (10/21/2008)


    I swear I've done this problem on this forum before... :hehe:

    Dan, in the GIF file example, please tell me why RowID 10 is NOT considered to be a dupe of RowID 3, but you include RowID 23 as a dupe of RowID 21?

    Is it just because the rowkey changed to 2 and then went back to 1 between 10 and 3 but did not between 23 and 21?

    Jeff, RowID 10 is NOT considered to be a dupe of RowID 3 because from RowID 3 onwards, iteratively the first dupe found is at RowID 5 (In this case I already have the range from RowID 3-5 which has unique data within it and need not to consider those data within that certain range to find another one) and therefore I should start looking again for another dupe from RowID 5 onwards.

    Jeff Moden (10/21/2008)


    (and, yeah, you have an error in the instructions on that nice graphic you made. There is no RowID 13... you meant RowID 23.). :)[

    Oh yeah, damn haven't notice that...yes, I mean RowID 23 and NOT RowID 13.

    Thanks Jeff.

Viewing 15 posts - 16 through 30 (of 115 total)

You must be logged in to reply to this topic. Login to reply