Need help on performance and integrity

  • AnzioBake (10/22/2008)


    I had some trouble logging on so read the other posts after my reply

    I had to make a change but now I am not sure if the result set is correct (see below)

    4574 appears as a duplicate of 4572 which is a duplicate of 4571 -4570

    RowIDRowKeyLocaleCode

    45691deA

    45711enA

    45721enA

    45741enA

    45751enA

    45761enA

    45771enA

    45791enA

    45821enA

    45831enA

    45841enA

    45851enA

    45871enA

    45891enA

    45901enA

    45921enA

    45931enA

    45951deA

    45981deA

    46001zhA

    46011deC

    Hi AnzioBake, yes you've got the CORRECT output on this second post of yours.

    Thank you for your help.

  • Sorry for not getting back with you Dan. Lost the primary hard drive in my dev machine yesterday, so it was an all day affair getting the thing back up and running. Think I'm just about there, although the resolution / color balance on the monitors now seems all screwed up. In any case, looks like AnzioBake has discovered a solution.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • I tried to post this late in the day, yesterday, but was having trouble with the website:

    Dan - I believe AnzioBake has now likely found the solution, but even he still had a question about exactly what constituted a duplicate: something for which you've never provided a complete and exact statement or definition that describes it in sufficiently excruciating detail AND handles all possible input.

    I'm pretty sure that's why it's taking so long for someone to provide a solution. Also, you've never stated one word about WHY you need to do this. If we knew the ultimate objective, we'd have a far better chance at helping you get there.

    Steve

    (aka smunson)

    :):):)

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Garadin (10/22/2008)


    Sorry for not getting back with you Dan. Lost the primary hard drive in my dev machine yesterday, so it was an all day affair getting the thing back up and running. Think I'm just about there, although the resolution / color balance on the monitors now seems all screwed up. In any case, looks like AnzioBake has discovered a solution.

    No problem Seth, just waiting on what the solution AnzioBake does though but I think he got it already. But if you think you have time to fix the solution you have then that would be great. I really appreciate your help guys. Thank you for that.

  • AnzioBake (10/22/2008)


    My solution is a bit long but executes in under three (3) minutes.

    Here is s sample of the longer series and the results set. If Correct I will post my Solution...

    Anzio, it is generally good form to always post your code here, as this gives others the chance to review it, check it and even test it. If you think that it might be too long for a post then just include it as an attachment. 🙂

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • I seem to have had some internet issues at work (where I was posting from earlier ) I will post the code as an attachment in the morning. Sorry for the delay, but I ended up in a meeting that took whole afternoon and left immediately.

  • Dan Segalles (10/22/2008)


    But if you think you have time to fix the solution you have then that would be great.

    Unfortunately, I've hit a dead wall with the method I was trying. I've tried it every which way I can think of (without completely ditching that method and using something different), and I get outdone by the query optimizer every time. It seems that it simply will not recursively run that subquery for each row. While that's really cool for the fact that it won't let me create RBAR(and has solved several other issues)... it's not very cool in the sense that it makes this one not work at all. Grabbing duplicates per group is easy. Grabbing them whilst ignoring other rows that happened before the last one, as well as including 'patches' of duplicates until the next change seems to add too much logic for this solution to handle.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • This is my solution. Having a problem at work connecting to website.

    --Select * From #TempTable

    IF OBJECT_ID ('TempDb..#Seq','U') IS NOT NULL

    BEGIN

    DROP TABLE #Seq

    END

    CREATE TABLE #Seq

    (

    RowID int PRIMARY KEY CLUSTERED,

    Code char(1)

    ,NextSeqStartRowID int

    )

    IF OBJECT_ID ('TempDb..#Dupe','U') IS NOT NULL

    BEGIN

    DROP TABLE #Dupe

    END

    CREATE TABLE #Dupe (

    SeqStartRowID int

    , Code Char(1)

    , NextSeqStartRowID int

    , FirstRowID int

    , DuplicateRowID int

    , Primary Key Clustered( SeqStartRowID , Code , NextSeqStartRowID , FirstRowID )

    )

    IF OBJECT_ID ('TempDb..#Result','U') IS NOT NULL

    BEGIN

    DROP TABLE #Result

    END

    CREATE TABLE #Result (

    RowID int PRIMARY KEY CLUSTERED,

    RowKey varchar(1000),

    Locale varchar(10),

    Code char(1)

    )

    DECLARE @OptRows int; -- variable which records the optimum number of rows to scan

    DECLARE@RowCnt int

    SET @OptRows = 100;

    ----Find All the induvidual sequences -------------------

    Insert into #Seq

    Select

    RowID = Min( RowiD)

    , Code

    , NextSeqStartRowID = isNUll( NextSeqStartRowID, Max(RowID)+1 )

    From(

    Select

    a.RowID

    , a.Code

    , NextSeqStartRowID = Min( b.RowID )

    From #TempTable as a

    Inner Join #TempTable as b

    on b.rowID > a.RowID

    And b.RowID <= a.RowID +@OptRows

    And b.Code <> a.Code

    Group by

    a.RowID

    , a.Code

    )as c

    Group by

    Code

    , NextSeqStartRowID

    -----------------------------------------------------------

    ----Identify all Duplicates in sequences--------------------

    Insert into #Dupe

    Select

    SeqStartRowID = s.rowID

    , Code = s.Code

    , s.NextSeqStartRowID

    , FirstRowID = t.RowID

    , DuplicateRowID = Min( x.RowId )

    From #Seq as s

    Left Join #TempTable as t

    on t.rowID >= s.RowID

    And t.RowID < s.NextSeqStartRowID

    Inner JOin #TempTable as x

    on x.RowId >= s.RowID

    And x.RowID < s.NextSeqStartRowID

    And x.RowID > t.rowID

    And x.Locale = t.locale

    And x.rowKey = t.RowKey

    Group by

    s.rowID

    , s.Code

    , s.NextSeqStartRowID

    , t.RowID

    Order by

    s.RowID

    , t.rowID

    ---------------------------------------------------------

    ----Constuct the result set-------------------------------

    Delete from #Result

    Insert into #Result

    Select d.*

    From #Seq as s

    Inner join #TempTable as d

    on d.rowid = s.rowid

    Set @RowCnt = 1

    While @Rowcnt > 0 Begin

    Insert into #Result

    Select d.*

    From

    (

    Select

    S.LstRes

    , RowId = Min( d.DuplicateRowId )

    From(

    ----Find Last RowID in Result Set

    Select

    s.RowID

    , s.Code

    , s.NextSeqStartRowID

    , LstRes = Max( r.RowID )

    From #Seq as s

    Inner join #Result as r

    on r.RowId >= s.rowid

    And r.RowID < s.NextSeqStartRowId

    Group by

    s.RowID

    , s.Code

    , s.NextSeqStartRowID

    --------------------------------

    ) as s

    Inner join #Dupe as d

    on d.SeqStartRowID = s.rowid

    And d.FirstRowID >= s.LstRes ----Can be cahnged to > if you do not want to copare to the last duplicate row

    Group by

    LstRes

    ) as x

    Inner join #TempTable as d

    on d.RowID = x.RowID

    Select @RowCnt = @@RowCount

    End

    ----------------------------------------------------

    Select * From #Result

  • It has been 24 hours and no response. I just wanna confirm that the solution works and performs adequately.

  • AnzioBake (10/23/2008)


    It has been 24 hours and no response. I just wanna confirm that the solution works and performs adequately.

    Hello Anzio, I apologize for not getting back to you right away since I have to take care of some emergency matters. Anyway, I tested your code during the weekend for integrity but it wasn't giving me the correct output for some scenarios.

    I simply test it with this kind of scenario:

    INSERT #TempTable

    SELECT 1,'1','en','A' UNION ALL

    SELECT 2,'1','en','A' UNION ALL

    SELECT 5,'1','en','C' UNION ALL

    SELECT 7,'1','en','A' UNION ALL

    SELECT 8,'1','zh','A' UNION ALL

    SELECT 10,'1','en','A' UNION ALL

    SELECT 11,'1','zh','A' UNION ALL

    SELECT 20,'1','en','D' UNION ALL

    SELECT 21,'1','en','D'

    GO

    Which should output:

    RowIDRowKeyLocaleCode

    21enA

    51enC

    71enA

    101enA

    201enD

    Your script outputted this:

    RowIDRowKeyLocaleCode

    11enA

    21enA

    51enC

    71enA

    101enA

    Thank for your help Anzio.

  • No problem with the delay. I just wanted to know if it has solved your problem...I will check the scenario where it gives the incorrect answer

  • Sorry for posting the whole script again. some minor changes required but the code below gives the required result.

    I changed the script to

    1. Take All codes into Account

    2. Only look at duplicates for code = A and

    3. Remove the first occurance of code = A from the result set.

    Question

    Do you want to remove the first occurance of Code ='A' or the first record? You can adjust the delete record script based on your answer.

    IF OBJECT_ID ('TempDB..#TempTable','U') IS NOT NULL

    BEGIN

    DROP TABLE #TempTable

    END

    CREATE TABLE #TempTable

    (

    RowID int PRIMARY KEY CLUSTERED,

    RowKey varchar(1000),

    Locale varchar(10),

    Code char(1)

    )

    GO

    INSERT #TempTable

    SELECT 1,'1','en','A' UNION ALL

    SELECT 2,'1','en','A' UNION ALL

    SELECT 5,'1','en','C' UNION ALL

    SELECT 7,'1','en','A' UNION ALL

    SELECT 8,'1','zh','A' UNION ALL

    SELECT 10,'1','en','A' UNION ALL

    SELECT 11,'1','zh','A' UNION ALL

    SELECT 20,'1','en','D' UNION ALL

    SELECT 21,'1','en','D'

    GO

    /*

    Which should output:

    RowID RowKey Locale Code

    2 1 en A

    5 1 en C

    7 1 en A

    10 1 en A

    20 1 en D

    Your script outputted this:

    RowID RowKey Locale Code

    1 1 en A

    2 1 en A

    5 1 en C

    7 1 en A

    10 1 en A

    */

    --Select * From #TempTable

    IF OBJECT_ID ('TempDb..#Seq','U') IS NOT NULL

    BEGIN

    DROP TABLE #Seq

    END

    CREATE TABLE #Seq

    (

    RowID int PRIMARY KEY CLUSTERED,

    Code char(1)

    ,NextSeqStartRowID int

    )

    IF OBJECT_ID ('TempDb..#Dupe','U') IS NOT NULL

    BEGIN

    DROP TABLE #Dupe

    END

    CREATE TABLE #Dupe (

    SeqStartRowID int

    , Code Char(1)

    , NextSeqStartRowID int

    , FirstRowID int

    , DuplicateRowID int

    , Primary Key Clustered( SeqStartRowID , Code , NextSeqStartRowID , FirstRowID )

    )

    IF OBJECT_ID ('TempDb..#Result','U') IS NOT NULL

    BEGIN

    DROP TABLE #Result

    END

    CREATE TABLE #Result (

    RowID int PRIMARY KEY CLUSTERED,

    RowKey varchar(1000),

    Locale varchar(10),

    Code char(1)

    )

    DECLARE @OptRows int; -- variable which records the optimum number of rows to scan

    DECLARE@RowCnt int

    SET @OptRows = 100;

    ----Find All the induvidual sequences -------------------

    Insert into #Seq

    Select

    RowID = Min( RowiD)

    , Code

    , NextSeqStartRowID = isNUll( NextSeqStartRowID, Max(RowID)+1 )

    From(

    Select

    a.RowID

    , a.Code

    , NextSeqStartRowID = Min( b.RowID )

    From #TempTable as a

    Left Join #TempTable as b

    on b.rowID > a.RowID

    And b.RowID <= a.RowID +@OptRows

    And b.Code <> a.Code

    Group by

    a.RowID

    , a.Code

    )as c

    Group by

    Code

    , NextSeqStartRowID

    -----------------------------------------------------------

    ----Identify all Duplicates in sequences--------------------

    Insert into #Dupe

    Select

    SeqStartRowID = s.rowID

    , Code = s.Code

    , s.NextSeqStartRowID

    , FirstRowID = t.RowID

    , DuplicateRowID = Min( x.RowId )

    From #Seq as s

    Left Join #TempTable as t

    on t.rowID >= s.RowID

    And t.RowID < s.NextSeqStartRowID

    Inner JOin #TempTable as x

    on x.RowId >= s.RowID

    And x.RowID < s.NextSeqStartRowID

    And x.RowID > t.rowID

    And x.Locale = t.locale

    And x.rowKey = t.RowKey

    Group by

    s.rowID

    , s.Code

    , s.NextSeqStartRowID

    , t.RowID

    Order by

    s.RowID

    , t.rowID

    ---------------------------------------------------------

    ----Construct the result set-------------------------------

    Delete from #Result

    Insert into #Result

    Select d.*

    From #Seq as s

    Inner join #TempTable as d

    on d.rowid = s.rowid

    Set @RowCnt = 1

    While @Rowcnt > 0 Begin

    Insert into #Result

    Select d.*

    From

    (

    Select

    S.LstRes

    , RowId = Min( d.DuplicateRowId )

    From(

    ----Find Last RowID in Result Set

    Select

    s.RowID

    , s.Code

    , s.NextSeqStartRowID

    , LstRes = Max( r.RowID )

    From #Seq as s

    Inner join #Result as r

    on r.RowId >= s.rowid

    And r.RowID < s.NextSeqStartRowId

    Group by

    s.RowID

    , s.Code

    , s.NextSeqStartRowID

    --------------------------------

    ) as s

    Inner join #Dupe as d

    on d.SeqStartRowID = s.rowid

    And d.FirstRowID >= s.LstRes ----Can be changed to > if you do not want to compare to the last duplicate row

    And d.Code = 'A'--Only interested in Duplicates of A

    Group by

    LstRes

    ) as x

    Inner join #TempTable as d

    on d.RowID = x.RowID

    Select @RowCnt = @@RowCount

    End

    ----------------------------------------------------

    --Remove First Record of the sequence for Code = 'A'---------

    Delete from tgt

    From #Result as tgt

    Inner Join(

    Select Min(RowID) as RowID

    From #Seq

    Where Code = 'A'

    )as seq

    on seq.RowID = Tgt.RowID

    ------------------------------

    Select * From #Result

  • Heh... but the question remains... has anyone done it correctly, yet?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • For records {1, 2} OP wants the second record with ID 2 because there is a duplicate record.

    For records {7, 8, 10, 11} OP wants the first record with ID 7 because there are duplicate records.

    Why first record for first batch and second record for second batch?

    What are the business rules for this?

    I know the business rule is to remove the first duplicate of every batch, but applying that rule on second batch would lead to return record with ID 8, not ID 7 as OP writes.

    The reason I posted the many links is that in every link there are different descriptions of the business rules. I don't know which is correct.


    N 56°04'39.16"
    E 12°55'05.25"

Viewing 15 posts - 31 through 45 (of 115 total)

You must be logged in to reply to this topic. Login to reply