Need help on performance and integrity

  • AnzioBake (10/28/2008)


    The above change produces the following results

    RowID  RowKey  Locale  Code
    1      1       en      A
    2      1       en      A
    5      1       en      C
    7      1       en      A
    10     1       en      A
    12     1       zh      A
    13     1       zh      A
    14     1       zh      A
    20     1       en      D

    Note that the first row in the table (RowID = 0) is removed.

    Rows 12, 13, 14 are all in the result set as required

    Am I still missing something in the solution?

    Anzio, your approach looks good. I tested it for data integrity and it looks great; it's giving me the correct output, but I still have to test it for performance later tonight when I get home. I'll keep you posted. Thanks.

  • Dan,

    Despite your explanation, I'm still not sure I completely understand the requirements, but I now have what I think is a solution. I'm not too sure on the performance of this, but it does use an UPDATE to a table variable, which could be easily replaced with a temp table. I took the initial 3 patterns from your original post, and this code produces correct results for those 3 patterns. I don't have time to look through the other 6 pages for other test patterns, so you'll have to test this against those other patterns. I have patterns 2 and 3 commented out using the /* and */ construct:

    --===== Set up the initial test data
    DECLARE @data TABLE(
        RowID int PRIMARY KEY CLUSTERED,
        RowKey varchar(2),
        Locale char(2),
        Code char(1)
    )
    INSERT INTO @data
    --===== PATTERN ONE
    SELECT 1,'1','en','C' UNION ALL
    SELECT 2,'1','de','C' UNION ALL
    SELECT 3,'1','zh','A' UNION ALL
    SELECT 4,'2','en','A' UNION ALL
    SELECT 5,'2','en','A' UNION ALL
    SELECT 10,'1','zh','A' UNION ALL
    SELECT 11,'1','zh','A' UNION ALL
    SELECT 15,'1','zh','C' UNION ALL
    SELECT 16,'1','de','C' UNION ALL
    SELECT 20,'1','en','A' UNION ALL
    SELECT 21,'3','en','A' UNION ALL
    SELECT 22,'3','de','A' UNION ALL
    SELECT 23,'3','en','A' UNION ALL
    SELECT 24,'3','de','A' UNION ALL
    SELECT 30,'1','en','C' UNION ALL
    SELECT 31,'1','en','C'
    --===== PATTERN TWO
    /*
    SELECT 1,'1','en','D' UNION ALL
    SELECT 2,'1','de','C' UNION ALL
    SELECT 3,'1','zh','A' UNION ALL
    SELECT 4,'1','zh','A' UNION ALL
    SELECT 5,'1','zh','A' UNION ALL
    SELECT 10,'1','zh','A' UNION ALL
    SELECT 11,'1','zh','C' UNION ALL
    SELECT 15,'1','zh','C' UNION ALL
    SELECT 16,'1','de','C' UNION ALL
    SELECT 20,'3','en','A' UNION ALL
    SELECT 21,'3','de','A' UNION ALL
    SELECT 22,'3','de','A' UNION ALL
    SELECT 23,'3','en','A' UNION ALL
    SELECT 24,'3','de','D' UNION ALL
    SELECT 30,'1','en','C' UNION ALL
    SELECT 31,'1','en','C'
    */
    --===== PATTERN THREE
    /*
    SELECT 1,'5','en','D' UNION ALL
    SELECT 2,'5','de','C' UNION ALL
    SELECT 3,'4','zh','C' UNION ALL
    SELECT 5,'5','zh','A' UNION ALL
    SELECT 6,'4','zh','A' UNION ALL
    SELECT 7,'3','zh','A' UNION ALL
    SELECT 11,'2','zh','A' UNION ALL
    SELECT 15,'1','zh','A' UNION ALL
    SELECT 16,'1','zh','A' UNION ALL
    SELECT 20,'2','en','C' UNION ALL
    SELECT 21,'3','de','C' UNION ALL
    SELECT 22,'4','de','A' UNION ALL
    SELECT 23,'2','en','A' UNION ALL
    SELECT 24,'4','de','A' UNION ALL
    SELECT 30,'1','en','C' UNION ALL
    SELECT 31,'1','en','C'
    */
    --===== Set up a results table
    DECLARE @EDATA TABLE(
        RN int PRIMARY KEY CLUSTERED,
        RowID int,
        RowKey char(1),
        Locale char(2),
        Code char(1),
        STARTS_GRP char(1),
        IS_DUPE char(1),
        IS_NEXT_NON_A char(1)
    )
    --===== Populate our results table with initial data and row numbers
    INSERT INTO @EDATA (RN, RowID, RowKey, Locale, Code)
    SELECT ROW_NUMBER() OVER(ORDER BY RowID) AS RN, RowID, RowKey, Locale, Code
    FROM @data
    --===== Update our results to indicate where groups of A's start
    UPDATE E
    SET STARTS_GRP = 'Y'
    FROM @EDATA AS E
    WHERE E.Code = 'A' AND
          EXISTS (SELECT E2.RN
                  FROM @EDATA AS E2
                  WHERE E2.Code <> 'A' AND
                        E2.RN = E.RN - 1)
    --===== Update our results to indicate any consecutive duplicates
    UPDATE E
    SET IS_DUPE = 'Y'
    FROM @EDATA AS E
    WHERE E.Code = 'A' AND
          EXISTS (SELECT E2.RN
                  FROM @EDATA AS E2
                  WHERE E2.Code = 'A' AND
                        E2.RN = E.RN - 1 AND
                        E2.RowKey = E.RowKey AND
                        E2.Locale = E.Locale)
    --===== Update our results to indicate every break from 'A'
    UPDATE E
    SET IS_NEXT_NON_A = 'Y'
    FROM @EDATA AS E
    WHERE E.Code <> 'A' AND
          (EXISTS (SELECT E2.RN
                   FROM @EDATA AS E2
                   WHERE E2.Code = 'A' AND
                         E2.RN = E.RN - 1) OR
           EXISTS (SELECT E5.RN
                   FROM @EDATA AS E5
                   WHERE E5.RN = E.RN - 1 AND
                         E5.Code <> 'A' AND
                         E5.Code <> E.Code))
    --===== Update our results to find non-consecutive dupes
    UPDATE E
    SET IS_DUPE = 'Y'
    FROM @EDATA AS E
    WHERE Code = 'A' AND
          IS_DUPE IS NULL AND
          (SELECT E4.IS_DUPE FROM @EDATA AS E4 WHERE E4.RN = E.RN - 1) IS NULL AND
          EXISTS (SELECT E2.RN
                  FROM @EDATA AS E2
                  WHERE E2.Code = 'A' AND
                        E2.RN < E.RN - 1 AND
                        E2.RN >= (SELECT MAX(E3.RN)
                                  FROM @EDATA AS E3
                                  WHERE E3.RN < E.RN AND
                                        (E3.STARTS_GRP = 'Y' OR E3.IS_DUPE = 'Y')) AND
                        E2.RowKey = E.RowKey AND
                        E2.Locale = E.Locale AND
                        E2.Code = E.Code)
    --===== Update our results to eliminate non-consecutive dupes that don't qualify
    UPDATE E
    SET IS_DUPE = NULL
    FROM @EDATA AS E INNER JOIN @EDATA AS E2
      ON E.RN = E2.RN + 1
    WHERE E.IS_DUPE = 'Y' AND
          E2.IS_DUPE = 'Y' AND
          E.Code = 'A' AND
          E2.Code = 'A' AND
          (E.RowKey <> E2.RowKey OR
           E.Locale <> E2.Locale)
    --===== Select our entire results table and also the proper results rows
    SELECT *
    FROM @EDATA

    SELECT RowID
    FROM @EDATA
    WHERE STARTS_GRP IS NOT NULL OR
          IS_DUPE IS NOT NULL OR
          IS_NEXT_NON_A IS NOT NULL

    Let me know if you can test this against other patterns and see if it works. The reason I can't be sure of performance is because I timed my query after perfecting it, and the difference in milliseconds between a GETDATE() before and after the query is 0, which means the results were already in cache, and as I have no means (nor authority) to clear it, I can't generate a valid timed result until at least tomorrow.

    Steve

    (aka smunson)

    :):):)

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • UPDATE: just got lucky and generated a 13ms execution time on pattern 1. Subsequent attempts went right back to zero, despite changing to other patterns. I have no idea whether this is the least bit representative, nor how it might apply to a considerably larger test dataset.

    Steve

    (aka smunson)

    :):):)


  • Hi

    As I mentioned in my first post, I tested the code against a table with 1 million records and it completed in under 3 minutes, so let me know if you are getting a very different result...

  • I'm glad to see you guys finally slaying this beast 🙂 Wish I could have participated more, but I had to work on my personal performance to improve the integrity of my job retention. 😉

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Anzio,

    If you get the chance, could you test MY code against that same 1 million records? I'd love to know a) if it produces the same result, and b) what kind of performance it has. I don't have an environment available for this kind of testing. Thanks!

    Steve

    (aka smunson)

    :):):)

    AnzioBake (10/28/2008)


    HI

    As I mentioned in my first post, I tested the code against a table with 1 million records and it completed in under 3 minutes, so let me know if you are getting a very different result...


  • smunson (10/28/2008)


    Dan,

    .....

    Let me know if you can test this against other patterns and see if it works. The reason I can't be sure of performance is because I timed my query after perfecting it, and the difference in milliseconds between a GETDATE() before and after the query is 0, which means the results were already in cache, and as I have no means (nor authority) to clear it, I can't generate a valid timed result until at least tomorrow.

    Steve

    (aka smunson)

    :):):)

    Hi Steve, I have tested your code on the following pattern, but it's giving me an incorrect result.

    --===== Set up the initial test data
    DECLARE @data TABLE(
        RowID int PRIMARY KEY CLUSTERED,
        RowKey varchar(2),
        Locale char(2),
        Code char(1)
    )
    INSERT INTO @data
    SELECT 1,'1','en','A' UNION ALL
    SELECT 2,'2','en','A' UNION ALL
    SELECT 3,'3','en','A' UNION ALL
    SELECT 4,'1','en','A' UNION ALL
    SELECT 5,'2','en','A' UNION ALL
    SELECT 13,'3','en','A' UNION ALL
    SELECT 14,'1','en','A' UNION ALL
    SELECT 15,'2','en','A' UNION ALL
    SELECT 23,'3','en','A' UNION ALL
    SELECT 24,'1','en','A' UNION ALL
    SELECT 31,'2','en','A'
    ---// Output/Result should be: 4, 14, 24
    /*
    SELECT 1,'1','en','A' UNION ALL
    SELECT 2,'2','en','A' UNION ALL
    SELECT 3,'1','en','A' UNION ALL
    SELECT 4,'2','en','A' UNION ALL
    SELECT 5,'1','en','A' UNION ALL
    SELECT 13,'2','en','A' UNION ALL
    SELECT 14,'2','en','A' UNION ALL
    SELECT 15,'2','en','A' UNION ALL
    SELECT 23,'4','en','A' UNION ALL
    SELECT 24,'5','en','A' UNION ALL
    SELECT 31,'1','en','A'
    --// Output/Result should be: 3, 5, 14, 15
    */

    Thanks for your help and time on this.
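    The two expected outputs above can be reproduced with a small model of the duplicate rule. The Python sketch below is an illustration only, not anyone's posted T-SQL: it flags an 'A' row whenever the same RowKey/Locale pair appears in the current scan range, and restarts that range at each flagged duplicate.

    ```python
    # Hypothetical model of the duplicate rule (illustration, not posted T-SQL).
    # A row is flagged when its (RowKey, Locale) already appears in the current
    # scan range; the flagged duplicate then becomes the start of the next range.
    def flag_duplicates(rows):
        """rows: list of (RowID, RowKey, Locale) tuples, ordered by RowID."""
        flagged = []
        scan_start = 0                      # index where the scan range begins
        for i in range(1, len(rows)):
            rid, key, loc = rows[i]
            for j in range(scan_start, i):
                if rows[j][1] == key and rows[j][2] == loc:
                    flagged.append(rid)
                    scan_start = i          # next search includes this duplicate
                    break
        return flagged

    pattern1 = [(1, '1', 'en'), (2, '2', 'en'), (3, '3', 'en'), (4, '1', 'en'),
                (5, '2', 'en'), (13, '3', 'en'), (14, '1', 'en'), (15, '2', 'en'),
                (23, '3', 'en'), (24, '1', 'en'), (31, '2', 'en')]
    pattern2 = [(1, '1', 'en'), (2, '2', 'en'), (3, '1', 'en'), (4, '2', 'en'),
                (5, '1', 'en'), (13, '2', 'en'), (14, '2', 'en'), (15, '2', 'en'),
                (23, '4', 'en'), (24, '5', 'en'), (31, '1', 'en')]

    print(flag_duplicates(pattern1))   # [4, 14, 24]
    print(flag_duplicates(pattern2))   # [3, 5, 14, 15]
    ```

    Run against both patterns, this model returns exactly the outputs Dan lists, so it may be useful as a reference when checking the T-SQL solutions.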

  • Oooh. Do me! Do me! (New solution a few posts back.) Despite the cursor, I'd wager that it'd run *fairly* quick. Unless I screwed something up, I think I finally have the logic of this down, which isn't actually as complicated as it originally sounded.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • Garadin (10/28/2008)


    Oooh. Do me! Do me! (New solution a few posts back.) Despite the cursor, I'd wager that it'd run *fairly* quick. Unless I screwed something up, I think I finally have the logic of this down, which isn't actually as complicated as it originally sounded.

    Seth, your solution works great. It gives me the correct output for every possible pattern that I've created. I still have to test it for performance later tonight after work.

    Thanks for the help and time.

  • Ok. For anyone who is reading this thread who is still somewhat confused on how this works, I'm going to attempt to break it down the way I understand it (now that I'm fairly certain I understand it correctly). One of the pivotal concepts that I think might get missed is that he *does* include the row that *is* a duplicate, but not any before it in the next search. This is how you see 3 in a row that are all marked as duplicates. The second and third are duplicates of the one directly before them, which is included in the new scan range.

    To lower the number of rows I need to effectively demonstrate my point, I will use the same Locale in all example rows. Keep in mind however, that locale must match as well for this to be considered a dupe. If the row should be marked, it has a comment next to it in the following list. Rows not flagged are at the end.

    RowID  RowKey  Locale  Code  Reason for Flagging
    1      1       en      A
    2      1       en      A     Duplicate of 1 (Range 1-2)
    3      2       en      C     New Code Group
    4      1       en      C
    5      1       en      A     New Code Group
    6      2       en      A
    7      1       en      A     Duplicate of 5 (Range 5-7)
    8      1       en      A     Duplicate of 7 (Range 7-8)
    9      2       en      A
    10     2       en      A     Duplicate of 9 (Range 8-10)
    11     1       en      A
    12     2       en      A     Duplicate of 10 (Range 10-12)
    13     1       en      C     New Code Group
    14     1       en      C
    15     1       en      A     New Code Group

    Reasons the others were not chosen:

    1. First record in table. This is the only one that does not follow the rule of first in a new code group.

    4. He does not want duplicates of C's... only A's.

    6. No Duplicate in the group which started at 5.

    9. No match starting at 8 (when the last dupe was found)

    11. No match starting at 10 (when the last dupe was found)

    14. He does not want duplicates of C's... only A's.

    I don't know if that is any clearer than all of his explanations and diagrams, but I figured it couldn't hurt (unless, of course, it's wrong).
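    The walkthrough above can also be expressed as a short executable model. The Python sketch below is a paraphrase of the rules as stated (an illustration, not anyone's posted T-SQL): flag the first row of each new Code group (except the very first row of the table), flag 'A' rows whose RowKey/Locale already appears in the current scan range, and restart that range at each flagged row.

    ```python
    # Model of the walkthrough's rules (illustration, not posted T-SQL).
    def flag_rows(rows):
        """rows: list of (RowID, RowKey, Locale, Code) tuples, ordered by RowID."""
        flagged = []
        scan_start = 0
        for i in range(1, len(rows)):
            rid, key, loc, code = rows[i]
            if code != rows[i - 1][3]:        # first row of a new Code group
                flagged.append(rid)
                scan_start = i
            elif code == 'A':                 # duplicates only matter for 'A' rows
                for j in range(scan_start, i):
                    if rows[j][1] == key and rows[j][2] == loc:
                        flagged.append(rid)
                        scan_start = i        # the duplicate opens the next range
                        break
        return flagged

    example = [(1, '1', 'en', 'A'), (2, '1', 'en', 'A'), (3, '2', 'en', 'C'),
               (4, '1', 'en', 'C'), (5, '1', 'en', 'A'), (6, '2', 'en', 'A'),
               (7, '1', 'en', 'A'), (8, '1', 'en', 'A'), (9, '2', 'en', 'A'),
               (10, '2', 'en', 'A'), (11, '1', 'en', 'A'), (12, '2', 'en', 'A'),
               (13, '1', 'en', 'C'), (14, '1', 'en', 'C'), (15, '1', 'en', 'A')]

    print(flag_rows(example))   # [2, 3, 5, 7, 8, 10, 12, 13, 15]
    ```

    This reproduces the 15-row example exactly: the flagged RowIDs match the rows marked in the table above, and the unflagged ones match the "reasons the others were not chosen" list.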

    Seth Phelabaum



  • I believe that is true, Seth; well pointed out. Thank you for helping me explain, and thank you for your help.

    The only thing I'm still concerned about is data with concentrated duplicates, like the example below, which I'd guess would affect performance more.

    SELECT 1,'1','en','A' UNION ALL
    SELECT 2,'1','en','A' UNION ALL
    SELECT 3,'1','en','A' UNION ALL
    SELECT 4,'1','en','A' UNION ALL
    SELECT 5,'1','en','A' UNION ALL
    SELECT 6,'1','en','A' UNION ALL
    ...
    SELECT 199999,'1','en','A' UNION ALL
    SELECT 200000,'1','en','A'

    I also came up with a solution, but it's much slower in that scenario and quite a bit faster otherwise. Hopefully someone can enlighten me on how to improve its performance in the scenario above.

    DECLARE @RecordCount bigint
    SET @RecordCount = (SELECT COUNT(*) FROM #TempTable)
    IF @RecordCount > 0
    BEGIN
        SELECT RowID, RowKey, Locale, Code, NULL GroupID
        INTO #MyTable
        FROM #TempTable

        CREATE CLUSTERED INDEX Index_RowID_Clstd ON #MyTable(RowID)

        DECLARE @PrevAction char(1),
                @GroupID int,
                @DeletedRowID int,
                @PrevRowID int,
                @MinRowID int,
                @MinRow int,
                @MaxRow int,
                @NumMRange int,
                @RowCnt int,
                @OptimumRows int

        DECLARE @RangeOfMs TABLE (
            ID int IDENTITY PRIMARY KEY CLUSTERED,
            MinRow int,
            MaxRow int
        )
        DECLARE @Results TABLE (
            RowID int,
            Code char(1)
        )

        SET @PrevAction = (SELECT Code FROM #MyTable WHERE RowID = 1)
        SET @GroupID = 1

        UPDATE #MyTable
        SET @GroupID = CASE WHEN Code <> @PrevAction THEN @GroupID + 1 ELSE @GroupID END,
            GroupID = @GroupID,
            @PrevAction = Code
        FROM #MyTable WITH (INDEX(0))

        SELECT * INTO #RankedData
        FROM (SELECT *,
                     RANK() OVER (PARTITION BY Code, GroupID ORDER BY RowID) AS MyRank
              FROM #MyTable
             ) [Subset]

        ;WITH RangePerActionCTE (MinRange, MaxRange) AS (
            SELECT MIN(RowID), MAX(RowID)
            FROM #RankedData
            WHERE Code = 'A'
            GROUP BY Code, GroupID
        )
        INSERT @RangeOfMs
        SELECT * FROM RangePerActionCTE

        SET @NumMRange = (SELECT COUNT(*) FROM @RangeOfMs)
        SET @DeletedRowID = -1
        SET @RowCnt = 1

        WHILE @RowCnt <= @NumMRange
        BEGIN
            SELECT @MinRow = MinRow,
                   @MaxRow = MaxRow
            FROM @RangeOfMs
            WHERE ID = @RowCnt

            SET @MinRowID = (SELECT MIN(RowID)
                             FROM (SELECT *,
                                          RANK() OVER (PARTITION BY RowKey, Locale ORDER BY RowID) MyDup
                                   FROM #MyTable
                                   WHERE RowID BETWEEN @MinRow AND @MaxRow
                                     AND RowID >= @DeletedRowID) [Aggregate]
                             WHERE MyDup > 1)

            IF @MinRowID IS NOT NULL
            BEGIN
                INSERT INTO @Results SELECT @MinRowID, 'A'
                SET @DeletedRowID = @MinRowID
                CONTINUE;
            END
            ELSE
            BEGIN
                SET @RowCnt = @RowCnt + 1
                SET @DeletedRowID = -1
            END
        END

        --// Process result
        SELECT * FROM @Results
        UNION ALL
        SELECT RowID, Code FROM #RankedData
        WHERE MyRank = 1
          AND RowID > 1
        ORDER BY RowID

        DROP TABLE #RankedData
        DROP TABLE #MyTable
    END
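    The first step of the solution above, the ordered UPDATE that assigns a GroupID to each run of identical Code values, relies on SQL Server's so-called "quirky update" behavior (variable assignment inside an UPDATE, with WITH (INDEX(0)) used to encourage clustered-index order). As a language-neutral illustration of just that group-numbering step (my own sketch, not part of the posted code):

    ```python
    # Illustration of the GroupID step: bump the group number each time Code
    # changes from the previous row (rows assumed already ordered by RowID).
    def assign_group_ids(codes):
        groups = []
        group_id = 1
        prev = codes[0] if codes else None
        for code in codes:
            if code != prev:
                group_id += 1
                prev = code
            groups.append(group_id)
        return groups

    print(assign_group_ids(['A', 'A', 'C', 'C', 'A', 'A', 'A']))   # [1, 1, 2, 2, 3, 3, 3]
    ```

    Worth noting: the T-SQL version depends on the UPDATE actually proceeding in clustered-index order, which is not guaranteed behavior, so results should be verified after that step.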

  • Hi Steve (smunson)

    To test your solution, I ran mine again, as I had made some changes. Execution time was 5 min 36 sec.

    I added a print statement after each update statement in your solution. These are the results

    Process Start 2008-10-29 08:47:48

    Completed Step 0 2008-10-29 08:47:55

    Completed Step 1 2008-10-29 08:47:59

    Completed Step 2 2008-10-29 08:48:03

    Completed Step 3 2008-10-29 08:48:07

    The statement has been terminated.

    Query was cancelled by user.

    I stopped the query after 54 min 24 sec.

  • AnzioBake (10/29/2008)


    hi Steve (SMunson)

    To Test your Solution, I ran mine again as I made some changes. Execution time was 5 min 36 sec

    I added a print statement after each update statement in your solution. These are the results

    Process Start 2008-10-29 08:47:48

    Completed Step 0 2008-10-29 08:47:55

    Completed Step 1 2008-10-29 08:47:59

    Completed Step 2 2008-10-29 08:48:03

    Completed Step 3 2008-10-29 08:48:07

    The statement has been terminated.

    Query was cancelled by user.

    I stopped the query after 54 min 24 sec.

    For how many rows?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Although, the only thing I'm also concerned about is the data that has concentrated duplicates like the example below which would affect performance more I guess.

    Due to the way my solution works (I haven't looked at Anzio's or Steve's closely enough to tell whether theirs behave the same), these should not affect the processing time. Did you get a chance to test the performance of any of these methods last night?

    Seth Phelabaum



  • I tested the solutions against the test data set that I created. The table contained 1 million records with sequences ranging from 1 to 35 rows.

    My solution returned 670,000+ rows in the result set.
