Need help on performance and integrity

  • I continue to think that a complete statement of the problem coming from the OP has yet to occur. There is no written description of the EXACT process to follow for any given record within a group of A's. I'm not even sure the OP knows what he wants.

Dan - PLEASE provide a COMPLETE description of EXACTLY how to handle ANY GIVEN RECORD within a set of A's, one that will give the CORRECT RESULT regardless of the input and TAKES INTO ACCOUNT ALL POSSIBLE INPUTS. The description needs to be written down, and it needs to leave absolutely NO ROOM for doubt about what the words mean. I'm thoroughly convinced at this point that this problem will not be solvable until we understand what you want in the EXACT SAME way that YOU understand it. We need to be able to read the description and come away without any questions. So far, every description you've provided has left a number of questions open, and some of the behavior you're asking for could only come from an inconsistent set of rules, or from some other rule that you haven't posted yet. Make NO assumptions, ok?

    Steve

    (aka smunson)

    :):):)


  • Or... the even more elusive question... what on earth the end goal of this thing really is. Part of the reason we are unable to come up with any good set based approach is because I don't think anyone (with the possible exception of yourself) really understands what you're trying to do here.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • This thread is still going !!?!! I'm sorry to have missed all the fun, but I'm feeling lucky, so I'll try again. If the results are correct, then someone else can make it go faster. From what I've read, there are only two things to identify:

    1. The rows where the Code column "breaks" to a new code, which doesn't include the first row in the table.

    2. Any duplicates (same Key, Locale and Code) within the range between breaks, OTHER than the first row in each set of duplicates.

    So anyway, here goes.

    ============================================

CREATE TABLE #TempTable
( SeqID int identity(1,1) primary key
 ,RowID int
 ,RowKey varchar(1000)
 ,Locale varchar(10)
 ,Code char(1)
)

INSERT INTO #TempTable
SELECT 1 as rowID,'1','en','A' UNION ALL
SELECT 2,'1','en','A' UNION ALL
SELECT 5,'1','en','C' UNION ALL
SELECT 7,'1','en','A' UNION ALL
SELECT 8,'1','zh','A' UNION ALL
SELECT 10,'1','en','A' UNION ALL
SELECT 11,'1','zh','A' UNION ALL
SELECT 20,'1','en','D' UNION ALL
SELECT 21,'1','en','D'
order by rowID -- be sure to order by rowID

select * from #temptable

;with cte1 as -- identify breaks
(select t1.code, t1.rowid
 from #temptable t1
 left join #temptable t2 on t1.seqID = t2.seqID + 1
 where t1.code <> t2.code
)
,cte1a as -- include first row for ranging purposes
(select top 1 code, rowID
 from #temptable
 union all
 select * from cte1)
,cte2 as -- identify ranges for duplicates
(select c1.code, c1.rowid as minRow, isnull(min(c1a.rowID) - 1, c1.rowid + 1) as maxRow
 from cte1a c1
 left join cte1a c1a on c1a.code <> c1.code and c1a.rowID > c1.rowID
 -- where c1.code = 'A'
 group by c1.code, c1.rowID
)
,cte3 as -- identify duplicates
(select c.minrow, max(rowid) as dupID, t.rowKey, t.locale, c.code
 from #temptable t
 cross join cte2 c
 where t.rowid >= minRow and t.rowID <= maxRow
 group by c.minRow, t.rowKey, t.locale, c.code
 having count(*) > 1
)
,cte4 as -- combine break rows and duplicate rows
(select dupID as rowID
 from cte3
 union all
 select rowID
 from cte1
)
select distinct c.rowID, t.rowkey, t.locale, t.code
from cte4 c
join #temptable t on c.rowID = t.rowID
order by c.rowID

drop table #temptable

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • bhovious (10/27/2008)


    1. The rows where the Code column "breaks" to a new code, which doesn't include the first row in the table.

    2. Any duplicates (same Key, Locale and Code) within the range between breaks, OTHER than the first row in each set of duplicates.

Unfortunately, it's considerably more complicated than that. He wants it done iteratively, meaning you can't just look for duplicates; you have to go through it line by line, taking into account where the last duplicate was within the code group, and restarting the duplicate match search from that one. However, if there is a copy of a dupe immediately following the exact same dupe, that one is counted as well, but not if there is a different key/locale between them. Beyond that, there are discrepancies over whether the first or second dupe in an iterative series is kept. That's only my current understanding of it, and I don't think it quite encompasses everything. I've tried to do this using a clustered index update (i.e. the running total technique) about 30 different ways now, but it simply will not work with any local variable inside of the update statement. Nor will it look at the last row updated. Beyond that, with the row-by-row logic being applied, I simply do not see any set-based logic working out here.

My only thought for optimizing this currently goes towards breaking up your huge table of data into smaller chunks with a second outer loop (with each chunk also taking 100 rows of the next chunk to satisfy your @optrows for scanning), which lets you perform this operation on smaller data sets in hopes of improving the speed of the scans. This may just as easily hurt the performance as help it, but it's something to try; a rough sketch follows.
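Something like this (dbo.ProcessChunk is a hypothetical wrapper for whatever your current per-chunk logic is, and the 100-row overlap stands in for your @optrows scan window):

DECLARE @ChunkSize INT,
        @Overlap   INT,
        @StartID   INT,
        @EndID     INT,
        @MaxID     INT

SELECT @ChunkSize = 10000,      -- rows per chunk: tune to taste
       @Overlap   = 100,        -- also take 100 rows of the next chunk
       @StartID   = MIN(RowID),
       @MaxID     = MAX(RowID)
FROM #TempTable

WHILE @StartID <= @MaxID
BEGIN
    SET @EndID = @StartID + @ChunkSize + @Overlap - 1

    -- Each pass covers one chunk plus the first @Overlap rows of the
    -- next chunk, so the duplicate scans can see across the boundary.
    EXEC dbo.ProcessChunk @FromID = @StartID, @ToID = @EndID

    SET @StartID = @StartID + @ChunkSize
END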

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • Hey Seth,

Could you give me one example of the iterative situation you describe? I didn't see that in all of the examples I looked at. If you could even point me to the example data he used, I'd be grateful. (Is this what he means back on page 1 when he says row 19 is already covered by the range 14-18?)

    Bob

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Basically, this:

ID  RowKey  Locale  Code
1   1       EN      A
2   2       EN      A
3   2       DU      A
4   1       EN      A
5   2       DU      A

    You see that Rows 3 and 5 are duplicates. However, because Row 4 is a duplicate of Row 1, the matching starts over at row 4. There is no duplicate of row 5 that is greater than row 4. The proper output in this scenario would be 1 and 4. (as far as I understand it)
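Here's that example in the thread's usual sample-data format, for anyone who wants to test against it (the #Example table is made up for the demo; expected output per my reading above is 1 and 4):

CREATE TABLE #Example
( ID INT PRIMARY KEY
 ,RowKey VARCHAR(10)
 ,Locale VARCHAR(10)
 ,Code CHAR(1)
)

INSERT #Example (ID, RowKey, Locale, Code)
SELECT 1,'1','EN','A' UNION ALL
SELECT 2,'2','EN','A' UNION ALL
SELECT 3,'2','DU','A' UNION ALL
SELECT 4,'1','EN','A' UNION ALL
SELECT 5,'2','DU','A'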

    Here's where it gets really hairy.

    If you have this:

ID  RowKey  Locale  Code
1   1       EN      A
2   2       EN      A
3   2       DU      A
4   1       EN      A
5   1       EN      A
6   1       EN      A
7   2       DU      A
8   1       EN      A
9   2       DU      A

    My understanding is that he wants 1,4,5,6,9 returned. 5 and 6 get added in because it's a "group" of duplicates. 7 still gets skipped because there is nothing above the last dupe (6) which matches it, and 8 gets skipped because the "group" of 1 EN A has now been broken. 9 gets returned because it is a duplicate of 7 which was the next candidate after the last dupe of 6.
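And the second example the same way, reusing the same #Example table (expected output per my reading: 1, 4, 5, 6, 9):

TRUNCATE TABLE #Example

INSERT #Example (ID, RowKey, Locale, Code)
SELECT 1,'1','EN','A' UNION ALL
SELECT 2,'2','EN','A' UNION ALL
SELECT 3,'2','DU','A' UNION ALL
SELECT 4,'1','EN','A' UNION ALL
SELECT 5,'1','EN','A' UNION ALL
SELECT 6,'1','EN','A' UNION ALL
SELECT 7,'2','DU','A' UNION ALL
SELECT 8,'1','EN','A' UNION ALL
SELECT 9,'2','DU','A'

DROP TABLE #Example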

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • bhovious (10/27/2008)


    Hey Seth,

Could you give me one example of the iterative situation you describe? I didn't see that in all of the examples I looked at. If you could even point me to the example data he used, I'd be grateful. (Is this what he means back on page 1 when he says row 19 is already covered by the range 14-18?)

    Bob

    Seth's solution is actually iterative. He goes through each row in the table in order and flags duplicates. Take a look at his second post on this thread.

Read this article for a further explanation of how this approach works: http://www.sqlservercentral.com/articles/Advanced+Querying/61716/
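For a minimal illustration of the technique the article describes (a running total via a clustered index update), here is a standalone sketch; the #Totals table and its columns are made up for the demo:

CREATE TABLE #Totals
( ID INT PRIMARY KEY CLUSTERED
 ,Amount INT
 ,RunningTotal INT NULL
)

INSERT #Totals (ID, Amount)
SELECT 1, 10 UNION ALL
SELECT 2, 20 UNION ALL
SELECT 3, 5

DECLARE @RunningTotal INT
SET @RunningTotal = 0

-- The variable carries state from row to row while the UPDATE scans
-- the clustered index in order; each row's RunningTotal is set to the
-- previous row's total plus its own Amount.
UPDATE t
SET @RunningTotal = RunningTotal = @RunningTotal + Amount
FROM #Totals t WITH (TABLOCKX, INDEX(1))
OPTION (MAXDOP 1)

SELECT ID, Amount, RunningTotal FROM #Totals ORDER BY ID

DROP TABLE #Totals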

  • ggraber (10/27/2008)


    bhovious (10/27/2008)


    Hey Seth,

Could you give me one example of the iterative situation you describe? I didn't see that in all of the examples I looked at. If you could even point me to the example data he used, I'd be grateful. (Is this what he means back on page 1 when he says row 19 is already covered by the range 14-18?)

    Bob

    Seth's solution is actually iterative. He goes through each row in the table in order and flags duplicates. Take a look at his second post on this thread.

Read this article for a further explanation of how this approach works: http://www.sqlservercentral.com/articles/Advanced+Querying/61716/

    Or more aptly in this case, doesn't work. Because I need to start the searching over after the last row that matched the criteria, it completely breaks that method. I would dearly love to see someone get that to work, but no matter how much I change the thing around, the optimizer will not recursively:

    A. Look for the highest row that has been flagged as a duplicate or

    B. Use a local variable inside the search.

I've written that statement in a dozen different ways that were logically correct and should have solved it... but the problem is, the optimizer converts them all to derived tables that it generates before the update statement runs, based either on the local variable still being 1 or on the max duplicated row number in the table still being NULL. It refuses to recursively run anything that takes into account the last record that was a duplicate and searches from there. It's actually driving me nuts.
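To make the failure mode concrete, here is a hypothetical reduction of the pattern (not any of my actual attempts; the #Scan table and columns are made up): the EXISTS search depends on the local variable, and the lookup gets built as a derived table before the scan starts, so @LastDupRow stays frozen at its starting value for every row instead of advancing to the last flagged duplicate.

CREATE TABLE #Scan
( RowID INT PRIMARY KEY CLUSTERED
 ,RowKey VARCHAR(1000)
 ,Locale VARCHAR(10)
 ,DupOfPrev INT NULL
)

INSERT #Scan (RowID, RowKey, Locale)
SELECT 1,'1','en' UNION ALL
SELECT 2,'2','en' UNION ALL
SELECT 3,'1','en'

DECLARE @LastDupRow INT
SET @LastDupRow = 0

-- Intended: flag a row when a matching row exists after the last
-- duplicate found, and advance @LastDupRow to it. Actual (per the
-- behavior described above): the EXISTS is evaluated against
-- @LastDupRow's initial value throughout the scan.
UPDATE t
SET @LastDupRow = DupOfPrev =
        CASE WHEN EXISTS (SELECT 1 FROM #Scan prev
                          WHERE prev.RowID > @LastDupRow
                            AND prev.RowID < t.RowID
                            AND prev.RowKey = t.RowKey
                            AND prev.Locale = t.Locale)
             THEN t.RowID ELSE @LastDupRow END
FROM #Scan t WITH (INDEX(1))
OPTION (MAXDOP 1)

DROP TABLE #Scan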

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • smunson (10/27/2008)


    I continue to think that a complete statement of the problem coming from the OP has yet to occur. There is no written description of the EXACT process to follow for any given record within a group of A's. I'm not even sure the OP knows what he wants.

Dan - PLEASE provide a COMPLETE description of EXACTLY how to handle ANY GIVEN RECORD within a set of A's, one that will give the CORRECT RESULT regardless of the input and TAKES INTO ACCOUNT ALL POSSIBLE INPUTS. The description needs to be written down, and it needs to leave absolutely NO ROOM for doubt about what the words mean. I'm thoroughly convinced at this point that this problem will not be solvable until we understand what you want in the EXACT SAME way that YOU understand it. We need to be able to read the description and come away without any questions. So far, every description you've provided has left a number of questions open, and some of the behavior you're asking for could only come from an inconsistent set of rules, or from some other rule that you haven't posted yet. Make NO assumptions, ok?

    Steve

    (aka smunson)

    :):):)

    Hi Steve,

I thought the images I provided were clear enough to explain my goal here, but I guess I'm wrong. I just need two things:

A. Get the RowIDs of the interleaving rows by Code, meaning the rows where the Code column "breaks" to a new code, which doesn't include the first row in the table, as bhovious said.

B. Look into every group of Code 'A' and get the interleaving rows of duplicates within that group, where each search starts from the previous duplicate found. This separates the duplicates so that within every resulting range of RowIDs in an 'A' group, no duplicates are present.

Let's say, for example, I have this data inside Code 'A':

SELECT 1 as RowID,'1' as RowKey,'en' as Locale,'A' as Code UNION ALL
SELECT 2,'1','zh','A' UNION ALL
SELECT 3,'1','zh','A' UNION ALL
SELECT 4,'2','en','A' UNION ALL
SELECT 5,'2','en','A' UNION ALL
SELECT 10,'1','zh','A' UNION ALL
SELECT 11,'1','zh','A'

The result of this would be RowIDs 3, 5, and 11.

Notice that each range in the result has unique data inside it; the duplicates were separated. The ranges produced were RowIDs 1-2, 3-4, 5-10, and 11, and no range contains a duplicate, since the data was already divided/separated at each one. That's the reason I want to locate the duplicates iteratively and always start from the previous duplicate found to find the next one.

After that, I need to combine the results of A & B.
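To be clear, part A alone can already be done set-based, for example the way bhovious's cte1 does it above (a sketch, reusing the SeqID identity column from his #TempTable script):

-- Part A alone: a row is a "break" when its code differs from the code
-- of the row immediately before it in RowID order (SeqID is the
-- identity column that follows that order).
SELECT t1.RowID, t1.Code
FROM #TempTable t1
JOIN #TempTable t2 ON t1.SeqID = t2.SeqID + 1
WHERE t1.Code <> t2.Code

B is the part that resists a set-based form, because every duplicate found moves the starting point of the next search.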

  • What if there were an additional row in your example?

    SELECT 6,'2','en','A' UNION ALL

    Include 5? Include 6? Include 5 and 6 both?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • bhovious (10/27/2008)


    What if there were an additional row in your example?

    SELECT 6,'2','en','A' UNION ALL

    Include 5? Include 6? Include 5 and 6 both?

It should include both 5 & 6, since 5 is a dupe of 4 and 6 is a dupe of 5. Always start looking for the next duplicate from the previous duplicate found.

  • Hi Dan

If you check my last post you should be getting the correct results, but you need to change the last query, before viewing the results, to the one below:

INSERT #TempTable
SELECT 0,'1','en','C' UNION ALL
SELECT 1,'1','en','A' UNION ALL
SELECT 2,'1','en','A' UNION ALL
SELECT 5,'1','en','C' UNION ALL
SELECT 7,'1','en','A' UNION ALL
SELECT 8,'1','zh','A' UNION ALL
SELECT 10,'1','en','A' UNION ALL
SELECT 11,'1','zh','A' UNION ALL
SELECT 12,'1','zh','A' UNION ALL
SELECT 13,'1','zh','A' UNION ALL
SELECT 14,'1','zh','A' UNION ALL
SELECT 20,'1','en','D' UNION ALL
SELECT 21,'1','en','D'
GO

--Remove First Record in the table---------
Delete from tgt
From #Result as tgt
Inner Join (
    Select Min(RowID) as RowID
    From #Seq
) as seq
    on seq.RowID = tgt.RowID
------------------------------

Select * From #Result

  • The above change produces the following results

RowID  RowKey  Locale  Code
1      1       en      A
2      1       en      A
5      1       en      C
7      1       en      A
10     1       en      A
12     1       zh      A
13     1       zh      A
14     1       zh      A
20     1       en      D

Note: the first row in the table (RowID = 0) has been removed, and rows 12, 13, and 14 are all in the result set as required.

    Am I still missing something in the solution?

  • Thanks for the reply, I believe I understand now. Regardless of the "why", this was an interesting problem to look at. If Anziobake's solution still doesn't get it right, I'll give it another shot this evening.

    It seems like a set-based solution should be possible, but no guarantees about the speed.

The final set should include the breaks from the Code column plus a subset of the "A" duplicates. If I'm on the right track, the definition of that subset eliminates any duplicate that was matched to a rowID less than the rowID of any other duplicate. Although this seems iterative, I still think someone can get there. I'm posting this thought because work has gotten busy and I may not be able to follow up tonight. Good hunting, Anzio and Garadin!!

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Ok, so I've given up on the old method and come up with this. Not set based :crying:. Hopefully I didn't completely bugger up my cursor syntax; I had to look it up several times. Dan, if you could, let us know how both of these solutions work, time- and consistency-wise.

    Warning to sensitive eyes, RBAR follows :hehe:.

USE tempdb;
SET NOCOUNT ON;
GO

IF OBJECT_ID ('tempdb..#TempTable','U') IS NOT NULL
BEGIN
    DROP TABLE #TempTable
END

IF OBJECT_ID ('tempdb..#Deleted','U') IS NOT NULL
BEGIN
    DROP TABLE #Deleted
END

CREATE TABLE #TempTable
(
    RowID   INT PRIMARY KEY CLUSTERED,
    RowKey  VARCHAR(1000),
    Locale  VARCHAR(10),
    Code    CHAR(1),
    GroupID INT NULL
)
GO

INSERT #TempTable(RowID, RowKey, Locale, Code)
SELECT 1,'1','en','C' UNION ALL
SELECT 2,'1','de','C' UNION ALL
SELECT 3,'1','zh','A' UNION ALL
SELECT 4,'2','en','A' UNION ALL
SELECT 5,'2','en','A' UNION ALL
SELECT 10,'1','zh','A' UNION ALL
SELECT 11,'1','zh','A' UNION ALL
SELECT 15,'1','zh','C' UNION ALL
SELECT 16,'1','de','C' UNION ALL
SELECT 20,'1','en','A' UNION ALL
SELECT 21,'3','en','A' UNION ALL
SELECT 22,'3','de','A' UNION ALL
SELECT 23,'3','en','A' UNION ALL
SELECT 24,'3','de','A' UNION ALL
SELECT 30,'1','en','C' UNION ALL
SELECT 31,'1','en','C'

DECLARE @GroupID     INT,
        @PrevCode    CHAR(1),
        @RowID       INT,
        @Locale      VARCHAR(10),
        @RowKey      VARCHAR(1000),
        @PrevGroupID INT

SET @PrevCode = (SELECT Code FROM #TempTable WHERE RowID = 1)
SET @GroupID  = 1

------------------------------------------------------------
-- Breaks your data out into groups so it can be scanned   --
------------------------------------------------------------
UPDATE #TempTable
SET @GroupID  = CASE WHEN Code <> @PrevCode THEN @GroupID + 1 ELSE @GroupID END,
    GroupID   = @GroupID,
    @PrevCode = Code
FROM #TempTable WITH (TABLOCKX, INDEX(1)) -- scan the clustered index in order
OPTION (MAXDOP 1)
------------------------------------------------------------

DECLARE @CurRows TABLE (
    RowKey VARCHAR(1000),
    Locale VARCHAR(10))

CREATE TABLE #Deleted(
    RowID INT)

DECLARE TTCursor CURSOR FORWARD_ONLY FOR
SELECT RowID, GroupID, RowKey, Locale
FROM #TempTable
WHERE Code = 'A'
ORDER BY RowID

OPEN TTCursor
FETCH NEXT FROM TTCursor INTO @RowID, @GroupID, @RowKey, @Locale

WHILE @@FETCH_STATUS = 0
BEGIN
    IF @GroupID <> @PrevGroupID
        DELETE FROM @CurRows        -- new code group: clear the tracked rows

    IF EXISTS (SELECT * FROM @CurRows WHERE RowKey = @RowKey AND Locale = @Locale)
    BEGIN
        INSERT INTO #Deleted(RowID) -- flag this row as a duplicate
        VALUES(@RowID)

        DELETE FROM @CurRows        -- delete all the rows from the tracking table,
                                    -- so matching restarts from this duplicate
    END

    INSERT INTO @CurRows(RowKey, Locale) -- repopulate the table with the current row
    VALUES(@RowKey, @Locale)

    SET @PrevGroupID = @GroupID
    FETCH NEXT FROM TTCursor INTO @RowID, @GroupID, @RowKey, @Locale
END

CLOSE TTCursor
DEALLOCATE TTCursor

SELECT * FROM #Deleted -- See just the ones to be deleted from the cursor.

SELECT * FROM #TempTable T -- See all rows matching criteria.
    LEFT JOIN #Deleted D ON T.RowID = D.RowID
    LEFT JOIN (SELECT GroupID, MIN(RowID) FirstGroupRow
               FROM #TempTable
               WHERE GroupID > 1
               GROUP BY GroupID) T2 ON T.RowID = T2.FirstGroupRow
WHERE D.RowID IS NOT NULL OR T2.FirstGroupRow IS NOT NULL

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation
