Need to get consecutive duplicate rows

  • Jeff,

    I appreciate that (and hopefully this isn't coming off as violently defensive or anything like that, because that's definitely not the intent 🙂 ), but that's exactly my point.

    Adding the ROW_NUMBER() and looking for consecutive values there does NOT make it bullet-proof.

    It's what is necessary for this sort of solution when you want the "next" value regardless of gaps. For "consecutive" as I've been using it, that will actually give incorrect results. 

    Imagine, for example, that you're looking for consecutive days ("ColA" in this example) that share some features (ColB,ColC,ColD). Doing a ROW_NUMBER() ordered by day and then looking for consecutive values of the ROW_NUMBER() would just be incorrect (20180305 and 20180218 are not "consecutive" days, and are not made so just because a table has no rows with date values in between those two).

    If the OP means consecutive like that (which is my default reading of "consecutive"), then it's not that using the extra ROW_NUMBER() is a bullet-proofing improvement; it's just wrong. Similarly, if he wants simply the next value, then mine is just wrong.

    They're just different requirements. On that note, now that the OP has clarified his scenario, it seems he is just checking to see if the row with the next value in ColA shares values of all the other columns, even if they're not consecutive, so my solutions would have to be modified.

    Cheers!

  • The request, and consequently all the solutions provided, are wrong.

    The nature of requirements has nothing to do with sequence numbers.

    Between 3:20am and 4:15 am a user may click on the same link 3 times, having >20 minutes between clicks, and it would be still 3sequential records in the database which would be considered duplicates according to your request.

    You must identify repetitive clicks within the same web session within a certain time limit - that would be more accurate indication of duplicates.

    And for god's sake - work on your database performance!

    So users don't need to click in frustration on that link again and again, because it just does not do anything.

    _____________
    Code for TallyGenerator

  • Sergiy - Monday, February 19, 2018 7:55 PM

    The request, and consequently all the solutions provided, are wrong.The nature of requirements has nothing to do with sequence numbers.Between 3:20am and 4:15 am a user may click on the same link 3 times, having >20 minutes between clicks, and it would be still 3sequential records in the database which would be considered duplicates according to your request.You must identify repetitive clicks within the same web session within a certain time limit - that would be more accurate indication of duplicates.And for god's sake - work on your database performance!So users don't need to click in frustration on that link again and again, because it just does not do anything.

    Agreed that they're wrong if the rows are based on user clicks.  Not seeing anything like that in the original request, though.  Are you sure the data is based on user clicks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jacob Wilkins - Sunday, February 18, 2018 8:33 AM

    Jeff,

    I appreciate that (and hopefully this isn't coming off as violently defensive or anything like that, because that's definitely not the intent 🙂 ), but that's exactly my point.

    Adding the ROW_NUMBER() and looking for consecutive values there does NOT make it bullet-proof.

    It's what is necessary for this sort of solution when you want the "next" value regardless of gaps. For "consecutive" as I've been using it, that will actually give incorrect results. 

    Imagine, for example, that you're looking for consecutive days ("ColA" in this example) that share some features (ColB,ColC,ColD). Doing a ROW_NUMBER() ordered by day and then looking for consecutive values of the ROW_NUMBER() would just be incorrect (20180305 and 20180218 are not "consecutive" days, and are not made so just because a table has no rows with date values in between those two).

    If the OP means consecutive like that (which is my default reading of "consecutive"), then it's not that using the extra ROW_NUMBER() is a bullet-proofing improvement; it's just wrong. Similarly, if he wants simply the next value, then mine is just wrong.

    They're just different requirements. On that note, now that the OP has clarified his scenario, it seems he is just checking to see if the row with the next value in ColA shares values of all the other columns, even if they're not consecutive, so my solutions would have to be modified.

    Cheers!

    There are no dates involved here. Do you have an example of how having gaps in ColA would still work with your original solution?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    First, I'm well aware there are no dates in this particular case; that was obviously an example of the general point that requirements for "consecutive" values are not always merely requirements for the next value, gap or not. Sometimes you want values that are actually consecutive, and not just the next value that exists in the data (e.g. the values 1 and 3 are not consecutive just because the data doesn't include a value of 2).

    Sometimes, yes, the requirement is just to get the next value that exists in the data, and if that is the requirement, then the queries I provided will not work, as I have agreed over and over again. I never once stated that if the requirement is that the values on either side of a gap count as consecutive then my queries will work, so I'm not sure why you're pushing a point on which we've always agreed.

    Again, the flip side of that is that is that if you require truly consecutive values (again, where 3 is not counted as consecutive with 1 just because no 2 exists in the data), then doing the ROW_NUMBER() over the existing values and looking for consecutive values of ROW_NUMBER() is just wrong.

    They're two different requirements. My queries work for a requirement for truly consecutive values, while doing a ROW_NUMBER() over the existing values and using those values works for getting the next value in the data in some defined order, but not for getting truly consecutive values.

    To state it again, if there are gaps in the ColA values AND the requirement is that the values on either side of a gap count as "consecutive" then, no, my queries will not work, as I've stated multiple times in multiple posts now. The queries work perfectly if the requirement is to find truly consecutive values (and the ROW_NUMBER() approach would fail for those requirements).

    At the risk of beating a dead horse, they're completely different requirements; it doesn't make sense to keep asking if a query written to meet requirement A would work to meet some other requirement B with which requirement A is inconsistent.

    As I mentioned in my previous post, now that the OP has clarified the intent, it seems by "consecutive" he just meant "next value that exists in the data", and not truly consecutive, so something like the ROW_NUMBER() change would have to be implemented in the queries I gave (although even the clarified requirements run into the additional issue mentioned by Sergiy).

    Cheers!

  • Agreed that they're wrong if the rows are based on user clicks. Not seeing anything like that in the original request, though. Are you sure the data is based on user clicks.

    Jeff, see the explanation from TS four posts before my one:

    Let me explain the original problem:

    Users could click many times, not realizing they were updating the item over and over (bad latency protection, I know). Each one of those updates caused a history snapshot in a history table (simulated in the table in my original post). Since the users are paid by the click, you can see the problem.

    _____________
    Code for TallyGenerator

  • Sergiy - Tuesday, February 20, 2018 1:48 AM

    Agreed that they're wrong if the rows are based on user clicks. Not seeing anything like that in the original request, though. Are you sure the data is based on user clicks.

    Jeff, see the explanation from TS four posts before my one:

    Let me explain the original problem:Users could click many times, not realizing they were updating the item over and over (bad latency protection, I know). Each one of those updates caused a history snapshot in a history table (simulated in the table in my original post). Since the users are paid by the click, you can see the problem.

    Ah... so I see.  Definitely agree with you on your previous about none of the solutions being correct for the problem.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • pfefferl - Friday, February 16, 2018 1:57 PM

    I need a query that can get me (the ID of) consecutive duplicate rows, where I am grouping on some of the columns, don't care about rest.
    I know I have to use row  over to look for consecutive, and group by certain columns that make the dupes, but I am having a hard time with the grouping, and where to put the grouping (sub query, main query).

    The following sets up my data in a temp table :
    ---------------------------------------------------------------
    IF OBJECT_ID('TempDB..#Temp') IS NOT NULL
    DROP TABLE #Temp;

    CREATE TABLE #Temp
    ( ColA INT
    ,ColB INT
    ,ColC CHAR(1)
    ,ColD CHAR(1)
    ,ColE DECIMAL(3,2)
    )

    INSERT INTO #Temp ( ColA , ColB , ColC, ColD, ColE)
        SELECT 0, 1, 'A', 'A',1.1
    UNION ALL SELECT 1, 1, 'A', 'A',1.2
    UNION ALL SELECT 2, 1, 'A', 'A',1.3
    UNION ALL SELECT 3, 1, 'D', 'X',1.4
    UNION ALL SELECT 4, 1, 'A', 'A',1.5
    UNION ALL SELECT 5, 1, 'A', 'A',1.6
    UNION ALL SELECT 6, 1, 'A', 'A',1.7
    UNION ALL SELECT 7, 1, 'A', 'A',1.8
    UNION ALL SELECT 8, 2, 'A', 'B',2.1
    UNION ALL SELECT 9, 2, 'A', 'B',2.2
    UNION ALL SELECT 10, 2, 'D', 'X',2.3
    UNION ALL SELECT 11, 2, 'A', 'B',2.4
    UNION ALL SELECT 12, 2, 'A', 'B',2.5
    UNION ALL SELECT 13, 2, 'A', 'B',2.6
    UNION ALL SELECT 14, 2, 'A', 'B',1.1

    SELECT * FROM #Temp

    -A dupe is only considering ColB, C and D (highlighted below). I Need the IDs (ColA)

    ColA    ColB    ColC    ColD    ColE

    0    1    A    A    1.10
    1    1    A    A    1.20
    2    1    A    A    1.30
    3    1    D    X    1.40
    4    1    A    A    1.50
    5    1    A    A    1.60
    6    1    A    A    1.70
    7    1    A    A    1.80
    8    2    A    B    2.10
    9    2    A    B    2.20
    10    2    D    X    2.30
    11    2    A    B    2.40
    12    2    A    B    2.50
    13    2    A    B    2.60
    14    2    A    B    1.10

    ColA
    -----
     1 
    2

     5
     6
     7
     9
    12
    13
    14

    Any help would be appreciated!

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • >> I need a query that can get me (the ID of) consecutive duplicate rows, where I am grouping on some of the columns, don't care about rest. <<

    since rows have no ordering whatsoever in RDBMS, the whole concept of “consecutive” has no meaning. You can talk about groups and a count (cardinality) of a group.

    Is there any chance that you will take the time to learn what a table is and post correct DDL? Or do you expect other people to do this for you for the rest of your life? Did you understand the concept that a table must have a key? The garbage you posted can never have a key, because all the columns can be nulls. Oh dear God, let’s try to fix the mess. Oh besides the nullable columns I see you still use punchcards. Back in the 1960s when I was programming, we used the leading, on a punch card to make sure we can rearrange them and reuse them. Yes, it destroys readability, but our time was cheaper the computer time.

    >> I know I have to use row over to look for consecutive, and group by certain columns that make the dupes, but I am having a hard time with the grouping, and where to put the grouping (sub query, main query). <<
    will and I will and
    >> The following sets up my data in a temp table :<<

    CREATE TABLE Foobar
    (col_a INTEGER NOT NULL PRIMARY KEY,
    col_b INTEGER NOT NULL,
    col_c CHAR(1) NOT NULL,
    col_d CHAR(1) NOT NULL,
    col_e DECIMAL(3, 2) NOT NULL);

    why are you using the obsolete, old, original Sybase syntax for table construction?

    INSERT INTO Foobar (col_a , col_b , col_c, col_d, col_e)
    VALUES
    (0, 1, 'A', 'A', 1.1),
    (1, 1, 'A', 'A', 1.2),
    (2, 1, 'A', 'A', 1.3),
    (3, 1, 'D', 'X', 1.4),
    (4, 1, 'A', 'A', 1.5),
    (5, 1, 'A', 'A', 1.6),
    (6, 1, 'A', 'A', 1.7),
    (7, 1, 'A', 'A', 1.8),
    (8, 2, 'A', 'B', 2.1),
    (9, 2, 'A', 'B', 2.2),
    (10, 2, 'D', 'X', 2.3),
    (11, 2, 'A', 'B', 2.4),
    (12, 2, 'A', 'B', 2.5),
    (13, 2, 'A', 'B', 2.6),
    (14, 2, 'A', 'B', 1.1);

    >> -A dupe is only considering col_b, col_c and col_d (highlighted below). I Need the IDs (col_a)<<

    There is no such thing as a generic ID in RDBMS; it must be the identifier of something in particular. You are just using a number to physically locate a record in a deck of punch cards, but you using rows in SQL table. No!!

    my first guess was that you were looking for islands and gaps, a well-known problem in SQL. But this would have given you a col_a groupings of (0 to 2) (3 to 3) (4 to 7) (8 TO 9) (10 TO 10) (11 TO 14). by definition, the groupings cannot be represented by a single column.

    Can you explain, in relational logical or set oriented terms what you’re after? This language is not good at sequences. Would have to do all kinds of road numbering in math to sort of fake what you’re trying to ask for.

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • jcelko212 32090 - Wednesday, February 21, 2018 2:05 PM

    pfefferl - Friday, February 16, 2018 1:57 PM

    I need a query that can get me (the ID of) consecutive duplicate rows, where I am grouping on some of the columns, don't care about rest.
    I know I have to use row  over to look for consecutive, and group by certain columns that make the dupes, but I am having a hard time with the grouping, and where to put the grouping (sub query, main query).

    The following sets up my data in a temp table :
    ---------------------------------------------------------------
    IF OBJECT_ID('TempDB..#Temp') IS NOT NULL
    DROP TABLE #Temp;

    CREATE TABLE #Temp
    ( ColA INT
    ,ColB INT
    ,ColC CHAR(1)
    ,ColD CHAR(1)
    ,ColE DECIMAL(3,2)
    )

    INSERT INTO #Temp ( ColA , ColB , ColC, ColD, ColE)
        SELECT 0, 1, 'A', 'A',1.1
    UNION ALL SELECT 1, 1, 'A', 'A',1.2
    UNION ALL SELECT 2, 1, 'A', 'A',1.3
    UNION ALL SELECT 3, 1, 'D', 'X',1.4
    UNION ALL SELECT 4, 1, 'A', 'A',1.5
    UNION ALL SELECT 5, 1, 'A', 'A',1.6
    UNION ALL SELECT 6, 1, 'A', 'A',1.7
    UNION ALL SELECT 7, 1, 'A', 'A',1.8
    UNION ALL SELECT 8, 2, 'A', 'B',2.1
    UNION ALL SELECT 9, 2, 'A', 'B',2.2
    UNION ALL SELECT 10, 2, 'D', 'X',2.3
    UNION ALL SELECT 11, 2, 'A', 'B',2.4
    UNION ALL SELECT 12, 2, 'A', 'B',2.5
    UNION ALL SELECT 13, 2, 'A', 'B',2.6
    UNION ALL SELECT 14, 2, 'A', 'B',1.1

    SELECT * FROM #Temp

    -A dupe is only considering ColB, C and D (highlighted below). I Need the IDs (ColA)

    ColA    ColB    ColC    ColD    ColE

    0    1    A    A    1.10
    1    1    A    A    1.20
    2    1    A    A    1.30
    3    1    D    X    1.40
    4    1    A    A    1.50
    5    1    A    A    1.60
    6    1    A    A    1.70
    7    1    A    A    1.80
    8    2    A    B    2.10
    9    2    A    B    2.20
    10    2    D    X    2.30
    11    2    A    B    2.40
    12    2    A    B    2.50
    13    2    A    B    2.60
    14    2    A    B    1.10

    ColA
    -----
     1 
    2

     5
     6
     7
     9
    12
    13
    14

    Any help would be appreciated!

    Please read, Joe.  The OP DID provide the necessary DDL and readily consumable data.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • jcelko212 32090 - Wednesday, February 21, 2018 2:06 PM

    since rows have no ordering whatsoever in RDBMS, the whole concept of “consecutive†has no meaning. You can talk about groups and a count (cardinality) of a group.

    Absolute garbage, Joe.  I'll agree that there is no guarantee of a "natural order" but the order defined by ColA (in this case) is acceptable.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Wednesday, February 21, 2018 2:34 PM

    jcelko212 32090 - Wednesday, February 21, 2018 2:06 PM

    Where is the DDL that enforces this? I didn't see a create sequence statement anywhere. I did not even see an IDENTITY column declared. (Which I would not like, but at least it would work). Sorry after so many decades of teaching SQL. One of the things I've learned is that you can't assume something. If you make a column insanely long, eventually someone will fill it up with an insanely long meaningless string. If you don't prohibit negative numbers, someone will order -500 widgets. Etc. it is important to make people think of a declarative language in terms of, well, let's go and say it, declarations. Once upon a time, I had a client tell me "yeah, but everybody knows that!" And I had to tell him that the optimizer sure as hell didn't.since rows have no ordering whatsoever in RDBMS, the whole concept of “consecutive†has no meaning. You can talk about groups and a count (cardinality) of a group.

    Absolute garbage, Joe.  I'll agree that there is no guarantee of a "natural order" but the order defined by ColA (in this case) is acceptable.

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • jcelko212 32090 - Wednesday, February 21, 2018 3:23 PM

    Jeff Moden - Wednesday, February 21, 2018 2:34 PM

    jcelko212 32090 - Wednesday, February 21, 2018 2:06 PM

    Where is the DDL that enforces this? I didn't see a create sequence statement anywhere. I did not even see an IDENTITY column declared. (Which I would not like, but at least it would work). Sorry after so many decades of teaching SQL. One of the things I've learned is that you can't assume something. If you make a column insanely long, eventually someone will fill it up with an insanely long meaningless string. If you don't prohibit negative numbers, someone will order -500 widgets. Etc. it is important to make people think of a declarative language in terms of, well, let's go and say it, declarations. Once upon a time, I had a client tell me "yeah, but everybody knows that!" And I had to tell him that the optimizer sure as hell didn't.since rows have no ordering whatsoever in RDBMS, the whole concept of “consecutive†has no meaning. You can talk about groups and a count (cardinality) of a group.

    Absolute garbage, Joe.  I'll agree that there is no guarantee of a "natural order" but the order defined by ColA (in this case) is acceptable.

    While I agree with THAT thought, the posted data is supposedly a result of a query that does impart a temporal order in ColA.  Whether that's true or not doesn't change the fact that your comment about "rows have no ordering whatsoever in RDBMS" is logically incorrect.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Wednesday, February 21, 2018 3:59 PM

    jcelko212 32090 - Wednesday, February 21, 2018 3:23 PM

    Jeff Moden - Wednesday, February 21, 2018 2:34 PM

    jcelko212 32090 - Wednesday, February 21, 2018 2:06 PM

    While I agree with THAT thought, the posted data is supposedly a result of a query that does impart a temporal order in ColA.  Whether that's true or not doesn't change the fact that your comment about "rows have no ordering whatsoever in RDBMS" is logically incorrect.

    unh?? Rows are identified by a key, which exist within a table. Columns are also identified by names and not by positions. Jeff, what the heck? This is basic RDBMS

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • jcelko212 32090 - Wednesday, February 21, 2018 7:30 PM

    Jeff Moden - Wednesday, February 21, 2018 3:59 PM

    jcelko212 32090 - Wednesday, February 21, 2018 3:23 PM

    Jeff Moden - Wednesday, February 21, 2018 2:34 PM

    jcelko212 32090 - Wednesday, February 21, 2018 2:06 PM

    While I agree with THAT thought, the posted data is supposedly a result of a query that does impart a temporal order in ColA.  Whether that's true or not doesn't change the fact that your comment about "rows have no ordering whatsoever in RDBMS" is logically incorrect.

    unh?? Rows are identified by a key, which exist within a table. Columns are also identified by names and not by positions. Jeff, what the heck? This is basic RDBMS

    Indeed it is basic RDBMS and ColA, in this example, is being used to impart the required logical order.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 16 through 29 (of 29 total)

You must be logged in to reply to this topic. Login to reply