Remove successive rows with same column value

  • Jeff Moden (5/4/2012)


    I think the extra sorts are absolutely worth it, though. They get rid of the accidental cross join, nearly an order of magnitude of reads, and runs several orders of magnitude faster according to priofiler.

    It depends, as always (e.g. the sorts might spill to disk). Anyway, the point I'm trying to make here is that the optimizer has limitations which mean it does not choose as good an execution plan as it might here. Aside from anything else, things like ROW_NUMBER are not relational, and the optimizer reasons about things primarily in relational terms. ROW_NUMBER does not produce a set (strictly, a multi-set) it produces a sequence. Adding the row number column (projecting a new column in the jargon) means projecting a sequence; hence the Sequence Project operator in the plan. More limitations mean the common table expression is evaluated, and rows numbered, twice. Yet more mean the merge join has to account for the possibility of duplicates, and it runs in many-to-many mode with a needless work table.

    Often, it is possible to give the optimizer better information, so it will produce good plans now and in the future when the number of rows and distribution changes. One way to do this is to materialize the CTE:

    CREATE TABLE #Materialized

    (

    sortOrder bigint PRIMARY KEY CLUSTERED,

    idLigne character (2) NOT NULL,

    id integer NOT NULL,

    statut character(1) NOT NULL,

    );

    INSERT #Materialized

    (

    sortOrder,

    idLigne,

    id,

    statut

    )

    SELECT

    sortOrder = ROW_NUMBER() OVER (ORDER BY mt.idLigne),

    mt.idLigne,

    mt.id,

    mt.statut

    FROM

    dbo.MyTest AS mt;

    -- Pretend there are a few more rows

    UPDATE STATISTICS #Materialized WITH ROWCOUNT = 50, PAGECOUNT = 1;

    -- One to many merge

    SELECT

    this.idLigne,

    this.id,

    this.statut

    FROM #Materialized AS this

    WHERE

    NOT EXISTS

    (

    SELECT 1

    FROM #Materialized AS nxt

    WHERE

    nxt.sortOrder = this.sortOrder + 1

    AND nxt.statut = this.statut

    );

    DROP TABLE #Materialized;

  • Jeff Moden (5/4/2012)


    To continue, both queries use a "Work Table". The one with the MERGE hint uses it a whole lot less, though.

    Sure. The work table in the loops join case is for the spool, the one in the merge join case is for the logical possibility of duplicates (the many-to-many part) but is obviously never actually required (though it is still created).

  • SQL Kiwi (5/4/2012)


    One way to do this is to materialize the CTE

    I'm definitely a fan of that especially when a CTE is referenced more than once.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/5/2012)


    SQL Kiwi (5/4/2012)


    One way to do this is to materialize the CTE

    I'm definitely a fan of that especially when a CTE is referenced more than once.

    I think that is one thing Oracle does right with their subquery refactoring clause (aka, SQL Servers cte). It can either execute it inline or use it as a temporary table.

  • Lynn Pettis (5/5/2012)


    Jeff Moden (5/5/2012)


    SQL Kiwi (5/4/2012)


    One way to do this is to materialize the CTE

    I'm definitely a fan of that especially when a CTE is referenced more than once.

    I think that is one thing Oracle does right with their subquery refactoring clause (aka, SQL Servers cte). It can either execute it inline or use it as a temporary table.

    Have you voted for the Connect item?

    https://connect.microsoft.com/SQLServer/feedback/details/218968/provide-a-hint-to-force-intermediate-materialization-of-ctes-or-derived-tables

  • I have, indeed.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • So have I.

  • Hello everyone. I'm sorry if I took time to answer you I was not home this weekend.

    Thank you for your response ! your proposals are very interesting especially the solution without creating and dropping a table.

    I gave you an example too simple to understand the real problem seeing this example:

    I have the table "TableSorted"

    idLigne | id | stat | datetime

    1 1 A 1/01/2011 14:00:00

    2 1 A 1/01/2011 14:00:05

    3 1 A 1/01/2011 15:00:00

    4 1 B 1/01/2011 16:00:00

    5 1 A 1/01/2011 17:00:00

    6 1 B 1/01/2011 18:00:00

    7 1 B 1/01/2011 19:00:00

    8 2 A 1/01/2011 12:00:00

    9 2 A 1/01/2011 12:30:00

    10 2 A 2/01/2011 11:00:00

    11 2 A 2/01/2011 11:10:00

    12 2 B 2/01/2011 15:00:00

    and here is the desired result:

    idLigne | id | stat | datetime

    3 1 A 1/01/2011 15:00:00

    4 1 B 1/01/2011 16:00:00

    5 1 A 1/01/2011 17:00:00

    7 1 B 1/01/2011 19:00:00

    9 2 A 1/01/2011 12:30:00

    11 2 A 2/01/2011 11:10:00

    12 2 B 2/01/2011 15:00:00

    I opt finally to use of the following query that i found simple :

    with transformed as (

    select idLigne, id,

    stat,datetime,

    row_number () over (order by idLigne) rn

    from TableSorted)

    select TableSorted.*

    from TableSorted

    inner join transformed t1

    on TableSorted.idLigne = t1.idLigne

    left join transformed t2

    on t1.rn = t2.rn - 1

    and t1.stat = t2.stat and t1.id = t2.id and year(t1.datetime) = year(t2.datetime) and month(t1.datetime) = month(t2.datetime) and day(t1.datetime) = day(t2.datetime)

    where t2.rn is null

Viewing 8 posts - 16 through 22 (of 22 total)

You must be logged in to reply to this topic. Login to reply