tuning query

  • I am using following query to build a table. right now this query takes about 3 hours. which is way too long. I see that in the execution plan 73% of weight is assigned to a sort. 9% to parallelism distribute streams and 9% to parallelism gather streams.

    I was wondering how can speed up this query based on this info?

    currently there are no relationships defined or any constraints or indexes on either table in the selection.

    select

    TempProjectString,InfoProjID

    INTOCPMRSProjectData

    from (

    select -- a.proj_id,

    a.proj_id as TempProjectString

    ,b.invc_proj_id as InfoProjID

    ,rn = row_number() over (partition by a.proj_id order by len(a.proj_id) - len(b.invc_proj_id))

    from proj_bill B

    join TempData A

    on charindex(reverse(b.invc_proj_id), reverse(a.proj_id)) > 0

    ) d

    where rn = 1

    go

  • Creating indexes would be a good start.

    if possible simplify your join

  • Two things, get an index on the tables, especially a clustered index. Put that on the most likely access path, meaning, if they always select by PK, on the PK or if they always select by a foreign key and a date, use those two columns. It really depends on your app.

    Second, check your parallelism threshold. That's an awfully simply query to go through all the process of distributing and gathering streams for parallel execution. If the threshold is set to a good number, you might experiment, on this query only, of trying the max degree of parallelism hint (MAXDOP).

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Also, create the table first if you can. When you run a SELECT INTO statement to create a table, system resources are locked for the duration of the create statement. If you already know the schema, create the table using a CREATE TABLE statement, then load it with INSERT INTO....SELECT FROM construct.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • shahab (5/20/2008)


    I am using following query to build a table. right now this query takes about 3 hours. which is way too long. I see that in the execution plan 73% of weight is assigned to a sort. 9% to parallelism distribute streams and 9% to parallelism gather streams.

    I was wondering how can speed up this query based on this info?

    currently there are no relationships defined or any constraints or indexes on either table in the selection.

    select

    TempProjectString,InfoProjID

    INTOCPMRSProjectData

    from (

    select -- a.proj_id,

    a.proj_id as TempProjectString

    ,b.invc_proj_id as InfoProjID

    ,rn = row_number() over (partition by a.proj_id order by len(a.proj_id) - len(b.invc_proj_id))

    from proj_bill B

    join TempData A

    on charindex(reverse(b.invc_proj_id), reverse(a.proj_id)) > 0

    ) d

    where rn = 1

    go

    the derived table's row_number() calculation will cause every row in proj_bill to be scanned for every matching row in TempData so the query will be very slow and resource intensive.

    since both terms of the charindex() are reversed, the reversals are redundant. since you're comparing charindex() to zero, that's the equivalent of like (which is probably better optimized internally than charindex()).

    since you're selecting rn=1, it's somewhat arbitrary as to which invc_proj_id you'll receive if multiple invc_proj_id 's have the same length. so, selecting the min(invc_proj_id) from a group by should be equivalent (and more predictable). the group by greatly reduces the number of rows that are considered by row_number().

    select

    TempProjectString,InfoProjID

    INTOCPMRSProjectData

    from (

    select a.proj_id as TempProjectString,

    min(b.invc_proj_id) as InfoProjID,

    -- len(a.proj_id) - len(b.invc_proj_id) as xxx,

    row_number() over (partition by a.proj_id

    order by len(a.proj_id) - len(b.invc_proj_id)) as rn

    from

    proj_bill B join TempData A

    on a.proj_id like '%'+b.invc_proj_id+'%'

    group by a.proj_id, len(a.proj_id) - len(b.invc_proj_id)

    ) d

    where d.rn = 1

  • I'm pretty sure indexing won't do much good on this because of all the formulas in the join and the Row_Number partition/order by.

    Why must it be ordered by the LEN?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply