compare of 2 big tables

  • i have a table with 170million rows (tableA),

    and a second table which have about 500,000 rows(tableB).

    i want to compare a certain column from tableA (type bigint)

    to check if a value of it exist in tableB.

    i tried to do this using inner join, making a select with IN expression, but it runs for hours.

    any idea how i can improve te result?

    (the column from tableA dosent have an index, and the column from tableB is defined as primary key)

    Thanks

    Peleg

  • You should try "WHERE EXISTS" instead of "IN".

    Also, INTERSECT and EXCEPT might be better if you can't use EXISTS.

    Converting oxygen into carbon dioxide, since 1955.
  • i used the Exists and it seems to help,

    the problem is that i split the compare into bulk of 1 million rows each times(which take 1.18 min), but when i try to make a query

    with a parameter (@minid), the time increases :

    select distinct ReasonRef from Con_RecipientToMessages (nolock)

    where Con_RecipientToMessages.Id>=@minid and Con_RecipientToMessages.Id<= @minid+1000000

    and exists (

    select ReasonRef from #tblReasonList(nolock) where Con_RecipientToMessages.ReasonRef=#tblReasonList.ReasonRef

    )

  • Any reason why you cannot also create an index on table a, then drop it when you are done?

  • i will check it,

    but what can i do with the query where i use a variable?

  • If you create the index, you may not need to break it up into smaller chunks. Creating an index on the id field will still improve response time, even with the parameters. Also, you may look at using a join instead of a subquery, esp when you have the index in place.

  • i added the index and it improved the query time,

    i have 2 questions :

    1)when i added an index - can i check how long it will take or approximatly how much was done?

    2)when i am using parameters in the query it still slower then working with the values themselfs - any idea how can i improve it?

    thanks

    peleg

  • peleg k (3/18/2010)


    1)when i added an index - can i check how long it will take or approximatly how much was done?

    How long it will take - no.

    How much was done - depends what you mean. How much what?

    2)when i am using parameters in the query it still slower then working with the values themselfs - any idea how can i improve it?

    Yes. Add the OPTION (RECOMPILE) hint as shown:

    -- Something like this index should probably exist anyway

    CREATE NONCLUSTERED INDEX nc1

    ON dbo.Con_RecipientToMessages (Id)

    INCLUDE (ReasonRef)

    WITH (

    FILLFACTOR = 100,

    MAXDOP = 1,

    ONLINE = OFF,

    SORT_IN_TEMPDB = ON

    );

    -- Optimal index for the temporary table

    -- Add UNIQUE constraint if it is valid

    CREATE NONCLUSTERED INDEX nc1

    ON #tblReasonList (ReasonRef)

    WITH (

    FILLFACTOR = 100,

    MAXDOP = 1,

    ONLINE = OFF,

    SORT_IN_TEMPDB = ON

    );

    SELECT DISTINCT

    CRTM.ReasonRef

    FROM dbo.Con_RecipientToMessages CRTM

    WHERE CRTM.Id BETWEEN @minid AND @minid + 1000000

    AND EXISTS

    (

    SELECT *

    FROM #tblReasonList RL

    WHERE RL.ReasonRef = CRTM.ReasonRef

    )

    OPTION (RECOMPILE);

  • Could you post the actual execution plans with and without use of the parameters?

    Edit: May be redundant after seeing Paul's post 🙂

  • HowardW (3/18/2010)


    Could you post the actual execution plans with and without use of the parameters?

    No worries Howard, it is a good point. It is always nice to see execution plans, but I am thinking that this example is sufficiently simple that we can assume it is a parameter sniffing issue, and OPTION (RECOMPILE) seems a good and simple fix here.

    @peleg: If you want to know more about parameter sniffing, see Batch Compilation, Recompilation, and Plan Caching Issues In SQL Server 2005.

  • i will add the execution plan later beacuse i am in the middle of the task

    but just to update you : when adding the OPTION (RECOMPILE); -> it made the job run up to

    4 minutes, instead of 60-120 seconds

  • peleg k (3/18/2010)


    i will add the execution plan later beacuse i am in the middle of the task

    but just to update you : when adding the OPTION (RECOMPILE); -> it made the job run up to

    4 minutes, instead of 60-120 seconds

    Make sure it is the plan from an actual execution please - an estimated plan omits many of the helpful details.

    Are the statistics up to date on the large table?

  • i attached both execution plans, he first is with variables, and the second with numbers.

    i went over them and didnt see any diffrence?!?

    how do i check the last statistics update?

  • peleg k (3/19/2010)


    i attached both execution plans, he first is with variables, and the second with numbers.

    i went over them and didnt see any diffrence?!?

    Those are estimated plans.

  • i hope this time it ok

Viewing 15 posts - 1 through 15 (of 20 total)

You must be logged in to reply to this topic. Login to reply