Indexing question

  • Thanks for the long version... I'll read that when I have a few hours to spare :w00t:.

  • The DNC comes as a flat text file of telephone numbers. We have to match it against our database of customers and mark them as unable to call unless we have signed consent. To know which numbers are new, we would need to import the flat file into a temp table and then import only the ones that are not already there (sketched below). It was deemed easier to truncate and re-import the file into the main table than to create a multi-step process and maintain import dates.

    The original question was not about a way to improve the import (that will come later) but about improving the marking of the invalid flag in the customer table. I was hoping that an index on the DNC would help, but it didn't. Nor could I come up with a faster query, although I think number 3 is easier to read.
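
    For context, the multi-step version would have looked roughly like this (table and column names are illustrative, not our real ones), plus further steps to handle deletes and track import dates:

    -- Stage the flat file, then copy over only the numbers we don't have yet.
    INSERT INTO tb_dnc (phone)
    SELECT s.phone
    FROM #Staging AS s
    WHERE NOT EXISTS (SELECT * FROM tb_dnc AS d WHERE d.phone = s.phone)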


  • Sorry if we missed the question. For the query optimisation we'd need the execution plan... but I'm afraid Sergiy is right on this one... the process may need rethinking.

  • mrpolecat,

    1st, if you stick to the PK then create it before uploading the data.

    In this case the load will insert directly into the index, with no time wasted reordering already-loaded rows.

    And make sure there is an index on the matching column in the existing table.

    2nd, get rid of the "IN" statements.

    I would suggest using a FULL HASH JOIN: there seems to be no way to avoid an index scan, so let the server do it once.

    Catch only those rows which have NULL in either joined column.

    NULL in the "uploaded" column means "deleted", NULL in the "existing" column means "new".
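
    A minimal sketch of that pattern - all table, column, and file names here are illustrative only:

    -- Create the temp table WITH its PK, then load straight into the index.
    CREATE TABLE #Upload (Number varchar(10) NOT NULL PRIMARY KEY)

    BULK INSERT #Upload FROM 'C:\dnc.txt'

    -- Assumes an index already exists on Existing(Number).
    SELECT u.Number AS UploadedNumber,   -- NULL here = "deleted"
           e.Number AS ExistingNumber    -- NULL here = "new"
    FROM #Upload AS u
    FULL HASH JOIN Existing AS e ON e.Number = u.Number
    WHERE u.Number IS NULL OR e.Number IS NULL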

    _____________
    Code for TallyGenerator

  • This might be somewhat cheating - but is your own list (not the DNC - your client list) regional in nature? If it were - that'd be one way to cut the DNC down right quick.

    Otherwise - it might be worth looking at using a partitioned table and running the queries multiple times over smaller sets (rough sketch below). There are some thresholds that seem to make a difference, and I'd bet that running 10 x 14 million number matches will run quite a bit faster than 1 x 140 million.

    Of course if you're running just one update against this and it's fairly simple - then the overhead of the partitioning alone might invalidate that thought.
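
    If it is worth a try, a rough sketch of carving the numbers into ranges with SQL Server 2005 partitioning (boundary values and names purely illustrative):

    -- Roughly one partition per leading digit of a 10-digit number.
    CREATE PARTITION FUNCTION pf_PhoneRange (varchar(10))
    AS RANGE LEFT FOR VALUES ('1999999999', '2999999999', '3999999999',
                              '4999999999', '5999999999', '6999999999',
                              '7999999999', '8999999999')

    CREATE PARTITION SCHEME ps_PhoneRange
    AS PARTITION pf_PhoneRange ALL TO ([PRIMARY])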

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt,

    partitioning won't work.

    He intends to delete all records not in the upload.

    On the 1st 10% of the upload he'll delete about 90% of the existing data - oooops.

    The point is not to do the update on 140M rows.

    Insert into some temp table only the mismatched records:

    INSERT INTO #Table
    SELECT Upload.Number, Existing.Number
    FROM ... FULL HASH JOIN ... ON Upload.Number = Existing.Number
    WHERE Upload.Number IS NULL OR Existing.Number IS NULL

    Then update only the records that appear in this temp table.
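
    For instance, the follow-up could look roughly like this, assuming #Table was created with columns UploadedNumber and ExistingNumber (the IsDeleted flag and table names are illustrative only):

    -- Flag existing numbers that disappeared from the upload...
    UPDATE e
    SET e.IsDeleted = 1
    FROM Existing AS e
    JOIN #Table AS t ON t.ExistingNumber = e.Number
    WHERE t.UploadedNumber IS NULL

    -- ...and add the genuinely new ones.
    INSERT INTO Existing (Number)
    SELECT t.UploadedNumber
    FROM #Table AS t
    WHERE t.ExistingNumber IS NULL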

    _____________
    Code for TallyGenerator

  • I'm not following you - he mentioned tagging those that MATCH, not the other way around. That is something that would lend itself to partitioning.

    If you only ever look at the new ones (new updates or new deletes), how does he deal with new customers in his file? Besides - how do you know they're new if you don't already have the list stored somewhere?

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • I'm not sure I explained the process well since I started with an abstract question and now we are moving to my specific issue. Here we go.

    We have a customer table that we add a few thousand records to each day, some through batch and some through data entry applications.

    tb_leadbox contains 2 million records:

    uid (uniqueidentifier) PK
    uid2 (int) identity, indexed unique
    sourcecode varchar(10)
    fname varchar(50)
    lname varchar(50)
    hphone varchar(13) indexed
    invalidflag varchar(3) indexed

    tb_sourcecode contains 150 records:

    sourcecode varchar(10) PK
    nat_dnc int null

    We get a flat file of 140 million telephone numbers that we can only call if we have a signed release form, which is indicated by the sourcecode. There can be adds and deletes in this list and I have no control over it. It contains only the telephone number.

    We download and import this into a table each week. The process truncates the current table and imports the new file.

    tb_dncnat contains 140 million rows:

    dnc_phone varchar(10)

    Every day we have to verify that our unsigned records are not on the DNC list. If they are, we update their invalidflag.

    I have tried rewriting the update query a few different ways and adding a PK on dnc_phone, but neither made any significant improvement.
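
    For reference, the PK was along these lines (the constraint name is made up; dnc_phone has to be NOT NULL for it to work):

    ALTER TABLE tb_dncnat
        ADD CONSTRAINT PK_tb_dncnat PRIMARY KEY CLUSTERED (dnc_phone)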


  • Sergiy, I looked up hash joins in BOL and, while there were good descriptions, there were no examples. I did write a version of the query that gets rid of the INs and uses joins, but is there something else that makes it a FULL HASH JOIN?

    select uid2, invalidflag
    from tb_leadbox l
    join tb_dncnat on homephone = dnc_phone
    join tb_lb_sourcecode s on l.sourcecode = s.sourcecode
    where (nat_dnc <> 1 or nat_dnc is null) and invalidflag in (0,7)
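
    From what I can make out in BOL, the hash algorithm has to be requested explicitly with a join hint between the join type and JOIN - so presumably something like this (an untested guess on my part), with FULL in place of INNER for the outer-join version Sergiy described:

    select uid2, invalidflag
    from tb_leadbox l
    inner hash join tb_dncnat on homephone = dnc_phone
    inner hash join tb_lb_sourcecode s on l.sourcecode = s.sourcecode
    where (nat_dnc <> 1 or nat_dnc is null) and invalidflag in (0,7)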


  • Just curious - why are the two phone numbers different data lengths (varchar(13) vs varchar(10))? Is there "garbage" coming in that you need to clean out? Also - why use variable length at all - why not fixed-length CHAR values? I know - a lot of that is usually really minor nonsense-type stuff - but if you don't need the overhead of figuring out the length of the data in a field (especially when it should always be exactly 10 chars), why incur it (especially on 140M rows)?

    Is it possible to clean up those numbers first? Also - since numeric comparisons tend to be a whole heck of a lot more efficient - would it make sense to import the DNC as an integer value instead of a string (it should still compare as unique)? Compare it against a computed column in your table (the computed column does the conversion, and is indexed).
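
    Something like this is what I'm picturing - a rough sketch only, with names made up, and assuming the column really does hold just 10 digits (note it needs bigint, since 10-digit numbers overflow int):

    -- The persisted computed column does the conversion once and can be indexed.
    ALTER TABLE tb_leadbox
        ADD hphone_num AS CAST(hphone AS bigint) PERSISTED

    CREATE INDEX IX_leadbox_hphone_num ON tb_leadbox (hphone_num)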

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • SELECT lb.Uid2, lb.InvalidFlag
    FROM tb_LeadBox AS lb
    INNER JOIN tb_DncNat AS dn ON dn.Dnc_Phone = lb.HomePhone
    LEFT JOIN tb_LB_SourceCode AS lbsc ON lbsc.SourceCode = lb.SourceCode
                                      AND lbsc.Nat_DNC = 1
    WHERE lb.InvalidFlag IN (0, 7)
      AND lbsc.SourceCode IS NULL


    N 56°04'39.16"
    E 12°55'05.25"

  • Matt - At one point in time the table allowed for a formatted phone number, xxx-xxx-xxxx. That format is no longer used, so I think I could change the column to char(10) or numeric. I have to check the apps first. The DNC can definitely be changed to numeric.

    Peter - I got the same result as with the other queries. I like it because it treats unmatched sourcecodes as unsigned, which is safer.


  • mrpolecat (10/3/2007)


    Matt - At one point in time the table allowed for a formatted phone number, xxx-xxx-xxxx. That format is no longer used, so I think I could change the column to char(10) or numeric. I have to check the apps first. The DNC can definitely be changed to numeric.

    Multimillion-row tables are not the right place to play childish games with formatting.

    Make it straight.

    And varchar fields are the slowest to compare. You actually use only 10 of the 256 possible values in each byte, but the server still has to compare the whole byte.

    _____________
    Code for TallyGenerator
