How can I query it fast ?

  • I have a few million records in a csv file.

    This has only one field which is a driver md5 ( eg. 0x00001DE0033358C743199217832D895C0 )

    I need to look up this in a table which has billions of records.

    Currently this is what I do:

    1. split this into batches of 10000

    2. BULK INSERT in a temp table

    3. Join that temp table with lookup table and get the result

    Is there any way to make it faster ?

    The columns in both tables are in binary format.

  • What you are doing seems to be the right way. Can you elaborate on the requirement, why you are doing this.

  • Thanks.

    This was to lookup for some information only. I got it done using a stored procedure and TVP. It's working fine. Thanks for your time in replying.

  • Why are you doing this in batches? It's usually faster to do a single load than multiple ones.

    Are the columns indexed?

    Are you really trying to return a million rows result set? What for?

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • I would import the records from the .csv file into a user table (not a temp table), and then create a primary key on DriverMD5 column. The other table, the one with billions of rows, should also have an index on DriverMD5. Once done, I would expect a medium powered server to complete a join of these tables, not quickly, but at least within a timeframe of several minutes.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (12/27/2016)


    I would import the records from the .csv file into a user table (not a temp table), and then create a primary key on DriverMD5 column. The other table, the one with billions of rows, should also have an index on DriverMD5. Once done, I would expect a medium powered server to complete a join of these tables, not quickly, but at least within a timeframe of several minutes.

    If the DB can be taken to BULK LOGGED recovery, build the table with the CI in place and using a minimally-logged BULK INSERT. If you turn on TF 610, the import doesn't even need to be sorted in the same order as the CI to get minimal logging.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Valid point. I have already removed batch processing.

    I wanted result of all the hashes. But it can be in batches as well. Not necessarily in one shot.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply