De-duping solution

  • I need to come up with an efficient way to de-dupe records prior to adding them to a system.  This needs to be done prior to sending them as this check cannot be done once the record is sent.  The first solution that comes to mind is to create a table and insert every record that i'm adding so then all I have to do is check that table for the record prior to inserting it to make sure it doesn't already exist.  However, after the table grows to several hundred thousand rows, and eventually millions of rows, this solution doesn't seem like it would be the best.  I'm not sure if I'll have a unique ID that I can insert into this table to check on, it might just have to be an email address which will make the search of a million+ rows even worse.  Any suggestions on a better way of doing this or a way of optimizing this way would be greatly appreciated. 

    Thanks

  • A left join or exists query doesn't take that much time.  It may take a few seconds but surely not long assuming you are using the right indexing strategy.

     

    Also before doing any optimisation work, you must define what qualify as a duplicate.  The e-mail seems like a good candidate.  Also this can make a fairly fast index search which could be further optimised down the line to using a checksum search... but this is much farther down the line.  Also even if you have 100 000 000 rows in the table, but that you are only loading a few more 1000 rows, an indexed search on the e-mail should perform very well.

     

    I suggest you first set this up the best way you know how, then see if there's any problem with the solution.  Then come back for more help as we will have a better idea of what the problem is... assuming there is one.

  • Thanks for your input.  I'll give it a try and see how it works out

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply