Mark similar records in a table with millions of records

  • Actually, the issue of multiple rows not being updated was caused by the workaround: a second UPDATE query I wrote to set the duplicate_with field to the lowest id (or to the row's own InvoiceID if that is smaller). This worked fine when there were only 2 matching rows, but with multiple matching rows it caused this problem. I am pasting the second query here; please let me know what I can do to fix it.

    UPDATE mjtest
    SET duplicate_with = InvoiceID
    WHERE duplicate_type = 'TRUE DUPLICATE'
    AND InvoiceID < duplicate_with
    AND mjtest.duplicate_with IS NOT NULL AND mjtest.duplicate_with <> 0
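    The failure mode of the query above can be reproduced on a toy table. This is a sketch in Python/sqlite3, under the assumption (hypothetical, since the first-pass query is not shown) that the first pass left each row of a three-row duplicate group pointing at some other member of the group rather than at the group's lowest id:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("""CREATE TABLE mjtest (
        InvoiceID INTEGER PRIMARY KEY,
        duplicate_type TEXT,
        duplicate_with INTEGER)""")
    # Hypothetical state after the first pass: a three-row duplicate group
    # in which every row points at *some* other member, not necessarily
    # the lowest id in the group.
    cur.executemany("INSERT INTO mjtest VALUES (?, ?, ?)",
                    [(1, "TRUE DUPLICATE", 3),
                     (2, "TRUE DUPLICATE", 3),
                     (3, "TRUE DUPLICATE", 2)])
    # The second-pass query from the post, unchanged:
    cur.execute("""UPDATE mjtest
                   SET duplicate_with = InvoiceID
                   WHERE duplicate_type = 'TRUE DUPLICATE'
                     AND InvoiceID < duplicate_with
                     AND duplicate_with IS NOT NULL
                     AND duplicate_with <> 0""")
    rows = cur.execute("SELECT InvoiceID, duplicate_with FROM mjtest "
                       "ORDER BY InvoiceID").fetchall()
    print(rows)  # [(1, 1), (2, 2), (3, 2)]
    ```

    Rows 2 and 3 never reach the group minimum (1): the query can only replace duplicate_with with the row's *own* InvoiceID, which for a group of three or more is not necessarily the smallest id in the group.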

  • Try this:

    UPDATE mjtest
    SET duplicate_with = MyAlias.DupID
    FROM mjtest
    LEFT JOIN
    (
        SELECT IDSrc.InvoiceID, DupID = MIN(DupSrc.InvoiceID)
        FROM mjtest IDSrc
        CROSS JOIN mjtest DupSrc
        WHERE IDSrc.VendorNumber LIKE '%' + DupSrc.VendorNumber + '%'
          AND IDSrc.CheckNumber LIKE '%' + DupSrc.CheckNumber + '%'
          AND IDSrc.InvoiceNumber LIKE '%' + DupSrc.InvoiceNumber + '%'
          AND IDSrc.InvoiceDate BETWEEN DupSrc.InvoiceDate - 15 AND DupSrc.InvoiceDate + 15
          AND IDSrc.projectid = DupSrc.projectid
          AND IDSrc.duplicate_with IS NULL
        GROUP BY IDSrc.InvoiceID
        HAVING COUNT(DupSrc.InvoiceID) > 1
    ) MyAlias
    ON mjtest.InvoiceID = MyAlias.InvoiceID
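    The group-minimum idea behind that query can be sketched on a toy table. The following Python/sqlite3 version is an assumption-laden simplification, not the poster's schema: the match is reduced to exact equality on three columns, and a correlated subquery is used instead of T-SQL's UPDATE ... FROM (which not every engine supports). Every member of a duplicate group ends up pointing at the group's lowest InvoiceID:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Toy version of mjtest: rows 1-3 agree on vendor/check/project and
    # form one duplicate group; row 4 matches nothing.
    cur.execute("""CREATE TABLE mjtest (
        InvoiceID INTEGER PRIMARY KEY,
        VendorNumber TEXT, CheckNumber TEXT, projectid INTEGER,
        duplicate_with INTEGER)""")
    cur.executemany("INSERT INTO mjtest VALUES (?, ?, ?, ?, NULL)",
                    [(1, "V1", "C1", 10), (2, "V1", "C1", 10),
                     (3, "V1", "C1", 10), (4, "V2", "C9", 10)])
    # Same idea as the UPDATE ... FROM above: each row in a group of
    # more than one gets the group's MIN(InvoiceID).
    cur.execute("""
        UPDATE mjtest
        SET duplicate_with = (
            SELECT MIN(d.InvoiceID) FROM mjtest d
            WHERE d.VendorNumber = mjtest.VendorNumber
              AND d.CheckNumber = mjtest.CheckNumber
              AND d.projectid = mjtest.projectid)
        WHERE (SELECT COUNT(*) FROM mjtest d
               WHERE d.VendorNumber = mjtest.VendorNumber
                 AND d.CheckNumber = mjtest.CheckNumber
                 AND d.projectid = mjtest.projectid) > 1""")
    rows = cur.execute("SELECT InvoiceID, duplicate_with FROM mjtest "
                       "ORDER BY InvoiceID").fetchall()
    print(rows)  # [(1, 1), (2, 1), (3, 1), (4, None)]
    ```

    All three duplicates point at id 1, and the non-duplicate row is left alone, which is exactly the behavior the single broken UPDATE could not achieve.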

  • Thanks. This worked great.

  • You're welcome.

    - Les

    P.S.

    Just out of curiosity, rough guess, what kind of speedup did you see over using your cursor approach?

  • It's taking 3 minutes now, compared to 25 minutes earlier.

  • If I may make a suggestion: if this is a query you run often, consider restructuring your tables so that, for the fields on which you are searching, the portions on which you are looking for a match are separated from the data that varies by record (thus, you could change your test from "like '%' + <whatever> + '%'" to "= <whatever>"). With proper indexing, your queries will be much, much faster, and the exercise will probably lead you to rethink some of your current data organization, perhaps resulting in a cleaner implementation.

    If you don't wish to lose the fields you currently have in favor of separate fields for the fixed vs. variable portions, you could create the split fields and then change the definition of the current fields to computed fields, moving your indexes to the fields on which you actually search. If you do other forms of searches on the composite fields, you could either relax normalization and duplicate the data in normal, non-computed fields (I would add check constraints to ensure they stay in sync, or at least default constraints to create them automatically) or, if you are using SQL Server 2005 or later, leave them as computed fields and create indexes on the computed fields as well as on the component fields from which they are built.

    In general, if you find the need to manipulate the data in the fields in your WHERE clause (with the exception, perhaps, of ISNULL(), which I suspect the optimizer handles fairly efficiently), you should consider restructuring your tables so that you are storing the data on which you actually search, rather than something from which the search data must be derived. It is much more efficient to derive the output data once the search is done than to derive your search data for every single comparison on every single record in the table.
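    The payoff of this split can be sketched in Python/sqlite3. The table and column names here (invoices, VendorNumberRaw, VendorKey, ix_vendor_key) are hypothetical illustrations, not the poster's schema; the point is that an equality test on an extracted key column is index-assisted, while a LIKE with a leading wildcard cannot use a plain B-tree index at all:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Hypothetical split: the fixed "key" portion of a composite field is
    # stored in its own column, so matching can use '=' instead of LIKE '%..%'.
    cur.execute("""CREATE TABLE invoices (
        InvoiceID INTEGER PRIMARY KEY,
        VendorNumberRaw TEXT,   -- original composite value, kept for output
        VendorKey TEXT)         -- extracted match portion, indexed""")
    cur.execute("CREATE INDEX ix_vendor_key ON invoices (VendorKey)")
    cur.executemany("INSERT INTO invoices VALUES (?, ?, ?)",
                    [(1, "ACME-001/east", "ACME-001"),
                     (2, "ACME-001/west", "ACME-001")])
    # Ask the engine how it would run an equality search on the key column.
    plan = cur.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT * FROM invoices WHERE VendorKey = 'ACME-001'").fetchall()
    print(plan[0][-1])  # the plan text names ix_vendor_key
    ```

    The same EXPLAIN QUERY PLAN on "WHERE VendorNumberRaw LIKE '%ACME-001%'" would show a full scan, which is what the cursor-era 25-minute run time was paying for on every comparison.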

  • This query is run only once for any of the tables, and the data is received in a text format and then imported, so we have to use LIKE statements.
