Mark similar records in a table with millions of records

  • Actually, the issue of multiple rows not being updated was caused by the workaround: a second UPDATE query I wrote to set the duplicate_with field to the lowest id (or to the row's own InvoiceID if that is smaller). This worked fine when there were only 2 matching rows, but with multiple matching rows it caused this problem. I am pasting the second query here; please let me know what I can do to fix it.

    UPDATE mjtest
    SET duplicate_with = InvoiceID
    WHERE duplicate_type = 'TRUE DUPLICATE'
    AND InvoiceID < duplicate_with
    AND mjtest.duplicate_with IS NOT NULL AND mjtest.duplicate_with <> 0
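    The failure mode of the query above can be reproduced on a toy table. This is a sketch in Python/sqlite3, under the assumption (hypothetical, since the first-pass query is not shown) that the first pass left each row of a three-row duplicate group pointing at some other member of the group rather than at the group's lowest id:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("""CREATE TABLE mjtest (
        InvoiceID INTEGER PRIMARY KEY,
        duplicate_type TEXT,
        duplicate_with INTEGER)""")
    # Hypothetical state after the first pass: a three-row duplicate group
    # in which every row points at *some* other member, not necessarily
    # the lowest id in the group.
    cur.executemany("INSERT INTO mjtest VALUES (?, ?, ?)",
                    [(1, "TRUE DUPLICATE", 3),
                     (2, "TRUE DUPLICATE", 3),
                     (3, "TRUE DUPLICATE", 2)])
    # The second-pass query from the post, unchanged:
    cur.execute("""UPDATE mjtest
                   SET duplicate_with = InvoiceID
                   WHERE duplicate_type = 'TRUE DUPLICATE'
                     AND InvoiceID < duplicate_with
                     AND duplicate_with IS NOT NULL
                     AND duplicate_with <> 0""")
    rows = cur.execute("SELECT InvoiceID, duplicate_with FROM mjtest "
                       "ORDER BY InvoiceID").fetchall()
    print(rows)  # [(1, 1), (2, 2), (3, 2)]
    ```

    Rows 2 and 3 never reach the group minimum (1): the query can only replace duplicate_with with the row's *own* InvoiceID, which for a group of three or more is not necessarily the smallest id in the group.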

  • Try this:

    UPDATE mjtest
    SET duplicate_with = MyAlias.DupID
    FROM mjtest
    LEFT JOIN
    (
        SELECT IDSrc.InvoiceID, DupID = MIN(DupSrc.InvoiceID)
        FROM mjtest IDSrc
        CROSS JOIN mjtest DupSrc
        WHERE IDSrc.VendorNumber LIKE '%' + DupSrc.VendorNumber + '%'
          AND IDSrc.CheckNumber LIKE '%' + DupSrc.CheckNumber + '%'
          AND IDSrc.InvoiceNumber LIKE '%' + DupSrc.InvoiceNumber + '%'
          AND IDSrc.InvoiceDate BETWEEN DupSrc.InvoiceDate - 15 AND DupSrc.InvoiceDate + 15
          AND IDSrc.projectid = DupSrc.projectid
          AND IDSrc.duplicate_with IS NULL
        GROUP BY IDSrc.InvoiceID
        HAVING COUNT(DupSrc.InvoiceID) > 1
    ) MyAlias
    ON mjtest.InvoiceID = MyAlias.InvoiceID
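    The group-minimum idea behind that query can be sketched on a toy table. The following Python/sqlite3 version is an assumption-laden simplification, not the poster's schema: the match is reduced to exact equality on three columns, and a correlated subquery is used instead of T-SQL's UPDATE ... FROM (which not every engine supports). Every member of a duplicate group ends up pointing at the group's lowest InvoiceID:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Toy version of mjtest: rows 1-3 agree on vendor/check/project and
    # form one duplicate group; row 4 matches nothing.
    cur.execute("""CREATE TABLE mjtest (
        InvoiceID INTEGER PRIMARY KEY,
        VendorNumber TEXT, CheckNumber TEXT, projectid INTEGER,
        duplicate_with INTEGER)""")
    cur.executemany("INSERT INTO mjtest VALUES (?, ?, ?, ?, NULL)",
                    [(1, "V1", "C1", 10), (2, "V1", "C1", 10),
                     (3, "V1", "C1", 10), (4, "V2", "C9", 10)])
    # Same idea as the UPDATE ... FROM above: each row in a group of
    # more than one gets the group's MIN(InvoiceID).
    cur.execute("""
        UPDATE mjtest
        SET duplicate_with = (
            SELECT MIN(d.InvoiceID) FROM mjtest d
            WHERE d.VendorNumber = mjtest.VendorNumber
              AND d.CheckNumber = mjtest.CheckNumber
              AND d.projectid = mjtest.projectid)
        WHERE (SELECT COUNT(*) FROM mjtest d
               WHERE d.VendorNumber = mjtest.VendorNumber
                 AND d.CheckNumber = mjtest.CheckNumber
                 AND d.projectid = mjtest.projectid) > 1""")
    rows = cur.execute("SELECT InvoiceID, duplicate_with FROM mjtest "
                       "ORDER BY InvoiceID").fetchall()
    print(rows)  # [(1, 1), (2, 1), (3, 1), (4, None)]
    ```

    All three duplicates point at id 1, and the non-duplicate row is left alone, which is exactly the behavior the single broken UPDATE could not achieve.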

  • Thanks. This worked great.

  • You're welcome.

    - Les

    P.S.

    Just out of curiosity, rough guess, what kind of speedup did you see over using your cursor approach?

  • It's taking 3 minutes now, compared to 25 minutes earlier.

  • If I may make a suggestion: if this is a query you run often, consider restructuring your tables so that, for the fields on which you are searching, the portions on which you are looking for a match are separated from the data that varies by record (thus, you could change your test from "like '%' + <whatever> + '%'" to "= <whatever>"). With proper indexing, your queries will be much, much faster, and the exercise will probably lead you to rethink some of your current data organization, perhaps resulting in a cleaner implementation.

    If you don't wish to lose the fields you currently have in favor of separate fields for the fixed vs. variable portions, you could create the split fields and then change the definition of the current fields to computed fields, moving your indexes to the fields on which you actually search. If you do other forms of searches on the composite fields, you could either relax normalization and duplicate the data in normal, non-computed fields (I would add check constraints to ensure they stay in sync, or at least default constraints to create them automatically) or, if you are using SQL Server 2005 or later, leave them as computed fields and create indexes on the computed fields as well as on the component fields from which they are built.

    In general, if you find the need to manipulate the data in the fields in your WHERE clause (with the exception, perhaps, of ISNULL(), which I suspect the optimizer handles fairly efficiently), you should consider restructuring your tables so that you are storing the data on which you actually search, rather than something from which the search data must be derived. It is much more efficient to derive the output data once the search is done than to derive your search data for every single comparison on every single record in the table.
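    The payoff of this split can be sketched in Python/sqlite3. The table and column names here (invoices, VendorNumberRaw, VendorKey, ix_vendor_key) are hypothetical illustrations, not the poster's schema; the point is that an equality test on an extracted key column is index-assisted, while a LIKE with a leading wildcard cannot use a plain B-tree index at all:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Hypothetical split: the fixed "key" portion of a composite field is
    # stored in its own column, so matching can use '=' instead of LIKE '%..%'.
    cur.execute("""CREATE TABLE invoices (
        InvoiceID INTEGER PRIMARY KEY,
        VendorNumberRaw TEXT,   -- original composite value, kept for output
        VendorKey TEXT)         -- extracted match portion, indexed""")
    cur.execute("CREATE INDEX ix_vendor_key ON invoices (VendorKey)")
    cur.executemany("INSERT INTO invoices VALUES (?, ?, ?)",
                    [(1, "ACME-001/east", "ACME-001"),
                     (2, "ACME-001/west", "ACME-001")])
    # Ask the engine how it would run an equality search on the key column.
    plan = cur.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT * FROM invoices WHERE VendorKey = 'ACME-001'").fetchall()
    print(plan[0][-1])  # the plan text names ix_vendor_key
    ```

    The same EXPLAIN QUERY PLAN on "WHERE VendorNumberRaw LIKE '%ACME-001%'" would show a full scan, which is what the cursor-era 25-minute run time was paying for on every comparison.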

  • This query is run only once for any of the tables, and the data is received in a text format and then imported, so we have to use LIKE statements.
