Mark similar records in a table with millions of records

  • I am doing a project where I need to Mark similar records in a table with millions of records. By similar I mean some fields have similar values.

    The way I am doing this is running a select statement to get all records for that project.

    Select id, field1, field2, field3, field4 from transactions where projectid=1

    Then I loop through them to find similar records and update a field matchingid with the matching record number and also check if it has a matchingid then not to update it. I cannot update anything where matchingid field has a value because I do different types of matching starting from matching most fields and then reducing them and so each one has a match type and I don't want to overwrite that.

    update transactions set matchingid = @id where matchingid is null AND matchtype = 'type1' AND field1 like '%@field1%' AND field2 like '%@field2%' AND field3 like '%@field3%' AND field4 like '%@field4%'

    Since there millions of records and each can have about a million records this loop takes very long about 6 hours because I have about 15 match types and it has loop every time. Is there a way to run this faster may be using temp tables and looping through them or some better way to do this. It would very helpful if you can give me a suggestion with an example. Thanks.

  • mano you are not taking advantage of the power of SQL's set based operations;

    you could do your update, to all millions of records, in a single statement....even better, it would update in seconds, vs your looping construct.

    if you post the REAL CREATE TABLE definitions, instead of pseudocode(field1,field2) etc we could help you better. sample data is a huge help as well, and finally show us that the project-id field is null/wrong value for some project_id's , but correct for others? is that what you are trying to show us as the desired results?

    I'll reread your post and put an example in a minute based on the psuedocode, but here's a question: why are you using LIKE on the 4 fields? would they not equal?

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • stating the obvious, do a backup, test this on a development server, ask questions, confirm it is close to what you are after.

    --update the whole table

    UPDATE transactions

    SET matchingid = MyAlias.id

    --get a grouping of 5 or more fields?

    FROM (Select id, matchtype, field1, field2, field3, field4 from transactions GROUP BY field1, field2, field3, field4 ) MyAlias

    --don't self update the row

    where transactions .id <> MyAlias.ID

    AND transactions.matchingid is null

    AND transactions.matchtype = MyAlias.matchtype

    AND transactions.field1 = MyAlias.field1

    AND transactions.field2 = MyAlias.field2

    AND transactions.field3 = MyAlias.@field3

    AND transactions.field4 = MyAlias.@field4

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Thanks for the quick reply. I am using like statements because they are not equal but similar. Will a modified version of your statement with like instead of equal work in my case.

  • manojkanoi (12/27/2009)


    Thanks for the quick reply. I am using like statements because they are not equal but similar. Will a modified version of your statement with like instead of equal work in my case.

    probably, but I don't know for sure; you didn't provide any of the essentials for an accurate answer: the DDL to create the tables, some INSERT INTO statements for some sample data.

    since you are planning on doing a LIKE statement, it will require a full table scan, but it should be several orders of magnitude faster than a cursor.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • I was traveling yesterday so could not reply.

    Here is the create statement

    CREATE TABLE [dbo].[mjtest](

    [InvoiceId] [int] IDENTITY(1,1) NOT NULL,

    [ProjectId] [int] NOT NULL,

    [CompanyCode] [varchar](2000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

    [VendorNumber] [varchar](2000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

    [CheckNumber] [varchar](2000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

    [InvoiceNumber] [varchar](2000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

    [VoucherNumber] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

    [DocumentNumber] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

    [Ap1] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

    [Ap2] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

    [Ap3] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

    [UserId] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

    [PurchaseOrdNumber] [varchar](2000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

    [CheckStatus] [varchar](2000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

    [VendorName] [varchar](2000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

    [InvoiceDate] [datetime] NULL,

    [InvoiceNo] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

    [CheckDate] [datetime] NULL,

    [InvoiceAmount] [float] NULL,

    [CheckAmount] [float] NULL,

    [tagging] [varchar](250) COLLATE SQL_Latin1_General_CP1_CI_AS NULL CONSTRAINT [DF_mjtest_tagging] DEFAULT ('no'),

    [auto_tag] [varchar](250) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

    [duplicate_type] [varchar](250) COLLATE SQL_Latin1_General_CP1_CI_AS NULL CONSTRAINT [DF_mjtest_duplicate_type] DEFAULT ('no'),

    [convertion_done] [tinyint] NULL CONSTRAINT [DF_mjtest_convertion_done] DEFAULT ((0)),

    [duplicate_with] [bigint] NULL CONSTRAINT [DF_mjtest_duplicate_with] DEFAULT ((0)),

    [StatusId] [int] NULL CONSTRAINT [DF_mjtest_StatusId] DEFAULT ((0)),

    [unmark] [tinyint] NULL,

    [reason] [varchar](max) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

    [discount] [float] NULL,

    [tagstatus] [int] NULL CONSTRAINT [DF_mjtest_tagstatus] DEFAULT ((0)),

    [sortbyid] [bigint] NULL,

    [chkfornegative] [tinyint] NULL CONSTRAINT [DF_mjtest_chkfornegative] DEFAULT ((0)),

    [id] [bigint] NULL,

    CONSTRAINT [PK_dbo.mjtest] PRIMARY KEY CLUSTERED

    (

    [InvoiceId] ASC

    )WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]

    ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

    Insert statements

    INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(1,50,'MTGI3','NOR03','A-286','286','16270','16270','a1414','b1211','c15281 ','11111','A134590','P','NORTHEAST SALES DIST INC','Oct 15 2008 12:00:00:000AM',NULL,'Oct 16 2008 12:00:00:000AM',-5.220000000000000e+002,-5.220000000000000e+002,'TD','YES','TRUE DUPLICATE',0,1,0,0,NULL,4.800000000000000e+001,0,1,0,0)

    INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(2,50,'MTGI3','NOR03','5315','286','16370','16370','a1415','b1212','c15282 ','11112','A134591','V','NORTHEAST SALES DIST INC','Oct 15 2008 12:00:00:000AM',NULL,'Nov 14 2008 12:00:00:000AM',2.000000000000000e-002,2.000000000000000e-002,'TDCR','MATCHED','MATCHED CREDIT',0,2,0,0,NULL,4.800000000000000e+001,0,2,0,0)

    INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(3,50,'MTGI3','NOR03','5315','286','16339','16339','a1416','b1213','c15283 ','11113','A134592','P','NORTHEAST SALES DIST INC','Oct 15 2008 12:00:00:000AM',NULL,'Nov 13 2008 12:00:00:000AM',-2.000000000000000e-002,-2.000000000000000e-002,'TDCR','MATCHED','MATCHED CREDIT',0,2,0,0,NULL,4.800000000000000e+001,0,2,0,0)

    INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(4,50,'MTGI3','NOR03','5328','286','16425','16425','a1417','b1214','c15284 ','11114','A134593','P','NORTHEAST SALES DIST INC','Oct 15 2008 12:00:00:000AM',NULL,'Nov 26 2008 12:00:00:000AM',-5.220000000000000e+002,-5.220000000000000e+002,'TD','YES','TRUE DUPLICATE',0,1,0,0,NULL,4.800000000000000e+001,0,1,0,0)

    INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(5,50,'MTGI3','NOR03','7899','DM00000286','16426','16426','a1418','b1215','c15285 ','11115','A134594','D','NORTHEAST SALES DIST INC','Oct 15 2008 12:00:00:000AM',NULL,'Oct 15 2008 12:00:00:000AM',2.000000000000000e-002,2.000000000000000e-002,'no','no',NULL,0,0,0,0,NULL,4.800000000000000e+001,0,22,0,0)

    INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(6,50,'MTGI3','NOR03','A-286','286','16353','16353','a1419','b1216','c15286 ','11116','A134595','V','NORTHEAST SALES DIST INC','Oct 15 2008 12:00:00:000AM',NULL,'Nov 13 2008 12:00:00:000AM',5.220000000000000e+002,5.220000000000000e+002,'TD','YES','TRUE DUPLICATE',0,6,0,0,NULL,4.800000000000000e+001,0,1,0,0)

    INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(7,50,'MTGI3','RAM03','5318','CK RQST 10/31/0','16342','16342','a1420','b1217','c15287 ','11117','A134596','P','FELIPE RAMIREZ','Oct 31 2008 12:00:00:000AM',NULL,'Nov 13 2008 12:00:00:000AM',-4.136400000000000e+002,-4.136400000000000e+002,'no','no',NULL,0,0,0,0,NULL,4.800000000000000e+001,0,22,0,0)

    INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(8,50,'MTGI3','SYS03','101214','810220564','17146','17146','a1421','b1218','c15288 ','11118','A134597','P','SYSCO FOOD SVS OF ATLANTA','Nov 1 2008 12:00:00:000AM',NULL,'Dec 16 2008 12:00:00:000AM',-8.834999999999999e+001,-8.834999999999999e+001,'A2','YES','INVOICE DATE',0,8,0,0,NULL,4.800000000000000e+001,0,11,0,0)

    INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(9,50,'MTGI3','NOR03','5328','286','16725','16725','a1422','b1219','c15289 ','11119','A134598','V','NORTHEAST SALES DIST INC','Oct 15 2008 12:00:00:000AM',NULL,'Nov 30 2008 12:00:00:000AM',5.220000000000000e+002,5.220000000000000e+002,'TD','YES','TRUE DUPLICATE',0,6,0,0,NULL,4.800000000000000e+001,0,1,0,0)

    INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(10,50,'MTGI3','SYS03','101218','812231389','17160','17160','a1423','b1220','c15290 ','11120','A134599','P','SYSCO FOOD SVS OF ATLANTA','Dec 23 2008 12:00:00:000AM',NULL,'Dec 30 2008 12:00:00:000AM',-9.509270000000000e+003,-9.509270000000000e+003,'no','no',NULL,0,0,0,0,NULL,4.800000000000000e+001,0,22,0,0)

    I have different conditions where I match different fields like the example I had given you before.

  • Hi,

    Were you able to check the create and insert statements I had sent. I did not give the following information in my last message and thats probably why you could not reply.

    The way I am doing this is running a select statement to get all records for that project.

    Select id, VendorNumber, checknumber, invoicenumber, InvoiceDate from mjtest where projectid=50

    Then I loop through them to find similar records and update a field duplicate_with with the matching record number and also check if it has a duplicate_with then not to update it. I cannot update anything where duplicate_with field has a value because I do different types of matching using field duplicate_type starting from matching most fields and then reducing them and so each one has a match type and I don't want to overwrite that.

    update transactions set duplicate_with = @id, duplicate_type='P4' where duplicate_with is null AND VendorNumber like '%@VendorNumber%' AND checknumber like '%@checknumber%' AND invoicenumber like '%@invoicenumber%' AND InvoiceDate <= @InvoiceDate+15 AND AND InvoiceDate >= @InvoiceDate-15

    Thanks for your help and I will be waiting for your answer.

  • There were a couple of issues I found. One was that the insert statements were trying to insert into an identity column. You need to put this in front of the inserts.

    SET IDENTITY_INSERT [MJTEST] ON;

    Don't forget to turn it off afterward

    SET IDENTITY_INSERT [MJTEST] OFF;

    Then the select was using an id columns which is set to 0. I'm not sure that's what you want. I thought you wanted to update one record with the ID from a matching record. I used the InvoiceID instead which is the identity column. I modified Lowell's code to match you table and fields. This worked for me

    UPDATE mjtest

    SET duplicate_with = MyAlias.id

    --get a grouping of 5 or more fields?

    FROM (Select InvoiceID id, projectid, VendorNumber, CheckNumber, InvoiceNumber, InvoiceDate

    from MJTest GROUP BY projectid, InvoiceID, VendorNumber, CheckNumber, InvoiceNumber, InvoiceDate ) MyAlias

    --don't self update the row

    where mjtest.InvoiceID <> MyAlias.ID

    AND mjtest.duplicate_with is null

    AND mjtest.VendorNumber like '%' + MyAlias.VendorNumber + '%'

    AND mjtest.CheckNumber like '%' + MyAlias.CheckNumber + '%'

    AND mjtest.InvoiceNumber like '%' + MyAlias.InvoiceNumber + '%'

    AND mjtest.InvoiceDate between MyAlias.InvoiceDate - 15 and MyAlias.InvoiceDate + 15

    AND mjtest.projectid = MyAlias.projectid

    For better, quicker answers, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • this looks good. why are you grouping those 5 fields in the subquery?

  • to make sure only one row comes back per set

    For better, quicker answers, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • THanks. I ran the following for my exact duplicate type records and it was much faster but since you have vpsinvoice.InvoiceID <> MyAlias.InvoiceID it did not update the parent record duplicate_with field which I also want to update. If I remove this condition it updates the whole database with the same id. How do I change it so only the ones that have duplicates get updated?

    UPDATE vpsinvoice

    SET duplicate_with = MyAlias.InvoiceID

    FROM (Select InvoiceID, projectid, VendorNumber, CheckNumber, InvoiceNumber, InvoiceDate,InvoiceAmount,VendorName

    from vpsinvoice GROUP BY projectid, InvoiceID, VendorNumber, CheckNumber, InvoiceNumber, InvoiceDate,InvoiceAmount,VendorName ) MyAlias

    where vpsinvoice.InvoiceID <> MyAlias.InvoiceID

    AND vpsinvoice.VendorNumber = MyAlias.VendorNumber

    AND vpsinvoice.VendorName = MyAlias.VendorName

    AND vpsinvoice.InvoiceNumber = MyAlias.InvoiceNumber

    AND vpsinvoice.InvoiceDate = MyAlias.InvoiceDate

    AND vpsinvoice.InvoiceAmount = MyAlias.InvoiceAmount

    AND vpsinvoice.projectid = MyAlias.projectid

  • I'm not sure what you're talking about. Do you have an example? Based on the data you provided, it updated all duplicate records. The only records not updated were "non-duplicate". I noticed your example didn't use like. Is there a reason for that?

    For better, quicker answers, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • What I am saying is if invoiceid 1 is same as invoice 2, 3 and 4 I want the duplicate_with field to have the value 1 for all invoiceids 1,2,3,4. The way it is now it only updates 2,3,4 but not for invoiceid 1. Can we modify it so that all of them get updated including the row it is being matched to and use the lowest invoiceid for the duplicate_with field in all cases. I am not using like because this is the first duplicate match condition without like and then the others will use like. Thanks and Happy New Year.

  • I ran it again and noticed that 1 and 4 were duplicates and the duplicate_with field for 1 had the value 4 and duplicate_with for 4 had the value 1.What I want is for both rows to have the lower invoiceid value for duplicate_with field which is 1 so I can group them together. How can I do that?

  • Please check my last 2 messages and I noticed 1 more issue with this. If there are more than 2 matching rows only the first matching row gets updated and not the next one. Suppose 1,2,3,4 are matching to one then I want all rows duplicate_with to have the id 1 but it only updates row 2 with the number and not 3 and 4. How can I do that?

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply