December 27, 2009 at 11:07 am
I am doing a project where I need to Mark similar records in a table with millions of records. By similar I mean some fields have similar values.
The way I am doing this is running a select statement to get all records for that project.
Select id, field1, field2, field3, field4 from transactions where projectid=1
Then I loop through them to find similar records and update a field matchingid with the matching record number and also check if it has a matchingid then not to update it. I cannot update anything where matchingid field has a value because I do different types of matching starting from matching most fields and then reducing them and so each one has a match type and I don't want to overwrite that.
update transactions set matchingid = @id where matchingid is null AND matchtype = 'type1' AND field1 like '%@field1%' AND field2 like '%@field2%' AND field3 like '%@field3%' AND field4 like '%@field4%'
Since there millions of records and each can have about a million records this loop takes very long about 6 hours because I have about 15 match types and it has loop every time. Is there a way to run this faster may be using temp tables and looping through them or some better way to do this. It would very helpful if you can give me a suggestion with an example. Thanks.
December 27, 2009 at 11:58 am
mano you are not taking advantage of the power of SQL's set based operations;
you could do your update, to all millions of records, in a single statement....even better, it would update in seconds, vs your looping construct.
if you post the REAL CREATE TABLE definitions, instead of pseudocode(field1,field2) etc we could help you better. sample data is a huge help as well, and finally show us that the project-id field is null/wrong value for some project_id's , but correct for others? is that what you are trying to show us as the desired results?
I'll reread your post and put an example in a minute based on the psuedocode, but here's a question: why are you using LIKE on the 4 fields? would they not equal?
Lowell
December 27, 2009 at 12:08 pm
stating the obvious, do a backup, test this on a development server, ask questions, confirm it is close to what you are after.
--update the whole table
UPDATE transactions
SET matchingid = MyAlias.id
--get a grouping of 5 or more fields?
FROM (Select id, matchtype, field1, field2, field3, field4 from transactions GROUP BY field1, field2, field3, field4 ) MyAlias
--don't self update the row
where transactions .id <> MyAlias.ID
AND transactions.matchingid is null
AND transactions.matchtype = MyAlias.matchtype
AND transactions.field1 = MyAlias.field1
AND transactions.field2 = MyAlias.field2
AND transactions.field3 = MyAlias.@field3
AND transactions.field4 = MyAlias.@field4
Lowell
December 27, 2009 at 1:40 pm
Thanks for the quick reply. I am using like statements because they are not equal but similar. Will a modified version of your statement with like instead of equal work in my case.
December 27, 2009 at 3:11 pm
manojkanoi (12/27/2009)
Thanks for the quick reply. I am using like statements because they are not equal but similar. Will a modified version of your statement with like instead of equal work in my case.
probably, but I don't know for sure; you didn't provide any of the essentials for an accurate answer: the DDL to create the tables, some INSERT INTO statements for some sample data.
since you are planning on doing a LIKE statement, it will require a full table scan, but it should be several orders of magnitude faster than a cursor.
Lowell
December 28, 2009 at 10:29 pm
I was traveling yesterday so could not reply.
Here is the create statement
CREATE TABLE [dbo].[mjtest](
[InvoiceId] [int] IDENTITY(1,1) NOT NULL,
[ProjectId] [int] NOT NULL,
[CompanyCode] [varchar](2000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[VendorNumber] [varchar](2000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[CheckNumber] [varchar](2000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[InvoiceNumber] [varchar](2000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[VoucherNumber] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[DocumentNumber] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Ap1] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Ap2] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Ap3] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[UserId] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[PurchaseOrdNumber] [varchar](2000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[CheckStatus] [varchar](2000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[VendorName] [varchar](2000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[InvoiceDate] [datetime] NULL,
[InvoiceNo] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[CheckDate] [datetime] NULL,
[InvoiceAmount] [float] NULL,
[CheckAmount] [float] NULL,
[tagging] [varchar](250) COLLATE SQL_Latin1_General_CP1_CI_AS NULL CONSTRAINT [DF_mjtest_tagging] DEFAULT ('no'),
[auto_tag] [varchar](250) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[duplicate_type] [varchar](250) COLLATE SQL_Latin1_General_CP1_CI_AS NULL CONSTRAINT [DF_mjtest_duplicate_type] DEFAULT ('no'),
[convertion_done] [tinyint] NULL CONSTRAINT [DF_mjtest_convertion_done] DEFAULT ((0)),
[duplicate_with] [bigint] NULL CONSTRAINT [DF_mjtest_duplicate_with] DEFAULT ((0)),
[StatusId] [int] NULL CONSTRAINT [DF_mjtest_StatusId] DEFAULT ((0)),
[unmark] [tinyint] NULL,
[reason] [varchar](max) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[discount] [float] NULL,
[tagstatus] [int] NULL CONSTRAINT [DF_mjtest_tagstatus] DEFAULT ((0)),
[sortbyid] [bigint] NULL,
[chkfornegative] [tinyint] NULL CONSTRAINT [DF_mjtest_chkfornegative] DEFAULT ((0)),
[id] [bigint] NULL,
CONSTRAINT [PK_dbo.mjtest] PRIMARY KEY CLUSTERED
(
[InvoiceId] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
Insert statements
INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(1,50,'MTGI3','NOR03','A-286','286','16270','16270','a1414','b1211','c15281 ','11111','A134590','P','NORTHEAST SALES DIST INC','Oct 15 2008 12:00:00:000AM',NULL,'Oct 16 2008 12:00:00:000AM',-5.220000000000000e+002,-5.220000000000000e+002,'TD','YES','TRUE DUPLICATE',0,1,0,0,NULL,4.800000000000000e+001,0,1,0,0)
INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(2,50,'MTGI3','NOR03','5315','286','16370','16370','a1415','b1212','c15282 ','11112','A134591','V','NORTHEAST SALES DIST INC','Oct 15 2008 12:00:00:000AM',NULL,'Nov 14 2008 12:00:00:000AM',2.000000000000000e-002,2.000000000000000e-002,'TDCR','MATCHED','MATCHED CREDIT',0,2,0,0,NULL,4.800000000000000e+001,0,2,0,0)
INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(3,50,'MTGI3','NOR03','5315','286','16339','16339','a1416','b1213','c15283 ','11113','A134592','P','NORTHEAST SALES DIST INC','Oct 15 2008 12:00:00:000AM',NULL,'Nov 13 2008 12:00:00:000AM',-2.000000000000000e-002,-2.000000000000000e-002,'TDCR','MATCHED','MATCHED CREDIT',0,2,0,0,NULL,4.800000000000000e+001,0,2,0,0)
INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(4,50,'MTGI3','NOR03','5328','286','16425','16425','a1417','b1214','c15284 ','11114','A134593','P','NORTHEAST SALES DIST INC','Oct 15 2008 12:00:00:000AM',NULL,'Nov 26 2008 12:00:00:000AM',-5.220000000000000e+002,-5.220000000000000e+002,'TD','YES','TRUE DUPLICATE',0,1,0,0,NULL,4.800000000000000e+001,0,1,0,0)
INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(5,50,'MTGI3','NOR03','7899','DM00000286','16426','16426','a1418','b1215','c15285 ','11115','A134594','D','NORTHEAST SALES DIST INC','Oct 15 2008 12:00:00:000AM',NULL,'Oct 15 2008 12:00:00:000AM',2.000000000000000e-002,2.000000000000000e-002,'no','no',NULL,0,0,0,0,NULL,4.800000000000000e+001,0,22,0,0)
INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(6,50,'MTGI3','NOR03','A-286','286','16353','16353','a1419','b1216','c15286 ','11116','A134595','V','NORTHEAST SALES DIST INC','Oct 15 2008 12:00:00:000AM',NULL,'Nov 13 2008 12:00:00:000AM',5.220000000000000e+002,5.220000000000000e+002,'TD','YES','TRUE DUPLICATE',0,6,0,0,NULL,4.800000000000000e+001,0,1,0,0)
INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(7,50,'MTGI3','RAM03','5318','CK RQST 10/31/0','16342','16342','a1420','b1217','c15287 ','11117','A134596','P','FELIPE RAMIREZ','Oct 31 2008 12:00:00:000AM',NULL,'Nov 13 2008 12:00:00:000AM',-4.136400000000000e+002,-4.136400000000000e+002,'no','no',NULL,0,0,0,0,NULL,4.800000000000000e+001,0,22,0,0)
INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(8,50,'MTGI3','SYS03','101214','810220564','17146','17146','a1421','b1218','c15288 ','11118','A134597','P','SYSCO FOOD SVS OF ATLANTA','Nov 1 2008 12:00:00:000AM',NULL,'Dec 16 2008 12:00:00:000AM',-8.834999999999999e+001,-8.834999999999999e+001,'A2','YES','INVOICE DATE',0,8,0,0,NULL,4.800000000000000e+001,0,11,0,0)
INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(9,50,'MTGI3','NOR03','5328','286','16725','16725','a1422','b1219','c15289 ','11119','A134598','V','NORTHEAST SALES DIST INC','Oct 15 2008 12:00:00:000AM',NULL,'Nov 30 2008 12:00:00:000AM',5.220000000000000e+002,5.220000000000000e+002,'TD','YES','TRUE DUPLICATE',0,6,0,0,NULL,4.800000000000000e+001,0,1,0,0)
INSERT INTO [mjtest] ([InvoiceId],[ProjectId],[CompanyCode],[VendorNumber],[CheckNumber],[InvoiceNumber],[VoucherNumber],[DocumentNumber],[Ap1],[Ap2],[Ap3],[UserId],[PurchaseOrdNumber],[CheckStatus],[VendorName],[InvoiceDate],[InvoiceNo],[CheckDate],[InvoiceAmount],[CheckAmount],[tagging],[auto_tag],[duplicate_type],[convertion_done],[duplicate_with],[StatusId],[unmark],[reason],[discount],[tagstatus],[sortbyid],[chkfornegative],[id])VALUES(10,50,'MTGI3','SYS03','101218','812231389','17160','17160','a1423','b1220','c15290 ','11120','A134599','P','SYSCO FOOD SVS OF ATLANTA','Dec 23 2008 12:00:00:000AM',NULL,'Dec 30 2008 12:00:00:000AM',-9.509270000000000e+003,-9.509270000000000e+003,'no','no',NULL,0,0,0,0,NULL,4.800000000000000e+001,0,22,0,0)
I have different conditions where I match different fields like the example I had given you before.
December 29, 2009 at 8:14 pm
Hi,
Were you able to check the create and insert statements I had sent. I did not give the following information in my last message and thats probably why you could not reply.
The way I am doing this is running a select statement to get all records for that project.
Select id, VendorNumber, checknumber, invoicenumber, InvoiceDate from mjtest where projectid=50
Then I loop through them to find similar records and update a field duplicate_with with the matching record number and also check if it has a duplicate_with then not to update it. I cannot update anything where duplicate_with field has a value because I do different types of matching using field duplicate_type starting from matching most fields and then reducing them and so each one has a match type and I don't want to overwrite that.
update transactions set duplicate_with = @id, duplicate_type='P4' where duplicate_with is null AND VendorNumber like '%@VendorNumber%' AND checknumber like '%@checknumber%' AND invoicenumber like '%@invoicenumber%' AND InvoiceDate <= @InvoiceDate+15 AND AND InvoiceDate >= @InvoiceDate-15
Thanks for your help and I will be waiting for your answer.
December 30, 2009 at 7:14 am
There were a couple of issues I found. One was that the insert statements were trying to insert into an identity column. You need to put this in front of the inserts.
SET IDENTITY_INSERT [MJTEST] ON;
Don't forget to turn it off afterward
SET IDENTITY_INSERT [MJTEST] OFF;
Then the select was using an id columns which is set to 0. I'm not sure that's what you want. I thought you wanted to update one record with the ID from a matching record. I used the InvoiceID instead which is the identity column. I modified Lowell's code to match you table and fields. This worked for me
UPDATE mjtest
SET duplicate_with = MyAlias.id
--get a grouping of 5 or more fields?
FROM (Select InvoiceID id, projectid, VendorNumber, CheckNumber, InvoiceNumber, InvoiceDate
from MJTest GROUP BY projectid, InvoiceID, VendorNumber, CheckNumber, InvoiceNumber, InvoiceDate ) MyAlias
--don't self update the row
where mjtest.InvoiceID <> MyAlias.ID
AND mjtest.duplicate_with is null
AND mjtest.VendorNumber like '%' + MyAlias.VendorNumber + '%'
AND mjtest.CheckNumber like '%' + MyAlias.CheckNumber + '%'
AND mjtest.InvoiceNumber like '%' + MyAlias.InvoiceNumber + '%'
AND mjtest.InvoiceDate between MyAlias.InvoiceDate - 15 and MyAlias.InvoiceDate + 15
AND mjtest.projectid = MyAlias.projectid
For better, quicker answers, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
December 30, 2009 at 7:37 am
this looks good. why are you grouping those 5 fields in the subquery?
December 30, 2009 at 7:58 am
to make sure only one row comes back per set
For better, quicker answers, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
December 30, 2009 at 10:48 pm
THanks. I ran the following for my exact duplicate type records and it was much faster but since you have vpsinvoice.InvoiceID <> MyAlias.InvoiceID it did not update the parent record duplicate_with field which I also want to update. If I remove this condition it updates the whole database with the same id. How do I change it so only the ones that have duplicates get updated?
UPDATE vpsinvoice
SET duplicate_with = MyAlias.InvoiceID
FROM (Select InvoiceID, projectid, VendorNumber, CheckNumber, InvoiceNumber, InvoiceDate,InvoiceAmount,VendorName
from vpsinvoice GROUP BY projectid, InvoiceID, VendorNumber, CheckNumber, InvoiceNumber, InvoiceDate,InvoiceAmount,VendorName ) MyAlias
where vpsinvoice.InvoiceID <> MyAlias.InvoiceID
AND vpsinvoice.VendorNumber = MyAlias.VendorNumber
AND vpsinvoice.VendorName = MyAlias.VendorName
AND vpsinvoice.InvoiceNumber = MyAlias.InvoiceNumber
AND vpsinvoice.InvoiceDate = MyAlias.InvoiceDate
AND vpsinvoice.InvoiceAmount = MyAlias.InvoiceAmount
AND vpsinvoice.projectid = MyAlias.projectid
December 31, 2009 at 6:30 am
I'm not sure what you're talking about. Do you have an example? Based on the data you provided, it updated all duplicate records. The only records not updated were "non-duplicate". I noticed your example didn't use like. Is there a reason for that?
For better, quicker answers, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
January 1, 2010 at 2:05 am
What I am saying is if invoiceid 1 is same as invoice 2, 3 and 4 I want the duplicate_with field to have the value 1 for all invoiceids 1,2,3,4. The way it is now it only updates 2,3,4 but not for invoiceid 1. Can we modify it so that all of them get updated including the row it is being matched to and use the lowest invoiceid for the duplicate_with field in all cases. I am not using like because this is the first duplicate match condition without like and then the others will use like. Thanks and Happy New Year.
January 1, 2010 at 2:28 am
I ran it again and noticed that 1 and 4 were duplicates and the duplicate_with field for 1 had the value 4 and duplicate_with for 4 had the value 1.What I want is for both rows to have the lower invoiceid value for duplicate_with field which is 1 so I can group them together. How can I do that?
January 1, 2010 at 4:59 am
Please check my last 2 messages and I noticed 1 more issue with this. If there are more than 2 matching rows only the first matching row gets updated and not the next one. Suppose 1,2,3,4 are matching to one then I want all rows duplicate_with to have the id 1 but it only updates row 2 with the number and not 3 and 4. How can I do that?
Viewing 15 posts - 1 through 15 (of 21 total)
You must be logged in to reply to this topic. Login to reply