January 26, 2009 at 6:22 am
It seems the more I get into SQL, the more I find I am also getting into programming! Being "old school", I tend to be "linear" in my thinking, doing things one step at a time, which is why I was never any good at more abstract languages such as C++ or Java!
We recently discovered a major bug in an accounting system due to improper indexes/keys set on a table, thankfully from well before my time here. Basically, the table can hold a customer invoice with multiple items and charges (or credits) per invoice. There are cases where an ItemNumber, charge, and charge code can be duplicated when it is posted with additional items, e.g.,
Invoice1 ItemNo1 $21 ChargeType: MainItem
Invoice1 ItemNo1 $1.5 ChargeType: Tax
Invoice1 ItemNo1 $3.0 ChargeType: Delivery
... (further in the file with weekly invoices)...
Invoice1 ItemNo1 $21 ChargeType: MainItem
Invoice1 ItemNo1 $1.5 ChargeType: Tax
Invoice1 ItemNo1 $4.0 ChargeType: Fuel
Invoice1 ItemNo1 -$3.0 ChargeType: Delivery
...
Unfortunately, since an outside system creates the invoice file, we have no control over the content coming to us, just the ability to import and process what arrives.
My task is to add a unique numbered sequence to all of the above items in the existing table. So, for the example above, I will need 1 through 7 for the Invoice1 ItemNo1 entries in the table. There are no other fields that will make each record "unique".
My thought was to loop by doing the following:
1) Create a new table, dbo.Inv_NewCustCharge, with the exact same structure but with an INT IDENTITY(1,1) field called "primary_key". I could then take all of the existing records and number each one from 1 through the end of the data (see the sketch after this list).
2) Get a list of all of the Invoice(s) and ItemNo, along with a count of the number of existing item(s) in the database (#temptable).
3) Have an outer loop that gets the Invoice, ItemNo and count.
4) Have an inner loop that will number each record based on the "count" of ItemNo(s), then insert that into the table.
5) Delete the "primary_Key" field after all of the items in (3) and (4) are numbered.
New entries into the table would get their numbering from the application that enters the data into the database, so I just need to deal with the existing data. Here was my solution:
SET NOCOUNT ON
DECLARE @Statement NVARCHAR(4000),
@MyItemNumber NVARCHAR(128),
@MyInvoice NVARCHAR(128),
@MyCount NVARCHAR(128),
@DataCntr INT,
@LoopCntr INT,
@RowCounter INT,
@MinRow BIGINT,
@MaxRow BIGINT;
IF OBJECT_ID(N'tempdb..#temptable', N'U') IS NOT NULL
DROP TABLE #temptable;
CREATE TABLE #temptable (primary_key INT IDENTITY(1,1) NOT NULL,
CntRecords int NOT NULL, ItemNo varchar(20) NOT NULL, InvoiceNo char(9) NOT NULL);
INSERT INTO #temptable (CntRecords, ItemNo, InvoiceNo)
SELECT COUNT(*) AS CntRecords, ItemNo, InvoiceNo
FROM dbo.Inv_CustCharge
GROUP BY ItemNo, InvoiceNo;
-- Set counters for looping purposes
SET @DataCntr = ISNULL((Select count(*) from #temptable),0)
SET @LoopCntr = 1
WHILE @DataCntr > 0 AND @LoopCntr <= @DataCntr
BEGIN -- WHILE LOOP
SELECT @MyCount = CntRecords, @MyItemNumber = ItemNo, @MyInvoice = InvoiceNo
FROM #temptable
WHERE primary_key = @LoopCntr
SELECT @MinRow = MIN(primary_key), @MaxRow = MAX(primary_key)
FROM dbo.Inv_NewCustCharge
WHERE ItemNo = @MyItemNumber AND InvoiceNo = @MyInvoice;
SET @RowCounter = 1
WHILE @RowCounter <= @MyCount AND @MinRow <= @MaxRow
BEGIN
UPDATE dbo.Inv_NewCustCharge
SET RowNumber = @RowCounter
WHERE ItemNo = @MyItemNumber AND InvoiceNo = @MyInvoice
AND Primary_Key = @MinRow;
SET @RowCounter = @RowCounter + 1;
SET @MinRow = @MinRow + 1
END -- Inner While Loop
SET @LoopCntr = @LoopCntr + 1;
END -- WHILE LOOP
DROP TABLE #temptable;
The biggest issue is the time it takes to perform the work. For 116K records, it took 2 hours 34 minutes to "re-number" the table. Some databases can have over 1 million records in the table for a specific customer!
I was hoping someone might have some good "out of the box" ideas to perform the above in a much faster fashion. My boss seems to think I can get this process down to 5-10 minutes... personally, I think he is off his rocker!
Thanks!
Argue for your limitations, and sure enough they're yours (Richard Bach, Illusions)
January 26, 2009 at 7:47 am
Thinking about it some more, we came up with a solution that might be a bit faster... instead of doing an update for every record (processing row by row may sometimes be necessary, but it is NEVER good! See the "gots" list courtesy of Jeff Moden!), we created a new table using the primary key and a row counter:
CREATE TABLE #Temp_NewChargeCount (Primary_key BIGINT NOT NULL, NewCount INT NOT NULL);
We then modified the code to do an insert,
INSERT INTO #Temp_NewChargeCount
VALUES(@MinRow, @RowCounter);
SET @RowCounter = @RowCounter + 1;
SET @MinRow = @MinRow + 1
We then join #Temp_NewChargeCount to the table that has the Primary_key field and insert the row number plus the original records into a new table.
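In rough form, that last join/insert looks like this; the target table name and the Charge/ChargeType columns are placeholders, since only the key columns appear above:

INSERT INTO dbo.Inv_CustCharge_Renumbered (InvoiceNo, ItemNo, Charge, ChargeType, RowNumber) -- target name is a placeholder
SELECT c.InvoiceNo, c.ItemNo, c.Charge, c.ChargeType, n.NewCount
FROM dbo.Inv_NewCustCharge AS c
JOIN #Temp_NewChargeCount AS n
  ON n.Primary_key = c.primary_key;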
It now takes 35 minutes 34 seconds instead of 2 hours 34 minutes.... a definite improvement!
Argue for your limitations, and sure enough they're yours (Richard Bach, Illusions)
January 26, 2009 at 8:22 am
You might want to take a look at the Row_Number, Rank and Dense_Rank functions in Books Online. I'm not sure, but they seem to fit what you're looking for here.
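A quick way to see the difference between the three, run against the charges table (the Charge and ChargeType column names are guesses on my part), is something like:

SELECT InvoiceNo, ItemNo, Charge, ChargeType,
       ROW_NUMBER() OVER (PARTITION BY InvoiceNo, ItemNo ORDER BY ChargeType) AS row_no,
       RANK()       OVER (PARTITION BY InvoiceNo, ItemNo ORDER BY ChargeType) AS rnk,
       DENSE_RANK() OVER (PARTITION BY InvoiceNo, ItemNo ORDER BY ChargeType) AS dense_rnk
FROM dbo.Inv_CustCharge;

Row_Number always hands out a unique sequence within the partition, while Rank and Dense_Rank repeat the same value for ties, so Row_Number is the one that makes each row unique.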
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 26, 2009 at 8:51 am
Applying the Row_Number approach to your data, for example:
;WITH numbered AS (
    SELECT rowno = ROW_NUMBER() OVER (PARTITION BY Invoice, Item, Type ORDER BY Invoice),
           Invoice, Item, Charge, Type
    FROM Sales
)
SELECT * FROM numbered;
Will give you these results:
rowno  Invoice  Item  Charge  type
1      1        1      3.00   Delivery
2      1        1     -3.00   Delivery
1      1        1      4.00   Fuel
1      1        1     21.00   MainItem
2      1        1     21.00   MainItem
1      1        1      1.50   Tax
2      1        1      1.50   Tax
Remvoing the "Item" from the above code will result in
;WITH numbered AS (
    SELECT rowno = ROW_NUMBER() OVER (PARTITION BY Invoice, Item ORDER BY Invoice),
           Invoice, Item, Charge, Type
    FROM Sales
)
SELECT * FROM numbered;
Producing:
rowno  Invoice  Item  Charge  type
1      1        1     21.00   MainItem
2      1        1      1.50   Tax
3      1        1      3.00   Delivery
4      1        1     21.00   MainItem
5      1        1      1.50   Tax
6      1        1      4.00   Fuel
7      1        1     -3.00   Delivery
8      1        1     21.00   MainItem
9      1        1      1.50   Tax
10     1        1      3.00   Delivery
11     1        1     21.00   MainItem
12     1        1      1.50   Tax
13     1        1      4.00   Fuel
14     1        1     -3.00   Delivery
Hopefully this will be of some use to you
January 26, 2009 at 9:12 am
Oh... oh... now why did I not think of that! I use the ROW_NUMBER() function in other SQL statements... I just never thought of it in this case. Another case of posting here yielding excellent results! 🙂 🙂
I used SELECT ROW_NUMBER() OVER(PARTITION BY ItemNo ORDER BY InvoiceNo) and compared the results with my programmatic solution... identical counts.
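Wrapped into an update against the work table, it comes out roughly like this (RowNumber is the column being populated):

;WITH numbered AS (
    SELECT RowNumber,
           ROW_NUMBER() OVER (PARTITION BY ItemNo ORDER BY InvoiceNo) AS new_seq
    FROM dbo.Inv_NewCustCharge
)
UPDATE numbered
SET RowNumber = new_seq;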
The speed? 1 second. WOOO HOOO!
Argue for your limitations, and sure enough they're yours (Richard Bach, Illusions)
January 26, 2009 at 9:38 am
The cool thing about this is that your boss was right. 😉
Here's a solution I was building, but I'm a bit late:
USE tempdb
GO
IF EXISTS (SELECT *
FROM sys.objects
WHERE object_id = OBJECT_ID(N'[dbo].Invoice')
AND TYPE IN (N'U'))
DROP TABLE [dbo].Invoice
GO
CREATE TABLE Invoice (
ID INT IDENTITY ( 1 , 1 ),
SequenceNum INT,
InvoiceNum NVARCHAR(20),
ItemNum NVARCHAR(20),
Price MONEY,
ChargeType NVARCHAR(20))
INSERT INTO Invoice (InvoiceNum, ItemNum, Price, ChargeType)
SELECT 'Invoice1', 'ItemNo1', 21,   'MainItem' UNION ALL
SELECT 'Invoice1', 'ItemNo1', 1.5,  'Tax'      UNION ALL
SELECT 'Invoice1', 'ItemNo1', 3.0,  'Delivery' UNION ALL
SELECT 'Invoice1', 'ItemNo1', 21,   'MainItem' UNION ALL
SELECT 'Invoice1', 'ItemNo1', 1.5,  'Tax'      UNION ALL
SELECT 'Invoice2', 'ItemNo1', 1.5,  'Tax'      UNION ALL
SELECT 'Invoice2', 'ItemNo1', 3.0,  'Delivery' UNION ALL
SELECT 'Invoice2', 'ItemNo1', 21,   'MainItem' UNION ALL
SELECT 'Invoice2', 'ItemNo1', 1.5,  'Tax'      UNION ALL
SELECT 'Invoice2', 'ItemNo1', 4.0,  'Fuel'     UNION ALL
SELECT 'Invoice1', 'ItemNo1', 4.0,  'Fuel'     UNION ALL
SELECT 'Invoice1', 'ItemNo1', -3.0, 'Delivery' UNION ALL
SELECT 'Invoice2', 'ItemNo1', 21,   'MainItem' UNION ALL
SELECT 'Invoice2', 'ItemNo1', -3.0, 'Delivery'
CREATE CLUSTERED INDEX Clus_Invoice_ID ON Invoice (
ID ASC)
CREATE NONCLUSTERED INDEX IX_Invoice_InvoiceNum ON Invoice (
InvoiceNum ASC)
UPDATE Invoice
SET SequenceNum = NewSequence.NewSequence
FROM (SELECT row_number()
OVER(PARTITION BY InvoiceNum ORDER BY InvoiceNum ASC) AS NewSequence,
ID,
SequenceNum,
InvoiceNum,
ItemNum,
Price,
ChargeType
FROM Invoice) AS NewSequence
WHERE Invoice.ID = NewSequence.ID
SELECT *
FROM Invoice
ORDER BY invoiceNum,
SequenceNum
Hope it helps,
Cheers,
J-F
January 26, 2009 at 7:16 pm
DB_Newbie2007 (1/26/2009)
It seems the more I get into SQL, the more I find I am also getting into programming! Being "old school", I tend to be "linear" in my thinking, doing things one step at a time, which is why I was never any good at more abstract languages such as C++ or Java!
We recently discovered a major bug in an accounting system due to improper indexes/keys set on a table, thankfully from well before my time here. Basically, the table can hold a customer invoice with multiple items and charges (or credits) per invoice. There are cases where an ItemNumber, charge, and charge code can be duplicated when it is posted with additional items, e.g.,
Invoice1 ItemNo1 $21 ChargeType: MainItem
Invoice1 ItemNo1 $1.5 ChargeType: Tax
Invoice1 ItemNo1 $3.0 ChargeType: Delivery
... (further in the file with weekly invoices)...
Invoice1 ItemNo1 $21 ChargeType: MainItem
Invoice1 ItemNo1 $1.5 ChargeType: Tax
Invoice1 ItemNo1 $4.0 ChargeType: Fuel
Invoice1 ItemNo1 -$3.0 ChargeType: Delivery
...
Unfortunately, since an outside system creates the invoice file, we have no control over the content coming to us, just the ability to import and process what arrives.
My task is to add a unique numbered sequence to all of the above items in the existing table. So, for the example above, I will need 1 through 7 for the Invoice1 ItemNo1 entries in the table. There are no other fields that will make each record "unique".
Ok... I've got to ask... once you have those unique numbers on the rows, what are you going to do? Let the duplicates continue to exist or ??? Obviously, the purpose of adding such numbers is to do something with them... what is that purpose?
--Jeff Moden
Change is inevitable... Change for the better is not.
January 27, 2009 at 5:41 am
Jeff,
Good question. First, the problem was that the Invoice Total (a separate record in the file we receive) did not always match the sum of the charges (the sum was sometimes lower)... which is when we discovered the "duplicates".
Now that we can bring "all" of the data in, we can analyze it and figure out whether some of these companies are being incorrectly charged twice for the same items. I wrote another script to compare all of the invoice totals against the sum of the charges, going back to 2007... less than 1% of the invoices have a problem, but it is still something that needs to be analyzed.
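Roughly, that comparison looks like this; the invoice-total table and its column names are placeholders, since that record layout isn't shown here:

SELECT t.InvoiceNo, t.InvoiceTotal, SUM(c.Charge) AS SumOfCharges
FROM dbo.Inv_InvoiceTotal AS t                        -- placeholder name for the invoice-total records
JOIN dbo.Inv_CustCharge AS c ON c.InvoiceNo = t.InvoiceNo
GROUP BY t.InvoiceNo, t.InvoiceTotal
HAVING t.InvoiceTotal <> SUM(c.Charge);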
Being a former Engineer/Scientist, I was always taught never to throw out data; always bring it in and analyze it to try to figure out what is going on (much like your comment about RBAR versus set-based queries). But bringing in all of these records causes a new problem... there are 10 reports that rely on each record in the "charges" table being unique. Any time it was not, it was considered a duplicate billing. Now things have changed... another challenge, another day! 🙂
Now part of the challenge is to ensure all records are imported but that a user cannot mistakenly re-process the same invoice file (which can contain multiple invoices). We do that by assigning each record in the invoice file a unique number. If we see the same invoice with the same row number come in, we know someone is trying to re-process. If we see a previous invoice come in but in a different location (row number) in a new invoice file, we know that is a true duplicate billing...
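Roughly, that check looks like this; the staging table and the charge columns are placeholders:

SELECT s.InvoiceNo, s.ItemNo, s.Charge, s.ChargeType, s.RowNumber,
       CASE WHEN e.RowNumber = s.RowNumber
            THEN 'Attempted re-processing of the same file'
            ELSE 'Possible true duplicate billing'
       END AS Flag
FROM #IncomingCharges AS s                 -- placeholder staging table for the incoming file
JOIN dbo.Inv_NewCustCharge AS e
  ON  e.InvoiceNo  = s.InvoiceNo
  AND e.ItemNo     = s.ItemNo
  AND e.Charge     = s.Charge
  AND e.ChargeType = s.ChargeType;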
Argue for your limitations, and sure enough they're yours (Richard Bach, Illusions)
January 27, 2009 at 6:33 am
Thanks for the info... just trying to understand so we can do this in as high-performance a way as possible...
In your original post on this thread, you have (obviously) some dupes... would you mark which ones are dupes just to be absolutely clear? I have an idea that may automate your "analysis". Thanks.
--Jeff Moden
Change is inevitable... Change for the better is not.