August 18, 2021 at 12:04 am
You are correct on the PK.
Thanks.
August 18, 2021 at 4:49 am
Bruin wrote: With the same Line, SubLine?
Look at the PK you've chosen. IMHO, only the first 3 columns of InvoiceNbr, Line, and SubLine should make up the PK.
I'm more confused than that and have a more fundamental q. Invoices and shipments are two different data entities. Invoices don't have a "ship date", you don't ship invoices, you ship shipments, which can have several different shipment dates.
To the OP: is your table for invoices or for shipments??
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
August 18, 2021 at 4:53 am
Jeff Moden wrote: Bruin wrote: With the same Line, SubLine?
Look at the PK you've chosen. IMHO, only the first 3 columns of InvoiceNbr, Line, and SubLine should make up the PK.
I'm more confused than that and have a more fundamental q. Invoices and shipments are two different data entities. Invoices don't have a "ship date", you don't ship invoices, you ship shipments, which can have several different shipment dates.
Is that table for invoices or for shipments??
Invoice line items can have a ship date but I agree that there's a bit of a "fog on the bog" for this one.
I'm still doing my FTE work at this late hour and won't get to a proposed solution tonight but a little trick with HASHBYTEs for the non-PK columns will make for a nasty fast solution leading to a nice, simple, fast UPSERT with the comparison the OP wants.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 18, 2021 at 5:03 am
Invoice line items can have a ship date
Only with a bad data model. Parts from the same invoice line could be shipped on different dates. You don't split invoice lines based on when they shipped. At least I've never seen that done in my decades working with ordering / invoice / shipment / credit systems.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
August 18, 2021 at 6:22 am
Jeff Moden wrote: Invoice line items can have a ship date
Only with a bad data model. Parts from the same invoice line could be shipped on different dates. You don't split invoice lines based on when they shipped. At least I've never seen that done in my decades working with ordering / invoice / shipment / credit systems.
Yeah... I do agree that there should be a shipment table and the InvoiceDetail table should contain ShipmentIDs for the line items in the invoice detail table. Might even be able to justify a "bridge" table but a lot of folks would consider that to be some serious overkill.
Of course, other people that just want to do reporting will disagree with us. It's the ol' "data mart" argument. Not saying that's right. I'm just saying that's where a lot of folks end up. Heh... long live the duplication of data for reporting, eh? It helps us justify buying extra hardware. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
August 20, 2021 at 11:08 pm
You were correct with the "It's the ol' 'data mart'" bit... It's kind of a hybrid table.
Will that affect the Delta output?
Thanks.
August 21, 2021 at 3:35 pm
You were correct with the "It's the ol' 'data mart'" bit... It's kind of a hybrid table.
Will that affect the Delta output? Thanks.
No - but I ask again: have you tried using EXCEPT - and if so, what were the results? You were also asked to provide the expected results from your sample data - still waiting on that information.
It has not been made clear what you want to accomplish. Is the result you want a table that shows the previous row - and the new row? Or do you just need to identify the changes so they can be applied to the final table? Something else?
If you want a table that contains the history for a row - then you could set up and use a temporal table. That would be the simpler method to implement - then all you need to do is update rows in that table where there is a difference and insert new rows. The system then generates the entries in the history table for you.
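As a minimal sketch of that temporal-table option (SQL Server 2016+ required; the table name, history table name, and trimmed column list here are illustrative, not the OP's actual schema):

```sql
--===== Sketch only: a system-versioned temporal table. SQL Server keeps the
     -- history rows in dbo.Invoices_History automatically on UPDATE/DELETE.
CREATE TABLE dbo.Invoices
(
     InvoiceNbr  VARCHAR(8) NOT NULL
    ,Line        INT        NOT NULL
    ,SubLine     INT        NOT NULL
    ,InvoiceDate DATE       NOT NULL
    ,ShipDate    DATE       NOT NULL
    ,ValidFrom   DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL
    ,ValidTo     DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL
    ,PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    ,PRIMARY KEY CLUSTERED (InvoiceNbr, Line, SubLine)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.Invoices_History));
```

With that in place, ordinary UPDATE and INSERT statements are all that's needed - the before-images land in the history table without any extra code.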
Either way - you can use EXCEPT to identify the rows in the staging table where a value has changed or the row is new.
SELECT InvoiceNbr
     , Line
     , SubLine
     , InvoiceDate
     , ShipDate
     , MwAcctType
     , Channel
  FROM #staging
EXCEPT
SELECT InvoiceNbr
     , Line
     , SubLine
     , InvoiceDate
     , ShipDate
     , MwAcctType
     , Channel
  FROM #daily_invoices;
You can then use the above with MERGE - or use separate UPDATE/INSERT statements with that data as the source for the update/insert.
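For example, one possible shape for the separate-statement route (a sketch only, untested against the OP's real data - the EXCEPT result feeds an UPDATE for changed keys, and a NOT EXISTS check handles the brand-new keys):

```sql
--===== Sketch: UPDATE existing keys using the EXCEPT delta as the source.
     -- New keys fall out of the join and are handled by the INSERT below.
WITH delta AS
(
    SELECT InvoiceNbr, Line, SubLine, InvoiceDate, ShipDate, MwAcctType, Channel
      FROM #staging
    EXCEPT
    SELECT InvoiceNbr, Line, SubLine, InvoiceDate, ShipDate, MwAcctType, Channel
      FROM #daily_invoices
)
UPDATE tgt
   SET InvoiceDate = d.InvoiceDate
      ,ShipDate    = d.ShipDate
      ,MwAcctType  = d.MwAcctType
      ,Channel     = d.Channel
  FROM #daily_invoices tgt
  JOIN delta d
    ON tgt.InvoiceNbr = d.InvoiceNbr
   AND tgt.Line       = d.Line
   AND tgt.SubLine    = d.SubLine;

--===== Sketch: INSERT the keys that don't exist in the target at all.
INSERT INTO #daily_invoices
       (InvoiceNbr, Line, SubLine, InvoiceDate, ShipDate, MwAcctType, Channel)
SELECT s.InvoiceNbr, s.Line, s.SubLine, s.InvoiceDate, s.ShipDate, s.MwAcctType, s.Channel
  FROM #staging s
 WHERE NOT EXISTS (SELECT 1
                     FROM #daily_invoices t
                    WHERE t.InvoiceNbr = s.InvoiceNbr
                      AND t.Line       = s.Line
                      AND t.SubLine    = s.SubLine);
```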
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
August 22, 2021 at 12:16 am
Thanks - EXCEPT seems to work for what I need.
Thanks again!!
I used to use EXCEPT but usually don't anymore because an old friend showed me long ago just how slow it can actually be, and Andy Leonard confirmed that HashBytes is actually faster, especially since it allows all 4 conditions to be easily determined in a single pass of the tables. EXCEPT also doesn't support the idea of doing a Before'n'After check of the 4 possible conditions rows in a staging table can have. And for my production work, EXCEPT actually did crawl compared to the HashBytes method, and that was good enough for me. It was an "InSitu" test. 😀
First, just to keep things together, here's a reworked version of the previously provided test data.
--===== Create and populate the table that contains the permanent data.
DROP TABLE IF EXISTS #daily_invoices;
GO
CREATE TABLE #daily_invoices
(
InvoiceNbr VARCHAR(8) NOT NULL
,Line INT NOT NULL
,SubLine INT NOT NULL
,InvoiceDate DATE NOT NULL
,ShipDate DATE NOT NULL
,MwAcctType VARCHAR(25) NULL
,Channel NVARCHAR(12) NULL
,HashMatch AS HASHBYTES('SHA1',CONCAT(InvoiceDate,'|',ShipDate,'|',MwAcctType,'|',Channel)) PERSISTED
,PRIMARY KEY CLUSTERED (InvoiceNbr,Line,SubLine)
)
;
INSERT INTO #daily_invoices WITH (TABLOCK)
       (InvoiceNbr, Line, SubLine, InvoiceDate, ShipDate, MwAcctType, Channel) --Explicit list needed because HashMatch is computed.
VALUES (78123,1,0,'8-13-2021','8-10-2021',1002,'rrb')
,(78124,1,0,'8-13-2021','8-10-2021',1002,'rrb')
,(78127,1,0,'8-13-2021','8-10-2021',1002,'rrb')
,(78128,1,0,'8-13-2021','8-10-2021',1002,'rrb')
,(78129,1,0,'8-13-2021','8-10-2021',1002,'rrb')
;
SELECT *
FROM #daily_invoices
;
--===== Create and populate the table that contains the staging data.
DROP TABLE IF EXISTS #staging;
GO
CREATE TABLE #staging
(
InvoiceNbr VARCHAR(8) NOT NULL
,Line INT NOT NULL
,SubLine INT NOT NULL
,InvoiceDate DATE NOT NULL
,ShipDate DATE NOT NULL
,MwAcctType VARCHAR(25) NULL
,Channel NVARCHAR(12) NULL
,HashMatch AS HASHBYTES('SHA1',CONCAT(InvoiceDate,'|',ShipDate,'|',MwAcctType,'|',Channel)) PERSISTED
--,RowType CHAR(1)
,INDEX NCI_Inv(InvoiceNbr,Line,SubLine)
)
;
INSERT INTO #staging WITH (TABLOCK)
       (InvoiceNbr, Line, SubLine, InvoiceDate, ShipDate, MwAcctType, Channel) --Explicit list needed because HashMatch is computed.
VALUES (78123,1,0,'8-13-2021','8-10-2021',1002,'rrb')
,(78124,1,0,'8-13-2021','8-10-2021',1002,'rrb')
,(78125,1,0,'8-13-2021','8-10-2021',1002,'rrb')
,(78126,1,0,'8-13-2021','8-10-2021',1002,'rrb')
,(78127,1,0,'8-13-2021','8-10-2021',1001,'rrb')
,(78128,1,0,'8-13-2021','8-10-2021',1021,'rrz')
;
SELECT *
FROM #staging
;
Then, a fully visible return of both tables in a FULL JOIN. Obviously, we wouldn't display all of this data in real life. Instead, we'd likely only update the "RowType" byte as a pre-process. You could also use 3 of the conditions to build the 3 statements for an UPSERT. Or... if you really insisted on using MERGE, the whole HashMatch method might make for a faster merge (I don't know for sure because I flat out refuse to use MERGE) because it's only comparing one column and the key columns instead of all the columns.
--===== Display the differences. This could be used to guide an "Upsert" instead of using MERGE,
-- which I still worry about because of all the insane problems it used to have.
SELECT *
,RowType = CASE
WHEN tgt.InvoiceNbr IS NULL THEN 'N' --NEW, need to add row to target
WHEN src.InvoiceNbr IS NULL THEN 'D' --DELETE, need to delete row from target (OPTIONAL??)
WHEN src.HashMatch = tgt.HashMatch THEN 'S' --Same, No change, No action required
ELSE 'U' --target needs to be updated from the row in the source.
END
FROM #daily_invoices tgt
FULL JOIN #staging src
ON tgt.InvoiceNbr = src.InvoiceNbr
AND tgt.Line = src.Line
AND tgt.SubLine = src.SubLine
;
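As one possible follow-on (a sketch only, not tested at the OP's volumes), the same hash comparison can drive a two-statement UPSERT directly - the 'U' rows are those whose HashMatch differs, and the 'N' rows are the keys missing from the target (the optional 'D' deletes would mirror the INSERT with the tables swapped):

```sql
--===== Sketch: the 'U' rows. Only rows whose hash differs need an UPDATE.
UPDATE tgt
   SET InvoiceDate = src.InvoiceDate
      ,ShipDate    = src.ShipDate
      ,MwAcctType  = src.MwAcctType
      ,Channel     = src.Channel
  FROM #daily_invoices tgt
  JOIN #staging src
    ON tgt.InvoiceNbr = src.InvoiceNbr
   AND tgt.Line       = src.Line
   AND tgt.SubLine    = src.SubLine
 WHERE tgt.HashMatch <> src.HashMatch;

--===== Sketch: the 'N' rows. Keys in staging that don't exist in the target.
INSERT INTO #daily_invoices
       (InvoiceNbr, Line, SubLine, InvoiceDate, ShipDate, MwAcctType, Channel)
SELECT src.InvoiceNbr, src.Line, src.SubLine, src.InvoiceDate, src.ShipDate, src.MwAcctType, src.Channel
  FROM #staging src
 WHERE NOT EXISTS (SELECT 1
                     FROM #daily_invoices tgt
                    WHERE tgt.InvoiceNbr = src.InvoiceNbr
                      AND tgt.Line       = src.Line
                      AND tgt.SubLine    = src.SubLine);
```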
The demo code above runs against the table creation and population script above with no changes required, as it did for me.
And, to emphasize the performance, this method is lightning fast compared to EXCEPT and gets comparatively much faster the more columns you need to check. You'd think that the CONCAT and HASHBYTES functions would slow things down a whole lot compared to EXCEPT, but they don't.
Collisions: any time you use HashBytes, there is a chance of collisions, no matter how slim. If two hash values are equal, the underlying rows could still be different. If they're not equal, then they're guaranteed to be different.
That could leave a small number of potentially different values identified as being the same when they're not. To prevent that issue (especially on the huge runs that I frequently do), I add another set of HashMatch columns (1 in each table) where I simply move the first column in the CONCATenation to be the last column. That makes it virtually impossible to miss any non-matches and is still a lot faster than EXCEPT - especially for a large number of columns, but even with a small number of columns - because EXCEPT won't give you all 4 UPSERT/MERGE conditions. It only gives you 1.
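A sketch of that second hash column for the #staging table (the #daily_invoices table would get the matching column; "HashMatch2" is an illustrative name, and the column order simply rotates the first CONCAT column to the end):

```sql
--===== Sketch: second persisted hash with the column order rotated, so a
     -- collision would have to occur in BOTH hashes at once to be missed.
ALTER TABLE #staging ADD HashMatch2 AS
    HASHBYTES('SHA1', CONCAT(ShipDate,'|',MwAcctType,'|',Channel,'|',InvoiceDate)) PERSISTED;
```

The comparison in the FULL JOIN would then check both columns, e.g. `src.HashMatch = tgt.HashMatch AND src.HashMatch2 = tgt.HashMatch2` for the 'S' case.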
--Jeff Moden
Change is inevitable... Change for the better is not.
August 22, 2021 at 2:32 am
I almost forgot. One "trick" I do (that I didn't do here) is to add an "_stg" suffix to the column names in the staging table. That way, when I'm working with both tables, the column names are different, just in case I want to use INTO to send the output to a TempTable and still be able to use * on some really wide tables. The column names will also still sort nicely if I have to look at all the column names in a single meta-data query.
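A cut-down sketch of that idea (#staging2 and #compare are illustrative names, and the column list is trimmed to just the key columns): with distinct names on each side, SELECT * from the join can go straight INTO a temp table without duplicate-column-name errors.

```sql
--===== Sketch: "_stg"-suffixed staging columns make SELECT * ... INTO legal
     -- across a join, since no column name appears on both sides.
CREATE TABLE #staging2
(
     InvoiceNbr_stg VARCHAR(8) NOT NULL
    ,Line_stg       INT        NOT NULL
    ,SubLine_stg    INT        NOT NULL
);

SELECT *
  INTO #compare
  FROM #daily_invoices tgt
  FULL JOIN #staging2 src
    ON tgt.InvoiceNbr = src.InvoiceNbr_stg
   AND tgt.Line       = src.Line_stg
   AND tgt.SubLine    = src.SubLine_stg;
```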
--Jeff Moden
Change is inevitable... Change for the better is not.
August 22, 2021 at 9:16 pm
Thanks, will test the code out!!
August 25, 2021 at 12:22 am
When I run the piece for creating #staging:
Column name or number of supplied values does not match table definition.
Thx,
August 25, 2021 at 5:58 pm
When I run the piece for creating #staging:
Column name or number of supplied values does not match table definition.
Thx,
I'd gotten interrupted while I was working on the code. I had two slightly different versions of the tables in code and posted the wrong one.
I've corrected the code. Please try again.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 25, 2021 at 7:15 pm
Jeff...do you still post your "1 million row" test scripts?
Would love to see one for this Q.
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
August 25, 2021 at 7:21 pm
Worked perfectly, thanks!!!!