August 18, 2021 at 12:04 am
You are correct on the PK.
Thanks.
August 18, 2021 at 4:49 am
Bruin wrote: With the same Line, SubLine?
Look at the PK you've chosen. IMHO, only the first 3 columns of InvoiceNbr, Line, and SubLine should make up the PK.
I'm more confused than that and have a more fundamental q. Invoices and shipments are two different data entities. Invoices don't have a "ship date", you don't ship invoices, you ship shipments, which can have several different shipment dates.
To the OP: is your table for invoices or for shipments??
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
August 18, 2021 at 4:53 am
Jeff Moden wrote: Bruin wrote: With the same Line, SubLine?
Look at the PK you've chosen. IMHO, only the first 3 columns of InvoiceNbr, Line, and SubLine should make up the PK.
I'm more confused than that and have a more fundamental q. Invoices and shipments are two different data entities. Invoices don't have a "ship date", you don't ship invoices, you ship shipments, which can have several different shipment dates.
Is that table for invoices or for shipments??
Invoice line items can have a ship date but I agree that there's a bit of a "fog on the bog" for this one.
I'm still doing my FTE work at this late hour and won't get to a proposed solution tonight but a little trick with HASHBYTEs for the non-PK columns will make for a nasty fast solution leading to a nice, simple, fast UPSERT with the comparison the OP wants.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 18, 2021 at 5:03 am
Invoice line items can have a ship date
Only with a bad data model. Parts from the same invoice line could be shipped on different dates. You don't split invoice lines based on when they shipped. At least I've never seen that done in my decades working with ordering / invoice / shipment / credit systems.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
August 18, 2021 at 6:22 am
Jeff Moden wrote: Invoice line items can have a ship date
Only with a bad data model. Parts from the same invoice line could be shipped on different dates. You don't split invoice lines based on when they shipped. At least I've never seen that done in my decades working with ordering / invoice / shipment / credit systems.
Yeah... I do agree that there should be a shipment table and the InvoiceDetail table should contain ShipmentIDs for the line items in the invoice detail table. Might even be able to justify a "bridge" table but a lot of folks would consider that to be some serious overkill.
Of course, other people that just want to do reporting will disagree with us. It's the ol' "data mart" argument. Not saying that's right. I'm just saying that's where a lot of folks end up. Heh... long live the duplication of data for reporting, eh? It helps us justify buying extra hardware. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
August 20, 2021 at 11:08 pm
You were correct with the "It's the ol' 'data mart'" bit... It's kind of a hybrid table.
Will that affect the Delta output?
Thanks.
August 21, 2021 at 3:35 pm
You were correct with the "It's the ol' 'data mart'" bit... It's kind of a hybrid table.
Will that affect the Delta output? Thanks.
No - but I ask again: have you tried using EXCEPT - and if so, what were the results? You were also asked to provide the expected results from your sample data - still waiting on that information.
It has not been made clear what you want to accomplish. Is the result you want a table that shows the previous row - and the new row? Or do you just need to identify the changes so they can be applied to the final table? Something else?
If you want a table that contains the history for a row - then you could set up and use a temporal table. That would be the simpler method to implement - then all you need to do is update rows in that table where there is a difference and insert new rows. The system then generates the entries in the history table for you.
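As a minimal sketch of that temporal-table option (SQL Server 2016+ required; the table name, history table name, and trimmed column list here are illustrative, not the OP's actual schema):

```sql
--===== Sketch only: a system-versioned temporal table. SQL Server keeps the
     -- history rows in dbo.Invoices_History automatically on UPDATE/DELETE.
CREATE TABLE dbo.Invoices
(
     InvoiceNbr  VARCHAR(8) NOT NULL
    ,Line        INT        NOT NULL
    ,SubLine     INT        NOT NULL
    ,InvoiceDate DATE       NOT NULL
    ,ShipDate    DATE       NOT NULL
    ,ValidFrom   DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL
    ,ValidTo     DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL
    ,PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    ,PRIMARY KEY CLUSTERED (InvoiceNbr, Line, SubLine)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.Invoices_History));
```

With that in place, ordinary UPDATE and INSERT statements are all that's needed - the before-images land in the history table without any extra code.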
Either way - you can use EXCEPT to identify the rows in the staging table where a value has changed or the row is new.
SELECT InvoiceNbr
     , Line
     , SubLine
     , InvoiceDate
     , ShipDate
     , MwAcctType
     , Channel
  FROM #staging
EXCEPT
SELECT InvoiceNbr
     , Line
     , SubLine
     , InvoiceDate
     , ShipDate
     , MwAcctType
     , Channel
  FROM #daily_invoices;
You can then use the above with MERGE - or use separate UPDATE/INSERT statements with that data as the source for the update/insert.
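For example, one possible shape for the separate-statement route (a sketch only, untested against the OP's real data - the EXCEPT result feeds an UPDATE for changed keys, and a NOT EXISTS check handles the brand-new keys):

```sql
--===== Sketch: UPDATE existing keys using the EXCEPT delta as the source.
     -- New keys fall out of the join and are handled by the INSERT below.
WITH delta AS
(
    SELECT InvoiceNbr, Line, SubLine, InvoiceDate, ShipDate, MwAcctType, Channel
      FROM #staging
    EXCEPT
    SELECT InvoiceNbr, Line, SubLine, InvoiceDate, ShipDate, MwAcctType, Channel
      FROM #daily_invoices
)
UPDATE tgt
   SET InvoiceDate = d.InvoiceDate
      ,ShipDate    = d.ShipDate
      ,MwAcctType  = d.MwAcctType
      ,Channel     = d.Channel
  FROM #daily_invoices tgt
  JOIN delta d
    ON tgt.InvoiceNbr = d.InvoiceNbr
   AND tgt.Line       = d.Line
   AND tgt.SubLine    = d.SubLine;

--===== Sketch: INSERT the keys that don't exist in the target at all.
INSERT INTO #daily_invoices
       (InvoiceNbr, Line, SubLine, InvoiceDate, ShipDate, MwAcctType, Channel)
SELECT s.InvoiceNbr, s.Line, s.SubLine, s.InvoiceDate, s.ShipDate, s.MwAcctType, s.Channel
  FROM #staging s
 WHERE NOT EXISTS (SELECT 1
                     FROM #daily_invoices t
                    WHERE t.InvoiceNbr = s.InvoiceNbr
                      AND t.Line       = s.Line
                      AND t.SubLine    = s.SubLine);
```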
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
August 22, 2021 at 12:16 am
Thanks - EXCEPT seems to work for what I need.
Thanks again!!
I used to use EXCEPT but usually don't anymore because an old friend showed me long ago just how slow it can actually be, and Andy Leonard confirmed that HashBytes is actually faster, especially since it allows all 4 conditions to be easily determined in a single pass of the tables. EXCEPT also doesn't support the idea of doing a Before'n'After check of the 4 possible conditions rows in a staging table can have. And for my production work, EXCEPT actually did crawl compared to the HashBytes method, and that was good enough for me. It was an "InSitu" test. 😀
First, just to keep things together, here's a reworked version of the previously provided test data.
--===== Create and populate the table that contains the permanent data.
DROP TABLE IF EXISTS #daily_invoices;
GO
CREATE TABLE #daily_invoices
(
InvoiceNbr VARCHAR(8) NOT NULL
,Line INT NOT NULL
,SubLine INT NOT NULL
,InvoiceDate DATE NOT NULL
,ShipDate DATE NOT NULL
,MwAcctType VARCHAR(25) NULL
,Channel NVARCHAR(12) NULL
,HashMatch AS HASHBYTES('SHA1',CONCAT(InvoiceDate,'|',ShipDate,'|',MwAcctType,'|',Channel)) PERSISTED
,PRIMARY KEY CLUSTERED (InvoiceNbr,Line,SubLine)
)
;
INSERT INTO #daily_invoices WITH (TABLOCK)
       (InvoiceNbr, Line, SubLine, InvoiceDate, ShipDate, MwAcctType, Channel) --Explicit list needed because HashMatch is computed.
VALUES (78123,1,0,'8-13-2021','8-10-2021',1002,'rrb')
,(78124,1,0,'8-13-2021','8-10-2021',1002,'rrb')
,(78127,1,0,'8-13-2021','8-10-2021',1002,'rrb')
,(78128,1,0,'8-13-2021','8-10-2021',1002,'rrb')
,(78129,1,0,'8-13-2021','8-10-2021',1002,'rrb')
;
SELECT *
FROM #daily_invoices
;
--===== Create and populate the table that contains the staging data.
DROP TABLE IF EXISTS #staging;
GO
CREATE TABLE #staging
(
InvoiceNbr VARCHAR(8) NOT NULL
,Line INT NOT NULL
,SubLine INT NOT NULL
,InvoiceDate DATE NOT NULL
,ShipDate DATE NOT NULL
,MwAcctType VARCHAR(25) NULL
,Channel NVARCHAR(12) NULL
,HashMatch AS HASHBYTES('SHA1',CONCAT(InvoiceDate,'|',ShipDate,'|',MwAcctType,'|',Channel)) PERSISTED
--,RowType CHAR(1)
,INDEX NCI_Inv(InvoiceNbr,Line,SubLine)
)
;
INSERT INTO #staging WITH (TABLOCK)
       (InvoiceNbr, Line, SubLine, InvoiceDate, ShipDate, MwAcctType, Channel) --Explicit list needed because HashMatch is computed.
VALUES (78123,1,0,'8-13-2021','8-10-2021',1002,'rrb')
,(78124,1,0,'8-13-2021','8-10-2021',1002,'rrb')
,(78125,1,0,'8-13-2021','8-10-2021',1002,'rrb')
,(78126,1,0,'8-13-2021','8-10-2021',1002,'rrb')
,(78127,1,0,'8-13-2021','8-10-2021',1001,'rrb')
,(78128,1,0,'8-13-2021','8-10-2021',1021,'rrz')
;
SELECT *
FROM #staging
;
Then, a fully visible return of both tables in a FULL JOIN. Obviously, we wouldn't display all of this data in real life. Instead, we'd likely only update the "RowType" byte as a pre-process. You could also use 3 of the conditions to build the 3 statements for an UPSERT. Or... if you really insisted on using MERGE, the whole HashMatch method might make for a faster merge (I don't know for sure because I flat out refuse to use MERGE) because it's only comparing one column and the key columns instead of all the columns.
--===== Display the differences. This could be used to guide an "Upsert" instead of using MERGE,
-- which I still worry about because of all the insane problems it used to have.
SELECT *
,RowType = CASE
WHEN tgt.InvoiceNbr IS NULL THEN 'N' --NEW, need to add row to target
WHEN src.InvoiceNbr IS NULL THEN 'D' --DELETE, need to delete row from target (OPTIONAL??)
WHEN src.HashMatch = tgt.HashMatch THEN 'S' --Same, No change, No action required
ELSE 'U' --target needs to be updated from the row in the source.
END
FROM #daily_invoices tgt
FULL JOIN #staging src
ON tgt.InvoiceNbr = src.InvoiceNbr
AND tgt.Line = src.Line
AND tgt.SubLine = src.SubLine
;
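As one possible follow-on (a sketch only, not tested at the OP's volumes), the same hash comparison can drive a two-statement UPSERT directly - the 'U' rows are those whose HashMatch differs, and the 'N' rows are the keys missing from the target (the optional 'D' deletes would mirror the INSERT with the tables swapped):

```sql
--===== Sketch: the 'U' rows. Only rows whose hash differs need an UPDATE.
UPDATE tgt
   SET InvoiceDate = src.InvoiceDate
      ,ShipDate    = src.ShipDate
      ,MwAcctType  = src.MwAcctType
      ,Channel     = src.Channel
  FROM #daily_invoices tgt
  JOIN #staging src
    ON tgt.InvoiceNbr = src.InvoiceNbr
   AND tgt.Line       = src.Line
   AND tgt.SubLine    = src.SubLine
 WHERE tgt.HashMatch <> src.HashMatch;

--===== Sketch: the 'N' rows. Keys in staging that don't exist in the target.
INSERT INTO #daily_invoices
       (InvoiceNbr, Line, SubLine, InvoiceDate, ShipDate, MwAcctType, Channel)
SELECT src.InvoiceNbr, src.Line, src.SubLine, src.InvoiceDate, src.ShipDate, src.MwAcctType, src.Channel
  FROM #staging src
 WHERE NOT EXISTS (SELECT 1
                     FROM #daily_invoices tgt
                    WHERE tgt.InvoiceNbr = src.InvoiceNbr
                      AND tgt.Line       = src.Line
                      AND tgt.SubLine    = src.SubLine);
```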
The demo code above runs against the table creation and population script above with no changes required, as it did for me.
And, to emphasize the performance, this method is lightning fast compared to EXCEPT and gets comparatively much faster the more columns you need to check. You'd think that the CONCAT and HASHBYTES functions would slow things down a whole lot compared to EXCEPT, but they don't.
Collisions: any time you use HashBytes, there is a chance of collisions, no matter how slim. If two hash values are equal, the underlying rows could still be different. If they're not equal, then they're guaranteed to be different.
That could leave a small number of potentially different values identified as being the same when they're not. To prevent that issue (especially on the huge runs that I frequently do), I add another set of HashMatch columns (1 in each table) where I simply move the first column in the CONCATenation to be the last column. That makes it virtually impossible to miss any non-matches and is still a lot faster than EXCEPT - especially for a large number of columns, but even with a small number of columns - because EXCEPT won't give you all 4 UPSERT/MERGE conditions. It only gives you 1.
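A sketch of that second hash column for the #staging table (the #daily_invoices table would get the matching column; "HashMatch2" is an illustrative name, and the column order simply rotates the first CONCAT column to the end):

```sql
--===== Sketch: second persisted hash with the column order rotated, so a
     -- collision would have to occur in BOTH hashes at once to be missed.
ALTER TABLE #staging ADD HashMatch2 AS
    HASHBYTES('SHA1', CONCAT(ShipDate,'|',MwAcctType,'|',Channel,'|',InvoiceDate)) PERSISTED;
```

The comparison in the FULL JOIN would then check both columns, e.g. `src.HashMatch = tgt.HashMatch AND src.HashMatch2 = tgt.HashMatch2` for the 'S' case.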
--Jeff Moden
Change is inevitable... Change for the better is not.
August 22, 2021 at 2:32 am
I almost forgot. One "trick" I do (that I didn't do here) is to add an "_stg" suffix to the column names in the staging table. That way, when I'm working with both tables, the column names are different, just in case I want to use INTO to send the output to a TempTable and still be able to use * on some really wide tables. The column names will also still sort nicely if I have to look at all the column names in a single meta-data query.
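A cut-down sketch of that idea (#staging2 and #compare are illustrative names, and the column list is trimmed to just the key columns): with distinct names on each side, SELECT * from the join can go straight INTO a temp table without duplicate-column-name errors.

```sql
--===== Sketch: "_stg"-suffixed staging columns make SELECT * ... INTO legal
     -- across a join, since no column name appears on both sides.
CREATE TABLE #staging2
(
     InvoiceNbr_stg VARCHAR(8) NOT NULL
    ,Line_stg       INT        NOT NULL
    ,SubLine_stg    INT        NOT NULL
);

SELECT *
  INTO #compare
  FROM #daily_invoices tgt
  FULL JOIN #staging2 src
    ON tgt.InvoiceNbr = src.InvoiceNbr_stg
   AND tgt.Line       = src.Line_stg
   AND tgt.SubLine    = src.SubLine_stg;
```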
--Jeff Moden
Change is inevitable... Change for the better is not.
August 22, 2021 at 9:16 pm
Thanks, will test the code out!!
August 25, 2021 at 12:22 am
When I run the piece for creating #staging:
Column name or number of supplied values does not match table definition.
Thx,
August 25, 2021 at 5:58 pm
When I run the piece for creating #staging:
Column name or number of supplied values does not match table definition.
Thx,
I'd gotten interrupted while I was working on the code. I had two slightly different versions of the tables in code and posted the wrong one.
I've corrected the code. Please try again.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 25, 2021 at 7:15 pm
Jeff...do you still post your "1 million row" test scripts?
Would love to see one for this Q.
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
August 25, 2021 at 7:21 pm
Worked perfectly, thanks!!!!