July 21, 2021 at 5:23 am
Good Morning,
I have a question about an UPDATE statement between two tables (Stage_Table and Final_Table).
Final_Table contains the full data set, which picks up new records every day as well as changes over time.
Both tables have the same columns, for example col1, col2, col3, ... col20.
My question is this: I need to update the column values of existing records in Final_Table every day from Stage_Table, but only under the two scenarios below:
1. If the stage table has NULL and the final table has a value, then no update happens to that column/field.
2. An update only happens when something has actually changed in the intake (stage) table columns col3 to col10.
-- For the above two scenarios I have used the query below. Is there a better way to do it? Please advise.
UPDATE FT
SET    FT.col3  = CASE WHEN S.col3 IS NOT NULL AND S.col3 <> '' THEN S.col3 ELSE FT.col3 END,
       FT.col4  = S.col4,
       FT.col5  = S.col5,
       FT.col6  = S.col6,
       FT.col7  = S.col7,
       FT.col8  = S.col8,
       FT.col9  = S.col9,
       FT.col10 = S.col10
FROM   Final_Table FT
JOIN   Stage_Table S
       ON  FT.col1 = S.col1
       AND FT.col2 = S.col2
WHERE  FT.col3 <> S.col3
   OR  FT.col4 <> S.col4
   OR  FT.col5 <> S.col5
   OR  FT.col6 <> S.col6
   OR  FT.col7 <> S.col7
   OR  FT.col8 <> S.col8
   OR  FT.col9 <> S.col9
   OR  FT.col10 <> S.col10;
Thank you
Asita
July 21, 2021 at 2:12 pm
Why not go crazy and use a WHERE clause to find the records that are different and only update those?
July 21, 2021 at 3:09 pm
This sounds like a prime example for the HASHBYTES function, where you hash all of the relevant columns in each table, join on the key, and update the destination table wherever the hashes differ. Sorry, I don't have a tested example of how to do it, but hopefully a quick Google plus the rough sketch below can help you out.
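Roughly what I mean, assuming col1/col2 are the key and col3 to col10 are the compared columns (a sketch only, untested):
-- CONCAT_WS requires SQL Server 2017+. Caveat: it skips NULL arguments, so
-- ('a', NULL) and (NULL, 'a') produce the same hash input; wrap the columns
-- in ISNULL with a sentinel value if that matters for your data.
UPDATE FT
SET    FT.col3 = S.col3, FT.col4 = S.col4, FT.col5 = S.col5, FT.col6 = S.col6,
       FT.col7 = S.col7, FT.col8 = S.col8, FT.col9 = S.col9, FT.col10 = S.col10
FROM   Final_Table FT
JOIN   Stage_Table S
       ON  FT.col1 = S.col1
       AND FT.col2 = S.col2
WHERE  HASHBYTES('SHA2_256', CONCAT_WS('|', S.col3, S.col4, S.col5, S.col6, S.col7, S.col8, S.col9, S.col10))
    <> HASHBYTES('SHA2_256', CONCAT_WS('|', FT.col3, FT.col4, FT.col5, FT.col6, FT.col7, FT.col8, FT.col9, FT.col10));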
July 21, 2021 at 4:03 pm
Possibly a dumb question, but do you need those WHERE conditions? If the data is the same, setting a value of "10" to "10" is not going to hurt too badly if the table isn't HUGE and there are not a lot of indexes. The only case you want to be careful of is your first condition - if staging has NULL then you want to keep the final value - but that can be handled by CASE conditions.
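Something like this, for example (a sketch only - NULLIF/COALESCE here assume string columns, so adjust for your datatypes):
-- Keep the final value whenever the staged value is NULL (or an empty string)
UPDATE FT
SET    FT.col3 = COALESCE(NULLIF(S.col3, ''), FT.col3),
       FT.col4 = COALESCE(NULLIF(S.col4, ''), FT.col4)  -- ... and so on through col10
FROM   Final_Table FT
JOIN   Stage_Table S
       ON  FT.col1 = S.col1
       AND FT.col2 = S.col2;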
Now, if that were billions of rows, OR there were a lot of nonclustered indexes, you would benefit from updating only a subset of the data, in which case those WHERE conditions are beneficial. The same goes if you have a computed column or a trigger keeping track of a modified date - then you will want to make sure you only update rows that have actually changed.
The other thing to ask is: what problem are you trying to solve? If you are trying to reduce the time that the "Final Table" is unusable, then reducing the data set in the Staging Table that gets written to the Final Table is the approach I would take. What I mean is: do all of the work in the Staging Table so you have a smaller data set to push across to the final table. You could add a column to the Staging Table to track data status, plus a data load datestamp, and then:
1. Remove the truncate step.
2. When you pull the data from the feed into the staging table, set the status to "unprocessed".
3. Once the data has been pulled in, look through the staging table ONLY for data that changed from the previous run, which you can determine using the data load datestamp field.
4. Update the status to "data changed" or "no change", depending on the logic that determines what should be pulled across to the final table. You could even handle the NULL cases here by updating the value in the staging table so that, if it is NULL, it becomes the previous value.
5. Remove the previous load's data, as it is no longer needed (unless you want it for reviewing the process, but the staging table will grow pretty quickly if you don't remove historical data).
6. Push across any rows where the data has changed to the final table.
The advantage of this approach is that the Final Table will see less blocking, and the WHERE clause on your UPDATE statement can be reduced to a single condition - where status is "data changed". The disadvantages are that the overall process may be slower, you will have more blocking on the staging table, and you will use more disk long term.
I would make the Status field a TINYINT datatype, since you may want more statuses in the future if logic needs to be applied differently when moving to the final table. But my approach with ETL is to do all the work in the staging table, so I have only a small bit of logic and as small a data set as possible to push across to the final table. It MAY put more strain on the SQL instance (or SSIS), but it means more uptime for my final tables.
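Roughly what the status handling might look like (a sketch, untested - the Status column and its 0/1/2 values are made up for illustration, and the datestamp handling is omitted):
-- 0 = unprocessed, 1 = no change, 2 = data changed
UPDATE S
SET    S.Status = CASE WHEN EXISTS (SELECT S.col3, S.col4, S.col5, S.col6, S.col7, S.col8, S.col9, S.col10
                                    INTERSECT
                                    SELECT FT.col3, FT.col4, FT.col5, FT.col6, FT.col7, FT.col8, FT.col9, FT.col10)
                       THEN 1  -- identical (INTERSECT compares NULLs safely)
                       ELSE 2  -- something changed
                  END
FROM   Stage_Table S
JOIN   Final_Table FT
       ON  FT.col1 = S.col1
       AND FT.col2 = S.col2
WHERE  S.Status = 0;

-- The push to the final table then needs only one WHERE condition:
UPDATE FT
SET    FT.col3 = S.col3, FT.col4 = S.col4, FT.col5 = S.col5, FT.col6 = S.col6,
       FT.col7 = S.col7, FT.col8 = S.col8, FT.col9 = S.col9, FT.col10 = S.col10
FROM   Final_Table FT
JOIN   Stage_Table S
       ON  FT.col1 = S.col1
       AND FT.col2 = S.col2
WHERE  S.Status = 2;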
Now, if you have a downtime window and you are not exceeding it, there may not be much benefit in working to improve the process. If you do have a lot of nonclustered indexes, you may benefit from disabling those before the data load and rebuilding them afterwards.
Just my 2 cents!
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
July 21, 2021 at 10:03 pm
Hello David.
Can you please provide an example of this?
Thank you
asita
July 21, 2021 at 10:11 pm
Hello Brian,
This table has about 200K rows and 20 columns. Daily changes are maybe around 100 new records, plus changes to about 100 to 200 records across a few columns.
There is only a primary key on the table, nothing else in terms of indexes.
I do have an UpdatedDate column in Final_Table, which gets updated when a real update (a non-null change between the stage and final tables) happens.
Can you please advise on a query for it?
Thank you
Asita
July 21, 2021 at 10:12 pm
Hello Linden.
Sorry, can you please provide a sample of what you mean by using a WHERE clause?
Thank you
Asita
July 28, 2021 at 9:41 am
The pattern we use for a similar process is this.
MERGE INTO final_table
USING (SELECT key_column_1
, key_column_2
, other_column_1
...
, other_column_n
FROM stage_table
EXCEPT
SELECT key_column_1
, key_column_2
, other_column_1
...
, other_column_n
FROM final_table
) as changes
ON final_table.key_column_1 = changes.key_column_1
AND final_table.key_column_2 = changes.key_column_2
WHEN MATCHED THEN
UPDATE
SET other_column_1 = changes.other_column_1
...
, other_column_n = changes.other_column_n
WHEN NOT MATCHED THEN
INSERT (key_column_1
, key_column_2
, other_column_1
...
, other_column_n)
VALUES (changes.key_column_1
, changes.key_column_2
, changes.other_column_1
...
, changes.other_column_n);
If you prefer, you can use a similar pattern for UPDATE/INSERT instead of MERGE.
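For reference, the UPDATE/INSERT version of the same pattern might look something like this (a sketch - the #changes temp table is introduced just for illustration):
-- Capture the new/changed rows once, using the same EXCEPT trick
SELECT key_column_1
     , key_column_2
     , other_column_1 /* ... */
     , other_column_n
INTO   #changes
FROM  (SELECT key_column_1, key_column_2, other_column_1 /* ... */, other_column_n
       FROM   stage_table
       EXCEPT
       SELECT key_column_1, key_column_2, other_column_1 /* ... */, other_column_n
       FROM   final_table) AS diff;

-- Update the rows that already exist in the final table
UPDATE f
SET    f.other_column_1 = c.other_column_1 /* ... */
     , f.other_column_n = c.other_column_n
FROM   final_table f
JOIN   #changes c
       ON  f.key_column_1 = c.key_column_1
       AND f.key_column_2 = c.key_column_2;

-- Insert the rows that do not exist yet
INSERT final_table (key_column_1, key_column_2, other_column_1 /* ... */, other_column_n)
SELECT c.key_column_1, c.key_column_2, c.other_column_1 /* ... */, c.other_column_n
FROM   #changes c
WHERE  NOT EXISTS (SELECT 1
                   FROM   final_table f
                   WHERE  f.key_column_1 = c.key_column_1
                     AND  f.key_column_2 = c.key_column_2);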
July 28, 2021 at 4:04 pm
I think you are missing the FROM clause in the second query - it needs to be the final table. This avoids the need to use something like HASHBYTES to compare each row, and it generates the subset of rows that need to be either inserted or updated.
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
July 29, 2021 at 7:37 am
Thanks Jeffrey, I've corrected my sample code.
July 29, 2021 at 3:42 pm
BTW - I like this solution and think it works much better than a traditional UPSERT. I need to *remember* that we can use EXCEPT to identify the subset of rows to be inserted/updated. A similar method can be used in a [NOT] EXISTS, using INTERSECT or EXCEPT to identify the rows.
SELECT ...
FROM db1.tbl a
JOIN db2.tbl b ON a.keycol = b.keycol
WHERE NOT EXISTS (SELECT a.col1, ...
INTERSECT
SELECT b.col1, ...)
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs