April 5, 2020 at 11:12 am
This might be a beginner developer question, but I need some clarification. What is the best / most commonly used practice for loading data into a database table from the beginning of time (for example, data going back to January 1900)? Do people pull and process all of the data at once, or do they pull data in windows of time (for example, one month or one year at a time)? Also, when data is modified, say in June 2000, do people reprocess everything from the beginning of time, or just reprocess June 2000? I really appreciate your response.
April 5, 2020 at 1:03 pm
If the number of rows in your source table is low ... in the tens of thousands, maybe ... a full refresh is simplest and cleanest.
As soon as the full refresh starts taking more time and resources than you are comfortable with, you can move to an incremental approach. For that, you will need some way of tracking changes to your source data. There are multiple ways of doing this and the one you choose will depend on your scenario. A fairly straightforward way is to make sure that all rows have CreatedAt and ModifiedAt (datetime) columns (and good indexing). Once this is in place, you can do a SELECT [cols] WHERE ModifiedAt >= [last run date] and MERGE the results into your target table.
Note that this method requires additional refinement if there is a need to handle deletions too.
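To make that concrete, here is a rough sketch of the pattern (the table, column and log names are just examples; adapt them to your schema):

DECLARE @LastRunDate datetime = (SELECT MAX(RunDate) FROM dbo.LoadLog); -- wherever you record your runs

MERGE dbo.TargetTable AS tgt
USING
(
    SELECT Id, Col1, Col2, ModifiedAt
    FROM dbo.SourceTable
    WHERE ModifiedAt >= @LastRunDate  -- only rows changed since the last run
) AS src
    ON tgt.Id = src.Id
WHEN MATCHED THEN
    UPDATE SET tgt.Col1 = src.Col1
             , tgt.Col2 = src.Col2
             , tgt.ModifiedAt = src.ModifiedAt
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Col1, Col2, ModifiedAt)
    VALUES (src.Id, src.Col1, src.Col2, src.ModifiedAt);
-- Deletions in the source are not picked up by this; a soft-delete flag or a separate comparison is needed for those.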
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
April 5, 2020 at 1:43 pm
Thank you for your feedback; I appreciate your valuable time, skills and knowledge. Do you agree that if we have more than 5 million rows in the source table, we should start using an incremental approach (insert/update)?
April 5, 2020 at 1:53 pm
fransiscuscandra wrote: Thank you for your feedback; I appreciate your valuable time, skills and knowledge. Do you agree that if we have more than 5 million rows in the source table, we should start using an incremental approach (insert/update)?
Yes, that seems like it is getting too large for a full refresh. How frequently do you want to run this process? Daily? More often?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
April 5, 2020 at 2:25 pm
The process is supposed to run daily or weekly. I agree with your approach of processing data using SELECT [cols] WHERE ModifiedAt >= [last run date]. The reason I posted this question was to get feedback from database practitioners on their approaches and to sanity-check my own.
I have built a process in a stored procedure that uses time logic, so I can pull data from the source table based on a certain time window, for example: ModifiedAt BETWEEN (YYYYMM - 3) AND (YYYYMM + 3), where YYYYMM is an integer. It then performs inserts and updates into the destination fact table based on those YYYYMM partitions, so the destination table can be updated or rebuilt incrementally.
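Roughly, the logic looks something like this (the table and column names below are just placeholders for my real schema, PeriodKey stands in for the integer YYYYMM value, and DATEADD is used so the +/- 3 month window survives year boundaries):

DECLARE @Period int = 200006;  -- the YYYYMM being (re)processed
DECLARE @PeriodDate date = DATEFROMPARTS(@Period / 100, @Period % 100, 1);
DECLARE @FromKey int = YEAR(DATEADD(MONTH, -3, @PeriodDate)) * 100 + MONTH(DATEADD(MONTH, -3, @PeriodDate));
DECLARE @ToKey   int = YEAR(DATEADD(MONTH,  3, @PeriodDate)) * 100 + MONTH(DATEADD(MONTH,  3, @PeriodDate));

MERGE dbo.FactTable AS tgt
USING
(
    SELECT Id, PeriodKey, Amount
    FROM dbo.SourceTable
    WHERE PeriodKey BETWEEN @FromKey AND @ToKey  -- only the YYYYMM window being rebuilt
) AS src
    ON tgt.Id = src.Id
WHEN MATCHED THEN
    UPDATE SET tgt.PeriodKey = src.PeriodKey
             , tgt.Amount = src.Amount
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, PeriodKey, Amount)
    VALUES (src.Id, src.PeriodKey, src.Amount);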
Once again thank you for your valuable time and great feedback. Have a great weekend.
April 5, 2020 at 2:37 pm
fransiscuscandra wrote: Thank you for your feedback; I appreciate your valuable time, skills and knowledge. Do you agree that if we have more than 5 million rows in the source table, we should start using an incremental approach (insert/update)?
It's not so much the number of rows as the time it takes to refresh the data, and how frequently you need to do it.
If 5 million rows take 2 minutes to refresh, there is not much point in doing incremental - unless your window to refresh is 1 minute.
But if 10k rows take 30 minutes to refresh, then incremental is likely to be a better option.
In either case, if you do a once-a-week refresh and you have 10 hours to do it in, then even 30 minutes does not matter - so, as usual, "it depends" on your requirements whether you do incremental or full.
For example, on one of my servers I copy 12 million rows into 7 different tables once a week (the biggest table containing 3 million rows) - it takes 3 minutes to copy everything sequentially, so why would I bother implementing an incremental process?
April 5, 2020 at 3:37 pm
fransiscuscandra wrote: Thank you for your feedback; I appreciate your valuable time, skills and knowledge. Do you agree that if we have more than 5 million rows in the source table, we should start using an incremental approach (insert/update)?
It's not so much the number of rows as the time it takes to refresh the data, and how frequently you need to do it.
If 5 million rows take 2 minutes to refresh, there is not much point in doing incremental - unless your window to refresh is 1 minute.
But if 10k rows take 30 minutes to refresh, then incremental is likely to be a better option.
In either case, if you do a once-a-week refresh and you have 10 hours to do it in, then even 30 minutes does not matter - so, as usual, "it depends" on your requirements whether you do incremental or full.
For example, on one of my servers I copy 12 million rows into 7 different tables once a week (the biggest table containing 3 million rows) - it takes 3 minutes to copy everything sequentially, so why would I bother implementing an incremental process?
I couldn't say it any better. Totally agree. "It Depends" and "Must look eye". 😀
I will add that there are actually a couple of advantages to wholesale replacement of all data.
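One of those advantages is that you can load the fresh copy into a standby table and then simply repoint a SYNONYM at it, so the "switch" is nearly instantaneous and the previous copy is still sitting there if you need to fall back. A rough sketch of the idea, with made-up table names:

-- Load the fresh full copy into whichever table is currently offline.
TRUNCATE TABLE dbo.FactSales_B;

INSERT INTO dbo.FactSales_B (Id, PeriodKey, Amount)
SELECT Id, PeriodKey, Amount
FROM SourceDB.dbo.FactSales;

-- Repoint the synonym that all other code references; this takes almost no time.
DROP SYNONYM dbo.FactSales;
CREATE SYNONYM dbo.FactSales FOR dbo.FactSales_B;
-- Next run, load dbo.FactSales_A and point the synonym back at it.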
--Jeff Moden
Change is inevitable... Change for the better is not.
April 5, 2020 at 10:25 pm
Thank you, I got your point; it all depends on the time it takes to copy those tables. If it is less than 10 minutes, there is no need for an incremental process, just a straight copy.
April 5, 2020 at 10:27 pm
Thank you for your feedback. I will look into the SYNONYM method. Have a great day.