April 5, 2020 at 11:12 am
This might be a beginner developer question, but I need some clarification. What is the best / most commonly used practice for loading data into a database table from the beginning of time (for example, data going back to January 1900)? Do people pull and process all of the data at once, or do they pull data in windows of time (for example, one month or one year at a time)? Also, when data is modified, say in June 2000, do people reprocess everything from the beginning of time, or just reprocess June 2000? I really appreciate your response.
April 5, 2020 at 1:03 pm
If the number of rows in your source table is low ... in the tens of thousands, maybe ... a full refresh is simplest and cleanest.
As soon as the full refresh starts taking more time and resources than you are comfortable with, you can move to an incremental approach. For that, you will need some way of tracking changes to your source data. There are multiple ways of doing this and the one you choose will depend on your scenario. A fairly straightforward way is to make sure that all rows have CreatedAt and ModifiedAt (datetime) columns (and good indexing). Once this is in place, you can do a SELECT [cols] WHERE ModifiedAt >= [last run date] and MERGE the results into your target table.
Note that this method requires additional refinement if there is a need to handle deletions too.
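To make that concrete, here is a rough sketch of the pattern (the table, column and log names are just examples; adapt them to your schema):

DECLARE @LastRunDate datetime = (SELECT MAX(RunDate) FROM dbo.LoadLog); -- wherever you record your runs

MERGE dbo.TargetTable AS tgt
USING
(
    SELECT Id, Col1, Col2, ModifiedAt
    FROM dbo.SourceTable
    WHERE ModifiedAt >= @LastRunDate  -- only rows changed since the last run
) AS src
    ON tgt.Id = src.Id
WHEN MATCHED THEN
    UPDATE SET tgt.Col1 = src.Col1
             , tgt.Col2 = src.Col2
             , tgt.ModifiedAt = src.ModifiedAt
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Col1, Col2, ModifiedAt)
    VALUES (src.Id, src.Col1, src.Col2, src.ModifiedAt);
-- Deletions in the source are not picked up by this; a soft-delete flag or a separate comparison is needed for those.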
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
April 5, 2020 at 1:43 pm
Thank you for your feedback; I appreciate your valuable time, skills and knowledge. Do you agree that if we have more than 5 million rows in the source table, we should start using an incremental approach (insert/update)?
April 5, 2020 at 1:53 pm
fransiscuscandra wrote: Thank you for your feedback; I appreciate your valuable time, skills and knowledge. Do you agree that if we have more than 5 million rows in the source table, we should start using an incremental approach (insert/update)?
Yes, that seems like it is getting too large for a full refresh. How frequently do you want to run this process? Daily? More often?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
April 5, 2020 at 2:25 pm
The process is supposed to run daily or weekly. I agree with your approach of processing data using SELECT [cols] WHERE ModifiedAt >= [last run date]. The reason I posted this question was to get feedback from database practitioners on their approaches and to sanity-check my own.
I have built a process in a stored procedure that uses time logic, so I can pull data from the source table based on a certain time window, for example: ModifiedAt BETWEEN (YYYYMM - 3) AND (YYYYMM + 3), where YYYYMM is an integer. It then performs inserts and updates into the destination fact table based on those YYYYMM partitions, so the destination table can be updated or rebuilt incrementally.
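Roughly, the logic looks something like this (the table and column names below are just placeholders for my real schema, PeriodKey stands in for the integer YYYYMM value, and DATEADD is used so the +/- 3 month window survives year boundaries):

DECLARE @Period int = 200006;  -- the YYYYMM being (re)processed
DECLARE @PeriodDate date = DATEFROMPARTS(@Period / 100, @Period % 100, 1);
DECLARE @FromKey int = YEAR(DATEADD(MONTH, -3, @PeriodDate)) * 100 + MONTH(DATEADD(MONTH, -3, @PeriodDate));
DECLARE @ToKey   int = YEAR(DATEADD(MONTH,  3, @PeriodDate)) * 100 + MONTH(DATEADD(MONTH,  3, @PeriodDate));

MERGE dbo.FactTable AS tgt
USING
(
    SELECT Id, PeriodKey, Amount
    FROM dbo.SourceTable
    WHERE PeriodKey BETWEEN @FromKey AND @ToKey  -- only the YYYYMM window being rebuilt
) AS src
    ON tgt.Id = src.Id
WHEN MATCHED THEN
    UPDATE SET tgt.PeriodKey = src.PeriodKey
             , tgt.Amount = src.Amount
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, PeriodKey, Amount)
    VALUES (src.Id, src.PeriodKey, src.Amount);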
Once again thank you for your valuable time and great feedback. Have a great weekend.
April 5, 2020 at 2:37 pm
fransiscuscandra wrote: Thank you for your feedback; I appreciate your valuable time, skills and knowledge. Do you agree that if we have more than 5 million rows in the source table, we should start using an incremental approach (insert/update)?
It's not so much the number of rows as the time it takes to refresh the data, and how frequently you need to do it.
If 5 million rows take 2 minutes to refresh, there is not much point in doing incremental - unless your window to refresh is 1 minute.
But if 10k rows take 30 minutes to refresh, then incremental is likely to be a better option.
In either case, if you do a once-a-week refresh and you have 10 hours to do it in, then even 30 minutes does not matter - so, as usual, "it depends" on your requirements whether you do incremental or full.
For example, on one of my servers I copy 12 million rows into 7 different tables once a week (the biggest table containing 3 million rows) - it takes 3 minutes to copy everything sequentially, so why would I bother implementing an incremental process?
April 5, 2020 at 3:37 pm
fransiscuscandra wrote: Thank you for your feedback; I appreciate your valuable time, skills and knowledge. Do you agree that if we have more than 5 million rows in the source table, we should start using an incremental approach (insert/update)?
It's not so much the number of rows as the time it takes to refresh the data, and how frequently you need to do it.
If 5 million rows take 2 minutes to refresh, there is not much point in doing incremental - unless your window to refresh is 1 minute.
But if 10k rows take 30 minutes to refresh, then incremental is likely to be a better option.
In either case, if you do a once-a-week refresh and you have 10 hours to do it in, then even 30 minutes does not matter - so, as usual, "it depends" on your requirements whether you do incremental or full.
For example, on one of my servers I copy 12 million rows into 7 different tables once a week (the biggest table containing 3 million rows) - it takes 3 minutes to copy everything sequentially, so why would I bother implementing an incremental process?
I couldn't say it any better. Totally agree. "It Depends" and "Must look eye". 😀
I will add that there are actually a couple of advantages to wholesale replacement of all data.
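One of those advantages is that you can load the fresh copy into a standby table and then simply repoint a SYNONYM at it, so the "switch" is nearly instantaneous and the previous copy is still sitting there if you need to fall back. A rough sketch of the idea, with made-up table names:

-- Load the fresh full copy into whichever table is currently offline.
TRUNCATE TABLE dbo.FactSales_B;

INSERT INTO dbo.FactSales_B (Id, PeriodKey, Amount)
SELECT Id, PeriodKey, Amount
FROM SourceDB.dbo.FactSales;

-- Repoint the synonym that all other code references; this takes almost no time.
DROP SYNONYM dbo.FactSales;
CREATE SYNONYM dbo.FactSales FOR dbo.FactSales_B;
-- Next run, load dbo.FactSales_A and point the synonym back at it.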
--Jeff Moden
Change is inevitable... Change for the better is not.
April 5, 2020 at 10:25 pm
Thank you, I got your point; it all depends on the time it takes to copy those tables. If it is less than 10 minutes, there is no need for an incremental process, just a straight copy.
April 5, 2020 at 10:27 pm
Thank you for your feedback. I will look into the SYNONYM method. Have a great day.