Back when I was in school, a bunch of friends and I used to hang out at a diner. The diner was popular for its specials, which were displayed on a board outside the entrance.
The manager would roll in every day, wipe the board clean, and list all the specials for that day. A few new specials were introduced daily, but most stayed the same. Still, the manager would erase everything (including the specials that hadn't changed) and rewrite the whole board. He didn't mind spending a lot of time on this every day; he wanted all the specials displayed clearly.
On the other hand, the manager at a different diner would erase only the specials that were discontinued and add the new ones. He didn't mind that the specials weren't displayed as neatly; he only wanted to make the daily changes as quickly as possible.
Destructive Load
A destructive load, also known as a full load, is the process of completely deleting the existing data and reloading it from scratch. A lot of unchanged data is deleted and reloaded along the way, but a destructive load easily ensures the highest data integrity.
“Delete destination data. Read data from source. Load into destination.”
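That three-step process can be sketched in a few lines of Python. This is a minimal illustration, not production ETL: the in-memory lists standing in for source and destination tables are hypothetical stand-ins for real database tables.

```python
def full_load(source_rows, destination):
    """Destructive (full) load: wipe the destination, then reload everything."""
    destination.clear()               # delete all existing destination data
    destination.extend(source_rows)   # read from source, load into destination
    return destination

# Hypothetical in-memory "tables"
source = [{"id": 1, "special": "Pancakes"}, {"id": 2, "special": "Omelette"}]
destination = [{"id": 1, "special": "Pancakes"}]  # stale copy from yesterday

full_load(source, destination)
# The destination now mirrors the source exactly, unchanged rows included.
```

Note that the unchanged row for `id` 1 is deleted and rewritten too, just like the unchanged specials on the diner's board.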
Incremental Load
Incremental load is the process of loading only new and changed data into the destination. Data that didn't change is left alone. Data integrity can be ensured in this process too, but the ETL can get complicated.
“Read source data. Compare with destination. Filter for new and changed data. Write to destination.” I stole this line from Anatomy of an Incremental Load by Andy Leonard (b|t).
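The read–compare–filter–write steps above can be sketched the same way. Again, this is a simplified illustration assuming each row carries an `id` key we can match on; real ETL would compare against the destination table with keys, hashes, or change-tracking columns.

```python
def incremental_load(source_rows, destination):
    """Incremental load: insert new rows, update changed rows, skip the rest."""
    # Index the destination by key so comparison is cheap
    dest_by_id = {row["id"]: row for row in destination}
    for src in source_rows:
        dest = dest_by_id.get(src["id"])
        if dest is None:
            destination.append(dict(src))   # new row: insert
        elif dest != src:
            dest.update(src)                # changed row: update in place
        # unchanged rows are left untouched
    return destination

# Hypothetical in-memory "tables"
source = [{"id": 1, "special": "Pancakes"}, {"id": 2, "special": "Waffles"}]
destination = [{"id": 1, "special": "Toast"}]  # id 1 changed, id 2 is new

incremental_load(source, destination)
# Only the changed row (id 1) and the new row (id 2) were written.
```

The extra comparison logic is exactly where incremental ETL "can get complicated": detecting deletes, late-arriving changes, and schema drift all add work that a full load never has to do.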
Why Incremental?
A full load seems to be the easiest approach, right? Then why do we even care about incremental load?
- Speed. A full load on larger datasets takes a great deal of time and other server resources. Ideally, all data loads run overnight, with the expectation that they complete before users see the data the next day. The overnight window may not be enough for a full load to finish.
- Preserving history. When dealing with an OLTP source that is not designed to keep history, a full load removes history from the destination as well, since it deletes all the records first. So a full load will not let you preserve history in the data warehouse.
Full Load vs. Incremental Load:

| | Destructive Load | Incremental Load |
| --- | --- | --- |
| How it works | Deletes all rows and reloads from scratch. | Processes only new or updated rows. |
| Time | Requires more time. | Requires less time. |
| Data integrity | Easily guaranteed. | Difficult; the ETL must check for new/updated rows. |
| History | Can be lost. | Retained. |
Conclusion
A full load is the easiest way to load daily, but it consumes a lot of time and other server resources. An incremental load processes only the new or changed data, so when time is of the essence with larger datasets, incremental load is the way to go.
Next up…
In my next post, I'll continue with scripts to perform full and incremental loads.