August 19, 2011 at 3:04 pm
We have a transactional database with every table containing create_date and update_date columns.
We designed a data warehouse and developed ETL process (initial and incremental).
Is it a valid concern to use create_date/ update_date for incremental ETL process or a better approach to simply load all records new/ modified by comparing to existing DW?
Are there any advantages/ downsides to either strategy?
Thanks Alexander.
August 22, 2011 at 1:09 pm
Shurkadze (8/19/2011)
We have a transactional database with every table containing create_date and update_date columns.We designed a data warehouse and developed ETL process (initial and incremental).
Is it a valid concern to use create_date/ update_date for incremental ETL process or a better approach to simply load all records new/ modified by comparing to existing DW?
Are there any advantages/ downsides to either strategy?
After initial load ETL should be incremental.
By design DWH system stores a wider range of data than OLTP systems do therefore not all the data is available on the OLTP system.
The secret is how to identify what has to be moved from OLTP to DWH staging tables, your row created/updated timestamp is a very valid and reliable strategy for extracting new data from OLTP system.
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.August 22, 2011 at 1:15 pm
PaulB-TheOneAndOnly (8/22/2011)
Shurkadze (8/19/2011)
We have a transactional database with every table containing create_date and update_date columns.We designed a data warehouse and developed ETL process (initial and incremental).
Is it a valid concern to use create_date/ update_date for incremental ETL process or a better approach to simply load all records new/ modified by comparing to existing DW?
Are there any advantages/ downsides to either strategy?
After initial load ETL should be incremental.
By design DWH system stores a wider range of data than OLTP systems do therefore not all the data is available on the OLTP system.
The secret is how to identify what has to be moved from OLTP to DWH staging tables, your row created/updated timestamp is a very valid and reliable strategy for extracting new data from OLTP system.
What if you have a row modified in OLTP with a column value changed that is not mapped to DW. The data then should not be moved to Data Warehouse? Correct? What is the use of timestamp then?
August 22, 2011 at 2:20 pm
Shurkadze (8/22/2011)
What if you have a row modified in OLTP with a column value changed that is not mapped to DW. The data then should not be moved to Data Warehouse? Correct? What is the use of timestamp then?
how about anticipated growth? it may not map to the data warehouse today, but a future upgrade or functionality may require it?
Lowell
August 22, 2011 at 2:57 pm
Lowell (8/22/2011)
Shurkadze (8/22/2011)
What if you have a row modified in OLTP with a column value changed that is not mapped to DW. The data then should not be moved to Data Warehouse? Correct? What is the use of timestamp then?how about anticipated growth? it may not map to the data warehouse today, but a future upgrade or functionality may require it?
The DW contains small subset of columns of OLTP. Relying on timestamp may not be meaningful since only a small subset of columns changed should trigger load into DW.
Why not simply load all records from OLTP not in DW regardless of values of timestamp? What is the downside of that?
August 22, 2011 at 3:36 pm
Shurkadze (8/22/2011)
The DW contains small subset of columns of OLTP. Relying on timestamp may not be meaningful since only a small subset of columns changed should trigger load into DW.Why not simply load all records from OLTP not in DW regardless of values of timestamp? What is the downside of that?
Okay, let's say you have a "customers" table on OLTP side and "dim_customers" on DWH side - is that a fair scenario?
An old customer, Acme Corp moves from White Plains, NY to Encino, CA so OLTP system alters some columns on that particular row.
Acme Corp row is in both the OLTP system and the DWH system... so according to your theory changes are not moved! Even if this is clearly a SCD of the Type 2 change.
Use your timestamp so sort out DWH ETLable candidates - you will sleep better.
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.August 22, 2011 at 4:04 pm
PaulB-TheOneAndOnly (8/22/2011)
Shurkadze (8/22/2011)
The DW contains small subset of columns of OLTP. Relying on timestamp may not be meaningful since only a small subset of columns changed should trigger load into DW.Why not simply load all records from OLTP not in DW regardless of values of timestamp? What is the downside of that?
Okay, let's say you have a "customers" table on OLTP side and "dim_customers" on DWH side - is that a fair scenario?
An old customer, Acme Corp moves from White Plains, NY to Encino, CA so OLTP system alters some columns on that particular row.
Acme Corp row is in both the OLTP system and the DWH system... so according to your theory changes are not moved! Even if this is clearly a SCD of the Type 2 change.
Use your timestamp so sort out DWH ETLable candidates - you will sleep better.
I will then find this record by comparing data in OLTP to one in DW and handle it accordingly in DW.
August 22, 2011 at 4:36 pm
Shurkadze (8/22/2011)
PaulB-TheOneAndOnly (8/22/2011)
Shurkadze (8/22/2011)
The DW contains small subset of columns of OLTP. Relying on timestamp may not be meaningful since only a small subset of columns changed should trigger load into DW.Why not simply load all records from OLTP not in DW regardless of values of timestamp? What is the downside of that?
Okay, let's say you have a "customers" table on OLTP side and "dim_customers" on DWH side - is that a fair scenario?
An old customer, Acme Corp moves from White Plains, NY to Encino, CA so OLTP system alters some columns on that particular row.
Acme Corp row is in both the OLTP system and the DWH system... so according to your theory changes are not moved! Even if this is clearly a SCD of the Type 2 change.
Use your timestamp so sort out DWH ETLable candidates - you will sleep better.
I will then find this record by comparing data in OLTP to one in DW and handle it accordingly in DW.
do you mean... comparing row-by-row, column-by-column? :w00t:
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.August 22, 2011 at 4:49 pm
PaulB-TheOneAndOnly (8/22/2011)
Shurkadze (8/22/2011)
PaulB-TheOneAndOnly (8/22/2011)
Shurkadze (8/22/2011)
The DW contains small subset of columns of OLTP. Relying on timestamp may not be meaningful since only a small subset of columns changed should trigger load into DW.Why not simply load all records from OLTP not in DW regardless of values of timestamp? What is the downside of that?
Okay, let's say you have a "customers" table on OLTP side and "dim_customers" on DWH side - is that a fair scenario?
An old customer, Acme Corp moves from White Plains, NY to Encino, CA so OLTP system alters some columns on that particular row.
Acme Corp row is in both the OLTP system and the DWH system... so according to your theory changes are not moved! Even if this is clearly a SCD of the Type 2 change.
Use your timestamp so sort out DWH ETLable candidates - you will sleep better.
I will then find this record by comparing data in OLTP to one in DW and handle it accordingly in DW.
do you mean... comparing row-by-row, column-by-column? :w00t:
You can use
SELECT *
INTO #TEMP
FROM
(SELECT
col1,
col2,
col3
FROM
OLTP_TABLE
EXCEPT
SELECT
col1,
col2,
col3
FROM
DW_Dim_TABLE) a
This will create temp table with results of differences.
August 23, 2011 at 7:02 pm
Shurkadze (8/22/2011)
PaulB-TheOneAndOnly (8/22/2011)
Shurkadze (8/22/2011)
PaulB-TheOneAndOnly (8/22/2011)
Shurkadze (8/22/2011)
The DW contains small subset of columns of OLTP. Relying on timestamp may not be meaningful since only a small subset of columns changed should trigger load into DW.Why not simply load all records from OLTP not in DW regardless of values of timestamp? What is the downside of that?
Okay, let's say you have a "customers" table on OLTP side and "dim_customers" on DWH side - is that a fair scenario?
An old customer, Acme Corp moves from White Plains, NY to Encino, CA so OLTP system alters some columns on that particular row.
Acme Corp row is in both the OLTP system and the DWH system... so according to your theory changes are not moved! Even if this is clearly a SCD of the Type 2 change.
Use your timestamp so sort out DWH ETLable candidates - you will sleep better.
I will then find this record by comparing data in OLTP to one in DW and handle it accordingly in DW.
do you mean... comparing row-by-row, column-by-column? :w00t:
You can use
SELECT *
INTO #TEMP
FROM
(SELECT
col1,
col2,
col3
FROM
OLTP_TABLE
EXCEPT
SELECT
col1,
col2,
col3
FROM
DW_Dim_TABLE) a
This will create temp table with results of differences.
I've several DIM tables over 500 millons rows in size - do you really think this has the ability to perform in a daily basis?
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.August 23, 2011 at 7:16 pm
PaulB-TheOneAndOnly (8/23/2011)
Shurkadze (8/22/2011)
PaulB-TheOneAndOnly (8/22/2011)
Shurkadze (8/22/2011)
PaulB-TheOneAndOnly (8/22/2011)
Shurkadze (8/22/2011)
The DW contains small subset of columns of OLTP. Relying on timestamp may not be meaningful since only a small subset of columns changed should trigger load into DW.Why not simply load all records from OLTP not in DW regardless of values of timestamp? What is the downside of that?
Okay, let's say you have a "customers" table on OLTP side and "dim_customers" on DWH side - is that a fair scenario?
An old customer, Acme Corp moves from White Plains, NY to Encino, CA so OLTP system alters some columns on that particular row.
Acme Corp row is in both the OLTP system and the DWH system... so according to your theory changes are not moved! Even if this is clearly a SCD of the Type 2 change.
Use your timestamp so sort out DWH ETLable candidates - you will sleep better.
I will then find this record by comparing data in OLTP to one in DW and handle it accordingly in DW.
do you mean... comparing row-by-row, column-by-column? :w00t:
You can use
SELECT *
INTO #TEMP
FROM
(SELECT
col1,
col2,
col3
FROM
OLTP_TABLE
EXCEPT
SELECT
col1,
col2,
col3
FROM
DW_Dim_TABLE) a
This will create temp table with results of differences.
So you have dimension tables with 500M+ records? Or is this a hypothetical question?
August 24, 2011 at 9:06 am
Would it be possible to do a checksum on the fields you're interested in on both source and DW; if they're different, then you need to do an update or add row for your slowly changing dimension? You could even store the checksum on the DW if that's faster than recalculating. Would that be performant enough for your scenario?
September 2, 2011 at 8:03 am
Shurkadze (8/23/2011)So you have dimension tables with 500M+ records? Or is this a hypothetical question?
Nothing hypothetical about it - actually one of the DIM tables holds 1.2 billion rows.
How do you think your proposed solution will scale in a medium size environment like the one I'm describing?
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.September 2, 2011 at 8:05 am
rgtft (8/24/2011)
Would it be possible to do a checksum on the fields you're interested in on both source and DW; if they're different, then you need to do an update or add row for your slowly changing dimension? You could even store the checksum on the DW if that's faster than recalculating. Would that be performant enough for your scenario?
Why just don't use PK as everybody does? run incremental ETL processes staging only newly added/updated rows?
What's the point on trying to complicate a simple process like this one?
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.Viewing 14 posts - 1 through 13 (of 13 total)
You must be logged in to reply to this topic. Login to reply