ETL separate database

Question

ETL separate database

RVO

SSCertifiable

Points: 5495
More actions
October 30, 2016 at 5:42 pm

#331691

I need to improve ETL.
Currently it's hard to troubleshoot missing files, bad data issues.
Very bad Logging, notification.
I want to create separate "ETLtools" database where I would load staging data and
do validation, transformations. I can make it Simple Recovery Model.
No need to write to transaction log staging data every day.
The issue is the current ETL table "Batch" sits in the main application "Scene" database and it
has Foreign Keys in lots of main tables like "Point", "Account".
Batch.BatchID -->Point.BatchID
Batch.BatchID -->Account.BatchID
So if I move Batch table to ETLtools databse, I cannot keep these Foreign Keys.
Batch table has columns like:
FileName
FileSize
Header
Trailer
BatchDt
My plan is to remove these Foreign Keys.
But I need to justify this decision.
My argument is:
We can still insert BatchID so in case we need to find which Point was inserted from which File we can do it.
But without Foreign Key constraint.
Does it make sense?

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004362 More actions · Answer 1

RVO (10/30/2016)
I need to improve ETL.
Currently it's hard to troubleshoot missing files, bad data issues.
Very bad Logging, notification.
I want to create separate "ETLtools" database where I would load staging data and
do validation, transformations. I can make it Simple Recovery Model.
No need to write to transaction log staging data every day.
The issue is the current ETL table "Batch"
has Foreign Keys in lots of main tables
Batch.BatchID -->Point.BatchID
Batch.BatchID -->Account.BatchID
So if I move Batch table to ETLtools databse, I cannot keep these Foreign Keys.
Batch table has columns like:
FileName
FileSize
Header
Trailer
BatchDt
My plan is to remove these Foreign Keys.
But I need to justify this decision.
My argument is:
We can still insert BatchID so in case we need to find which Point was inserted from which File we can do it.
But without Foreign Key constraint.
Does it make sense?

No. Unless the data is ultimately exposed to the FK's, it doesn't make sense. The FK's are the to guarantee referential integrity, which is another name for "the proper data".

Now, if you do ultimately insert from a staging table into final tables that do have referential integrity to each other and you have validated the data in the staging table, THEN it makes perfect sense to have a staging DATABASE that's in the SIMPLE Recovery Model so that you can enjoy such things as high performance "Minimal Logging" during the load of the staging table(s). As you probably already know, you can't have FKs across databases so all such validation will nee to be done by your validation code prior to firing it all into the final tables.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

RVO SSCertifiable Points: 5495 More actions · Answer 2

Thanks Jeff.

I'm confused.

You're saying it's a good idea to have staging tables in a separate database

but then what do I do with Foreign Keys?

Flow is the following:

1. load into one column staging1 table

2. using SUBSTRING() - parse the column and load into another staging2 table with real columns

3. Do Validation (UPDATE invalid, error codes, etc.)

4. Load into final table from staging2 table

ZZartin SSC-Dedicated Points: 30920 More actions · Answer 3

RVO (10/31/2016)
Thanks Jeff.
I'm confused.
You're saying it's a good idea to have staging tables in a separate database
but then what do I do with Foreign Keys?
Flow is the following:
1. load into one column staging1 table
2. using SUBSTRING() - parse the column and load into another staging2 table with real columns
3. Do Validation (UPDATE invalid, error codes, etc.)
4. Load into final table from staging2 table

You wouldn't have an foreign keys on your staging tables. Any constraints would need to be enforced as part of your ETL process, the final tables could have foreign keys on them if you're finding issues with the ETL process and ending up with orphaned records.

RVO SSCertifiable Points: 5495 More actions · Answer 4

Maybe my English is not good enough...

Is this the setup you're proposing?

Move Batch table to ETLtools database.

Load file into ETLtools database.

Create BatchID (Primary Key) in Batch table.

Staging1 table - no FK

Staging2 table - no FK

Main "Scene" database:

FinalTable - insert BatchID but no FK

Jeff Moden SSC Guru Points: 1004362 More actions · Answer 5

RVO (10/31/2016)
Thanks Jeff.
I'm confused.
You're saying it's a good idea to have staging tables in a separate database
but then what do I do with Foreign Keys?
Flow is the following:
1. load into one column staging1 table
2. using SUBSTRING() - parse the column and load into another staging2 table with real columns
3. Do Validation (UPDATE invalid, error codes, etc.)
4. Load into final table from staging2 table

Basically BUT step 2 should be unnecessary. Whatever you're using to do the import should do the parsing of data. For example, if you're using BULK INSERT to do the import and the table has a "fixed field format", you should have a "BCP format file" that lays out what the rows should look like and it will run like the wind.

You can also have synonyms in your ETL database that point back to the final database so that you can do validations against any reference tables that you have. Same goes for the final tables that you want to insert into. The rule of thumb here is, except within the definition of the synonyms, to stick to a 2 part naming convention in case you have to move databases or even if you want to use the same ETL database to act as a loading point for more than 1 database.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)