ETL separate database

  • I need to improve ETL.

    Currently it's hard to troubleshoot missing files, bad data issues.

    Very bad Logging, notification.

    I want to create separate "ETLtools" database where I would load staging data and

    do validation, transformations. I can make it Simple Recovery Model.

    No need to write to transaction log staging data every day.

    The issue is the current ETL table "Batch" sits in the main application "Scene" database and it

    has Foreign Keys in lots of main tables like "Point", "Account".

    Batch.BatchID -->Point.BatchID

    Batch.BatchID -->Account.BatchID

    So if I move Batch table to ETLtools databse, I cannot keep these Foreign Keys.

    Batch table has columns like:

    FileName

    FileSize

    Header

    Trailer

    BatchDt

    My plan is to remove these Foreign Keys.

    But I need to justify this decision.

    My argument is:

    We can still insert BatchID so in case we need to find which Point was inserted from which File we can do it.

    But without Foreign Key constraint.

    Does it make sense?

  • RVO (10/30/2016)


    I need to improve ETL.

    Currently it's hard to troubleshoot missing files, bad data issues.

    Very bad Logging, notification.

    I want to create separate "ETLtools" database where I would load staging data and

    do validation, transformations. I can make it Simple Recovery Model.

    No need to write to transaction log staging data every day.

    The issue is the current ETL table "Batch"

    has Foreign Keys in lots of main tables

    Batch.BatchID -->Point.BatchID

    Batch.BatchID -->Account.BatchID

    So if I move Batch table to ETLtools databse, I cannot keep these Foreign Keys.

    Batch table has columns like:

    FileName

    FileSize

    Header

    Trailer

    BatchDt

    My plan is to remove these Foreign Keys.

    But I need to justify this decision.

    My argument is:

    We can still insert BatchID so in case we need to find which Point was inserted from which File we can do it.

    But without Foreign Key constraint.

    Does it make sense?

    No. Unless the data is ultimately exposed to the FK's, it doesn't make sense. The FK's are the to guarantee referential integrity, which is another name for "the proper data".

    Now, if you do ultimately insert from a staging table into final tables that do have referential integrity to each other and you have validated the data in the staging table, THEN it makes perfect sense to have a staging DATABASE that's in the SIMPLE Recovery Model so that you can enjoy such things as high performance "Minimal Logging" during the load of the staging table(s). As you probably already know, you can't have FKs across databases so all such validation will nee to be done by your validation code prior to firing it all into the final tables.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks Jeff.

    I'm confused.

    You're saying it's a good idea to have staging tables in a separate database

    but then what do I do with Foreign Keys?

    Flow is the following:

    1. load into one column staging1 table

    2. using SUBSTRING() - parse the column and load into another staging2 table with real columns

    3. Do Validation (UPDATE invalid, error codes, etc.)

    4. Load into final table from staging2 table

  • RVO (10/31/2016)


    Thanks Jeff.

    I'm confused.

    You're saying it's a good idea to have staging tables in a separate database

    but then what do I do with Foreign Keys?

    Flow is the following:

    1. load into one column staging1 table

    2. using SUBSTRING() - parse the column and load into another staging2 table with real columns

    3. Do Validation (UPDATE invalid, error codes, etc.)

    4. Load into final table from staging2 table

    You wouldn't have an foreign keys on your staging tables. Any constraints would need to be enforced as part of your ETL process, the final tables could have foreign keys on them if you're finding issues with the ETL process and ending up with orphaned records.

  • Maybe my English is not good enough...

    Is this the setup you're proposing?

    Move Batch table to ETLtools database.

    Load file into ETLtools database.

    Create BatchID (Primary Key) in Batch table.

    Staging1 table - no FK

    Staging2 table - no FK

    Main "Scene" database:

    FinalTable - insert BatchID but no FK

  • RVO (10/31/2016)


    Thanks Jeff.

    I'm confused.

    You're saying it's a good idea to have staging tables in a separate database

    but then what do I do with Foreign Keys?

    Flow is the following:

    1. load into one column staging1 table

    2. using SUBSTRING() - parse the column and load into another staging2 table with real columns

    3. Do Validation (UPDATE invalid, error codes, etc.)

    4. Load into final table from staging2 table

    Basically BUT step 2 should be unnecessary. Whatever you're using to do the import should do the parsing of data. For example, if you're using BULK INSERT to do the import and the table has a "fixed field format", you should have a "BCP format file" that lays out what the rows should look like and it will run like the wind.

    You can also have synonyms in your ETL database that point back to the final database so that you can do validations against any reference tables that you have. Same goes for the final tables that you want to insert into. The rule of thumb here is, except within the definition of the synonyms, to stick to a 2 part naming convention in case you have to move databases or even if you want to use the same ETL database to act as a loading point for more than 1 database.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply