July 1, 2011 at 3:13 pm
Hello all, I'm coming into an ETL Project very late and questioning some design practices. We're pulling data from many systems and warehousing them. I've summarized the steps below.
Step 1. Pull all system tables raw into Database 1 (ODS).
Step 2. Move all ODS tables raw into Database 2 (Staging).
Step 3. Perform transformations/aggregations within Database 2 (Staging).
Step 4. Truncate the Production tables in Database 3 (Prod).
Step 5. Move all pre-aggregated/transformed data from Staging tables in Database 2 (Staging) to Database 3 (Prod).
The one big piece I'm questioning is that the designer NEVER executes a cross-database query, as a rule. Is there any performance hit regarding cross-database queries that I'm unaware of? I know that linked-server queries are bad, but this is all on the same server, and it's executed at night when little other activity is taking place. Can someone explain that logic to me? Is there a good article on the pros/cons (or lack thereof) of cross-database queries?
Additionally, I'd appreciate it if anyone could give me some advice on the overall strategy of this ETL.
Cross Database query example below.
[font="Courier New"]USE STAGING
SELECT table1.column_1, table1.column_2, table2.column_1
FROM table1
JOIN ods.dbo.table2.[/font]
July 1, 2011 at 3:27 pm
Bad_Karma (7/1/2011)
The one big piece I'm questioning is that the designer NEVER executes a cross database query as a rule.
Have you asked the designer why that rule is in place?
Maybe there is no need at all for cross-database queries.
The user community should get the information they need by issuing queries against the FACT/DIM tables (guessing star-schema modeling here).
ODS tables should be out of reach for the user community.
Finally, I'm pretty sure there is interaction between the Staging and FACT/DIM tables during the final phase of the ETL process.
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge, but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a non-production test environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
July 1, 2011 at 4:09 pm
Clarification: the user community only interacts with the Prod database. That's not where my concern lies; I'm a true believer in that aspect. What I'm concerned about is the ETL process itself being inefficient.
The designer isn't a very big SQL guy; he was given instruction by a consultant we no longer have contact with, so he really doesn't have a reason. The reason I take issue with this is that, to get around not doing cross-database queries, he's having to move multi-million-record tables from the ODS to the Staging database, and THEN run a sproc that merges the tables and stores the result in the Staging DB.
These moves from ODS to Staging are extremely time-consuming and, unless someone can tell me otherwise, pointless.
Again, I understand the value and elegance of not having cross-database queries in your application/reports. But in my ETL process?
One more example to illustrate how extreme this is: I built a separate ETL that does an incremental fetch (rather than a full truncate and refresh).
Once my staging table is ready to be moved to production, I do an EXISTS check against the Production table so that I only insert new records.
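In rough terms the step looks something like this (the table and column names here are made up for illustration, not our real schema):

-- Insert only the staging rows that don't already exist in Production.
-- All object and column names below are illustrative.
INSERT INTO Prod.dbo.FactSales (SaleID, SaleDate, Amount)
SELECT stg.SaleID, stg.SaleDate, stg.Amount
FROM   Staging.dbo.FactSales_Stage AS stg
WHERE  NOT EXISTS (SELECT 1
                   FROM  Prod.dbo.FactSales AS prd
                   WHERE prd.SaleID = stg.SaleID);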
The designer has an issue with this, saying that I should keep another complete copy of the production table in Staging and do the EXISTS against that copy to figure out which records are new.
In a world of infinite resources and zero time constraints I really wouldn't care, but in reality I now have 3 identical tables and 2 separate insert operations where I should have 2 tables and 1 insert.
July 2, 2011 at 11:40 am
Bad_Karma (7/1/2011)
The reason I take issue with this is that, to get around not doing cross-database queries, he's having to move multi-million-record tables from the ODS to the Staging database, and THEN run a sproc that merges the tables and stores the result in the Staging DB.
I'm with you on this one. It doesn't make sense to copy an ODS table to the STG database just to avoid a cross database query.
Does the design also call for moving STG tables to Production to complete the load?
I'm with the designer in that (in general) queries issued by the user community should exclusively target the Production, star-schema database, but when it comes to ETL, the cheapest way to do it on such a multi-database platform is to allow cross-database processes.
You have my vote; moving ODS and/or STG tables around just to avoid cross-database processes makes no sense at all.
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge, but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a non-production test environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
July 2, 2011 at 12:15 pm
The only justification I could see for avoiding cross-database calls is to leave the option open to scale out at some point down the line by moving the Staging or ODS databases onto separate instances/servers. I agree with Paul; it may be as simple as seeing if the designer will share the reasoning.
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
July 2, 2011 at 9:22 pm
If it were Cross-Server calls, then I'd be mighty concerned. 🙂 But for Cross-Database calls, there's no concern about performance issues.
First, you posted a query for a Cross Database call as follows:
Cross Database query example below.
USE STAGING
SELECT table1.column_1, table1.column_2, table2.column_1
FROM table1
JOIN ods.dbo.table2
Please don't take this as a personal attack. I'm trying to help. I can understand why a designer might have some problems with that code. First, you used the 3 part naming convention for table2 instead of creating a Synonym and then using the 2 part naming convention. Second, you used the tablename.columnname syntax (which has been deprecated, by the way) in the SELECT list instead of using the generally accepted best practice of using the tablealias.columnname syntax. Third, you only used the one part naming convention for table1, and fourth, you didn't use any column aliases to resolve the non-unique naming of column_1 from two separate tables. Yes, I do understand that it's only an example bit of code but you did offer it as a possible solution. 😉
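Put together, those suggestions might look something like this (the synonym name and the join condition are just placeholders, since the original example didn't include them):

-- Run once in the Staging database: the synonym hides the cross-database name.
CREATE SYNONYM dbo.Ods_Table2 FOR ods.dbo.table2;
GO
-- Two-part names, table aliases, and column aliases throughout.
SELECT t1.column_1 AS Table1_Column1,
       t1.column_2 AS Table1_Column2,
       t2.column_1 AS Table2_Column1
FROM   dbo.table1     AS t1
JOIN   dbo.Ods_Table2 AS t2
       ON t2.SomeKey = t1.SomeKey;  -- join condition made up for illustration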
Onward...
If the data being put into the ODS database lives on disk as a "raw" file somewhere else, I see no purpose for the ODS database. If the data is downloaded directly to a database using something like SSIS, then I can see having the ODS database as the sole source of original data. Copies of unadulterated original data should always be kept (especially for SOX/SEC compliance and the sake of legal breadcrumbs in general) and the ODS database would serve well as such a vehicle especially when a valid backup and maintenance plan is implemented.
Since the original data should never be altered in any way for legal reasons, there's a staging database to do data cleansing, dupe checking, some validation, transformations, aggregations, etc, etc. The real reason behind doing this in a separate database is simple... performance. You can map separate spindles for the database so that you don't create an IO problem for production.
After that, I have a real problem with the steps you identified the designer as having defined...
[font="Arial Black"]Step 3. Truncate the Production tables in Database 3 (Prod).
Step 4. Move all pre-aggregated/transformed data from Staging tables in Database 2 (Staging) to Database 3 (Prod).[/font]
I would never, ever do a "truncate, then move". What happens to production if the move fails? BOOM! Production is dead! Sure, the TRUNCATE could be included in the transaction but that's not going to fix how long it will take if a rollback does happen to occur.
Instead, I'd do a "copy" as a different table name (between two known table names such as ImportantTable1 and ImportantTable2). Let's say that ImportantTable1 was currently the table "in service". It wouldn't/shouldn't be in-service directly. Instead, there should be a Synonym or Pass-Through View named ImportantTable and that should be pointed at ImportantTable1. Now it's time to do the "replacement update" the designer wants. No problem. Truncate ImportantTable2 (which is not in-service) and populate it. If the population fails, no problem. Just leave ImportantTable1 in-service. If the population succeeds, just repoint the Synonym or Pass-Through View to ImportantTable2 and Bob's your Uncle. ImportantTable1 will serve as an online backup of previous data if a differential question comes up or there's a problem. Next update, the process switches. ImportantTable1 is truncated and populated behind the scenes and then the Synonym or Pass-Through View is repointed to it in scant milliseconds. The users won't even notice.
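In rough T-SQL, the "flip" might look something like this (the object names follow the example above, and the actual INSERT would be whatever your transformations produce):

-- dbo.ImportantTable is a pass-through view currently pointing at dbo.ImportantTable1.
-- Rebuild the out-of-service copy behind the scenes.
TRUNCATE TABLE dbo.ImportantTable2;

INSERT INTO dbo.ImportantTable2 (column_1, column_2)
SELECT column_1, column_2
FROM   Staging.dbo.ImportantTable_Stage;  -- illustrative source
GO
-- Only if the load succeeded: repoint the view.  ALTER VIEW must start its own batch.
ALTER VIEW dbo.ImportantTable
AS
 SELECT column_1, column_2
 FROM   dbo.ImportantTable2;
GO

Next time around, the same script runs with the table names swapped.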
Why go through all of that instead of doing a "simple upsert"? Good question. It may be because the designer simply doesn't want to take the chance of the data getting out of sync due to a rollback or anything else that may happen. The designer may be trying to guarantee an "all or nothing" update. If that's the case, then the use of "flip-flop" dual target tables and Synonyms or Pass-Through Views are just what the doctor ordered.
There's another side benefit to the "flip-flop" table method... nearly zero interference with the users. The row, page, extent, and maybe table lock during a "simple upsert" may simply take more time and cause more interference with the users than the designer wants to allow (although I'd certainly question that motive given that the designer specified a "truncate, then populate" methodology :sick:).
Yet another side benefit which will also be good for performance is that, if you carefully plan for it, you may be able to populate the table using minimal-logging techniques. Do keep in mind that, unlike when done in the same database, SELECT/INTO WILL lock some of the system tables in the "remote" source database when going across databases. You'll need to experiment a bit to see what happens if you decide to go with this option.
Still, done correctly, a well written "upsert" with the appropriate checks and balances for the Insert, Update, and Delete portions of the merge may be nearly as unobtrusive to the users, depending on the volume of the changes (of course). But all 3 would have to be rolled back (easy to do if using MERGE) if anything went wrong to keep things in sync, and that would be terribly expensive and cause a huge amount of interference with the users. If the raw data has all of the data required, it's far easier to just use all of the raw data rather than trying to keep things in sync.
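For what it's worth, the skeleton of such a MERGE might look like this (the key and column names are illustrative, and whether you even want the DELETE branch depends on the requirements):

-- Upsert the staging data into the live table in a single atomic statement.
MERGE Prod.dbo.ImportantTable AS tgt
USING Staging.dbo.ImportantTable_Stage AS src
      ON tgt.id = src.id
WHEN MATCHED AND (tgt.column_1 <> src.column_1 OR tgt.column_2 <> src.column_2)
     THEN UPDATE SET column_1 = src.column_1,
                     column_2 = src.column_2
WHEN NOT MATCHED BY TARGET
     THEN INSERT (id, column_1, column_2)
          VALUES (src.id, src.column_1, src.column_2)
WHEN NOT MATCHED BY SOURCE
     THEN DELETE;  -- removes target rows that no longer exist in the source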
The real surprise for the designer is going to be that, no matter what you do, some form of cross-database work must be done to accomplish the moves.
I suggest you and the designer get together and discuss the alternative I've given to the "truncate, then populate" method the designer specified. The jobs you save may be both of your own. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
July 3, 2011 at 1:37 pm
Thanks for all of the responses. And Jeff, no offense taken; I normally do everything you suggested, I'm just not involved enough in the SQL community to know whether all of that is kosher when posting imaginary code. Consider me "educated" 😛
You've definitely given me some good suggestions that I need to consider. That's the first time I've heard of the flip-flop methodology, for sure. I'm curious, though; let me recap to ensure I've understood correctly. Please correct me if I've misunderstood something.
- ImportantTable1 and ImportantTable2 are identical, but are updated on alternating days.
- The users never directly interact with either table, but instead interact with a view named ImportantView.
- After successfully updating ImportantTable1 or ImportantTable2, I execute an ALTER VIEW statement against the view, redirecting it to the most current table.
And if I'm on the Enterprise edition of SQL Server, I can index said view and never really be concerned about performance degradation.
How did I do?
July 3, 2011 at 4:46 pm
Thanks for the feedback.
You did perfect except for one thing... it's a "Pass-Through View"... there is absolutely no need to index the view, because any query you run against the view will use the indexes on the underlying table just as if you were selecting from the table itself. Try it and see.
The other thing to consider is to use a Synonym instead of a view. They're much easier to write/maintain and don't require special updates if the underlying table has columns added or deleted.
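Repointing a synonym is just a drop and re-create, which can be wrapped in a short transaction, something like this (names follow the earlier example):

-- Swap the synonym from ImportantTable1 to ImportantTable2.
BEGIN TRANSACTION;
    DROP SYNONYM dbo.ImportantTable;
    CREATE SYNONYM dbo.ImportantTable FOR dbo.ImportantTable2;
COMMIT TRANSACTION;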
--Jeff Moden
Change is inevitable... Change for the better is not.