November 18, 2010 at 1:20 pm
I am integrating 3 different database sources into an ODS. The ODS is going to have exactly the same tables and columns as the source systems. I just tried to load one sample table (3 columns) with 2 million records, just to check the system. It took around 30 minutes; I don't want to go into asking you guys how to optimize it.
My plan is to create a schema and the table structure for all the sources in the ODS in the first week, and once the structure is ready I will start loading data from one database source at a time.
Q1) Is there a golden rule or method to follow while creating the table structure, or should I just start writing the CREATE TABLE scripts for all the tables in the ODS?
Q2) The tables are relational in the source system, but my guess is that the ODS need not have any relationships among its tables, so I don't have to create any constraints (like PK, FK) while writing the create scripts. Correct me if I am wrong.
Q3) I am going to use SSIS to load the data. How should I create the packages: one package per source, or should I break up the tables, say 5 tables per package? Does this affect performance?
It would be really great if you could answer these questions and also give me some tips/advice.
I owe you guys. Thanks
November 18, 2010 at 1:58 pm
Alright, I'm no design expert, but I'll do my best.
Q1: make your tables as small as possible. For example, use varchar instead of char, and use int, smallint or tinyint instead of bigint. If you can save space and the column still fits your requirements, then save space. Also think about nullability for your columns: does a column need to allow NULL values or not?
Certainly think about placing the right indexes to speed up query responses.
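Something like this (the table, column and index names are just placeholders I made up) shows what I mean by small data types, explicit nullability and a supporting index:
-- Hypothetical ODS table: narrow data types, explicit nullability
CREATE TABLE dbo.CustomerStage
(
    CustomerID    int          NOT NULL,   -- int instead of bigint if the range allows it
    CustomerName  varchar(100) NOT NULL,   -- varchar instead of char to avoid padding
    RegionCode    tinyint      NULL        -- tinyint is enough for a small code set
);
-- Nonclustered index to support the queries you expect to run against the ODS
CREATE NONCLUSTERED INDEX IX_CustomerStage_RegionCode
    ON dbo.CustomerStage (RegionCode);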
There are many references/blog posts on the internet about designing tables with much better advice than I am able to give.
Q2: do create a PK! You must (!) have something that can tell you if a row is unique or not. Again, dozens of blog posts exist on whether you should use an IDENTITY column, a GUID or a natural key. FKs are not strictly necessary if you can trust your ETL to enforce the referential integrity. If you leave them out, it saves you some time when loading a table.
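On the hypothetical table above you could go either way; this is just a sketch:
-- Option 1: natural key from the source system as the primary key
ALTER TABLE dbo.CustomerStage
    ADD CONSTRAINT PK_CustomerStage PRIMARY KEY CLUSTERED (CustomerID);
-- Option 2: surrogate IDENTITY key, if the source key cannot be trusted to be unique
-- ALTER TABLE dbo.CustomerStage ADD CustomerSK int IDENTITY(1,1) NOT NULL;
-- ALTER TABLE dbo.CustomerStage ADD CONSTRAINT PK_CustomerStage PRIMARY KEY CLUSTERED (CustomerSK);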
Q3: regarding SSIS, you can keep it simple. Use the Import and Export wizard to create a package (or multiple packages) that handles simple data loading. You can save the package(s) and manually edit them if necessary. Try to incorporate parallelism, but not too much; your disk I/O should still be able to handle it. You can solve disk I/O problems by cleverly choosing which databases reside on which disk and by using partitioning (different partitions on different disks).
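For the partitioning idea, something along these lines could work; the filegroup names and boundary values are completely made up, so treat it as a sketch only:
-- Filegroups FG_Disk1..FG_Disk3 are hypothetical and would each live on a different disk
CREATE PARTITION FUNCTION pf_CustomerRange (int)
    AS RANGE RIGHT FOR VALUES (1000000, 2000000);
CREATE PARTITION SCHEME ps_CustomerRange
    AS PARTITION pf_CustomerRange TO (FG_Disk1, FG_Disk2, FG_Disk3);
-- Create the table on the partition scheme, partitioned by the key column
CREATE TABLE dbo.CustomerStagePartitioned
(
    CustomerID   int          NOT NULL,
    CustomerName varchar(100) NOT NULL
) ON ps_CustomerRange (CustomerID);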
ps: 2 million rows in 30 minutes, that can go a lot faster 😀 (but it kind of depends on your hardware and network set-up)
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
November 18, 2010 at 3:49 pm
Q1/Q2 - Does assigning PKs/FKs affect my load process? If so, can I create the PKs/FKs after the loading process is over?
Q3- Sounds good. How do I implement parallelism?
I know 30 mins is a lot. I was loading from Oracle to SQL Server; I don't know if that had any part to play. I really appreciate you taking the time to read my big a** post.
November 23, 2010 at 4:21 am
danielagger98 (11/18/2010)
Q1/Q2 - Does assigning PKs/FKs affect my load process? If so, can I create the PKs/FKs after the loading process is over?
Q3 - Sounds good. How do I implement parallelism?
I know 30 mins is a lot. I was loading from Oracle to SQL Server; I don't know if that had any part to play. I really appreciate you taking the time to read my big a** post.
A1 and A2) Procrastination is a bad choice, especially in data warehousing. Once your data is "staged", it should be ready for any and all data transformations; it is considered bad practice to selectively "re-stage" your data for a transformation.
A3) Please follow here
Raunak J
November 23, 2010 at 8:26 am
I don't know if your efforts are for a class project due at the end of the week, or for an enterprise-class data store with a well-defined service level agreement (SLA). If scalability, extensibility or data quality don't matter, or if this is just a one-off task, use the Import/Export wizard.
A1) Consider your SLA and whether your internal/external customers want to see history or last month's report. You may want auditing, so include auditing fields in your tables. In addition, if history is to be maintained, you may want to include new surrogate keys. Also, if you have tables from different sources with the same name, you could build the entities for each source in its own schema in your ODS. That way you won't have table name collisions.
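To illustrate the schema-per-source and auditing ideas, here is a rough sketch; the schema, table and column names are just examples, not a prescription:
-- One schema per source system avoids table name collisions
CREATE SCHEMA OracleSrc;
GO
CREATE TABLE OracleSrc.Customer
(
    CustomerSK   int IDENTITY(1,1) NOT NULL,   -- surrogate key if history is maintained
    CustomerID   int               NOT NULL,   -- key from the source system
    CustomerName varchar(100)      NOT NULL,
    RegionCode   tinyint           NULL,
    LoadBatchID  int               NOT NULL,   -- auditing: which ETL run loaded the row
    LoadDate     datetime2(0)      NOT NULL
        CONSTRAINT DF_OracleSrc_Customer_LoadDate DEFAULT (SYSUTCDATETIME())
);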
A2) Primary keys will help DML and ETL performance while Foreign Keys will ensure data integrity. Don’t assume the source is always pure. If you leave FKs out, just be prepared to manually validate your data for phantom errors.
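If you add the constraints only after the load, or leave the FKs out and validate by hand, it could look something like this (the Region table and RegionCode column are hypothetical):
-- Add the FK after the bulk load; WITH CHECK validates the rows already in the table
ALTER TABLE OracleSrc.Customer WITH CHECK
    ADD CONSTRAINT FK_Customer_Region
    FOREIGN KEY (RegionCode) REFERENCES OracleSrc.Region (RegionCode);
-- Or, with no FK at all, check for orphaned rows yourself after each load
SELECT c.CustomerID
FROM   OracleSrc.Customer AS c
WHERE  c.RegionCode IS NOT NULL
  AND  NOT EXISTS (SELECT 1 FROM OracleSrc.Region AS r
                   WHERE r.RegionCode = c.RegionCode);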
A3) I like to create an SSIS package for each target table that I'm building, but I also use auditing and reassign my surrogate keys as well. Then I create a controller package that executes each of the table packages in the proper sequence, so that referenced tables are loaded before the tables that depend on them. This way, I can create basic template(s), reuse them, and modify just the SQL that performs the actual ETL.
If you're worried about drive space, profile your data sources to see which columns may no longer be used or are obsolete. You can also spot some ETL issues once you see the range of values, etc.; I wouldn't reduce or alter data types in a way that isn't compatible with the range of data in the source.
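A simple profiling query along these lines (again just a sketch with made-up names) shows whether a column is actually used and what range it covers:
-- Per-column profile: row count, NULLs, distinct values and value range
SELECT COUNT(*)                   AS TotalRows,
       COUNT(RegionCode)          AS NonNullRows,   -- COUNT(col) ignores NULLs
       COUNT(DISTINCT RegionCode) AS DistinctValues,
       MIN(RegionCode)            AS MinValue,
       MAX(RegionCode)            AS MaxValue
FROM   OracleSrc.Customer;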
If you are just beginning your project, as it sounds, you may want to consider using/developing your ETL Framework before you get too far down the road.
November 23, 2010 at 9:18 am
Foreign keys may make loading a bit slower, but they can also be used by the query engine to optimize queries into the data. That can mean slower loads and faster queries, so you'll need to judge that based on the expected uses of the ODS. All database work is based on "pay now or pay later", and "pay later" usually means "with interest". Think of the work on a database that way, and a lot of questions will answer themselves. (Slower load = pay now, slower queries = pay later, in case my analogy is too obscure. More work in initial design = pay now, quick design with heavier code work to deal with poor normalization = pay later with interest. And so on.)
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon