normalizing data in csv file into relational tables using SSIS

  • I have a csv file with data which is represented in the table as follows

    Create table test1 (

    Sid int

    ,Name varchar(20)

    ,Gender Char(1)

    ,siteid varchar(20)

    ,sitename varchar(20)

    ,readingtestdate smalldatetime

    ,readingscore int

    ,mathtestdate smalldatetime

    ,mathscore int)

    Insert into test1 values(1,'ABC','M','M01','XX','5/08/2009',90,'',0)

    Insert into test1 values(1,'ABC','M','M01','XX','',0,'06/08/2009',95)

    Insert into test1 values(2,'DEF','F','VV1','VV-01','04/06/2009',88,'',0)

    Insert into test1 values(2,'DEF','F','VV1','VV-01','',0,'05/06/2009',77)

    Now I need to load the data into staging table along with row_number and then after loading the staging I have to use the staging table and I need to load the data into normalized tables like site table (siteid,sitename) and student table(sid,name,gender,siteid),exam table(sid,readingtestdate,readingscore,mathtestdate,mathscore).

    Please help me how to load into normalized tables using SSIS.

    If at all I am not clear please let me know

    Thanks,

    sai

  • Your question is a bit too broad. It sounds like you have decided on a process for your SSIS package, ie, something like:

    1) Truncate staging table

    2) Load from CSV into staging table

    3) Load from staging table into production tables

    and yes, SSIS can do all of that - but which part do you need help with?

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Hi,

    First of all thanks for giving your attention towards my question. As I explained the test1 table I created actually is csv file in which I have the fields which are in denormalized form. Now what I am doing is I am loading the csv file data into staging table. Now after loading I have to use the staging table data and I need to load the data into normalized tables.

    I am not understanding how I need to load the data from staging table into relational tables. In short I want to know whether I need to load site table first because this is master table and I am using siteid in student table ....and then do I need to load student table because exam table is having student id (sid) acting as foreign key.

    This whole thing should be created in one package or is there any other option.

    I want to know what transformations I have to use and what is the sequence of these transformations and how I need to use them.

    Thanks,

    Sai

  • Aha, OK, now I understand.

    You need to load your master data first (ie the stuff on the 'one' side of one-to-many relationships) - Site then Student then Exam, by the look of it, so that your foreign keys are available when needed and to avoid problems with referential integrity.

    One way of doing this is having a stored procedure which performs all the updates and inserts for you. SSIS can then just call this sp as its final step after importing the CSV data.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • hi Phil,

    I am sorry that I couldn't understand your point. I want to perform this using SSIS transformations. So, please let me know whether can I do with SSIS or not. If not please explain how can I do this using Stored procedure.

    Thanks,

    saidwarak

  • OK, sorry if I was not clear.

    Once your data is in a staging table, there will be a series of SQL statements which need to be executed to INSERT or UPDATE that data into your production database tables. Personally, I would not use an SSIS dataflow for this, but I would write the INSERTS and UPDATES into a stored procedure.

    This stored procedure can then be called by your SSIS package using an Execute SQL task.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • hi,

    Can I use multicast transformation to load data from staging table into production tables.

    Thanks,

    saidwarak

  • Not really, because you do not have all the foreign key information that you require at that point.

    Also, is there ever a need to update existing rows in the database, or is everything always just inserted?

    If you are inserting only, you could create a dataflow to insert into site, then another dataflow to insert into Student (with a lookup transformation to get the FK to site), then a dataflow to insert into Exam, with more lookups as required to get foreign keys.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • I hope that I'm not adding to your confusion, but keep in mind that SSIS is a full fledged ETL tool so you don't really need to load a staging table first. Most transformations can and should be done in the data flow pipleline. Using staging tables works and makes transitioning from DTS to SSIS easier, but the goal of SSIS is to Extract, Transform, and Load the data not Exract, Load, Transform, and Re-load.

    The more work you can keep in SSIS and not burden the database instance with, the better your SSIS solution will run and scale.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • Hi John

    If we take away the staging table and do the import directly, using SSIS, how best do you deal with all of the FKs that need to be populated? I am talking specifically about this example.

    It seems to me that it would require several passes of the input file to do this, whereas the staging table approach allows a set-based solution to the FK issue which I imagine to be somewhat faster.

    Phil

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Hi phil & john,

    Thanks for your replies and I'll be in touch with you guys once the problem is solved as per your recommendations.

    Thanks,

    sai

  • Hi Phil,

    Great question. First off, I just wanted to say that there is nothing wrong with the staging table approach. I just wanted to point out that all ETL tasks can be done within SSIS. Most people are still using the DB to store working data or stage data, and this is not wrong. I only wanted to convey the message that all ETL can be done within SSIS, making it a stand-alone process without dependancy on a database. This allows for greater portability and less burden on the destination SQL instance.

    In this example, there would be a couple of ways to do this. First, as you guessed, would be to set up a data flow for each entity (Site, Student, Exam) and pass through the source file once to load each entity. The control flow would make sure that each data flow would be called in the correct order to ensure that data dependancies are taken care of. As you stated, this may not be as efficient as using the staging table approach.

    Another method would be to pass through the source file one time and use the Raw File destination to store off each logical entity in their own raw file. Then use the raw file as the source for the data flows for the corresponding entities. This method would reduce access to the source file to 1 pass. Granted you would need to open/close the raw files for loading the entities, but by eliminating duplicates as you create the raw files (the denomalized source file is most likely rampant with redundant data), you should be greatly reducing the size of the raw files that the data flows need to process. Is this better than using a staging table? I'm not sure. Much of this depends on how portable you want to make your SSIS packages and how important it is to reduce the ETL load on your database instance.

    Just a side note, both solutions (staging table versus SSIS data flows) may be over kill for this example. I for one may be tempted to create a view on the underlying tables and put an instead of insert trigger on the view and use bulk insert to import the rows. This would allow you to put the individual entity inserts into a stored procedure and give your set based access to the data set. The view itself wouldn't be used for data access, just as in import interface for bulk inserting from the file.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • Thank you, John, for the considered and informed response.

    All of that makes sense. But I need to get another strong coffee and read the Raw file stuff again to make sure I understand - never used those before:-)

    Phil

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Coffee....mmmm.

    I hate to recommend web resources other than SSC because it is by far the best, but since Jamie Thomson requents SSC as well, I'll give him due credit. His blog SSIS Junkie is one of the best SSIS resources that I've found. Raw files are the SSIS intended method to share data between data flows. Here's a good read on raw files:

    http://blogs.conchango.com/jamiethomson/archive/2006/12/12/SSIS_3A00_-Dropping-data-into-a-raw-file.aspx

    He's has many other good blog posts with infromation on raw files.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply