Importing to normalized structure

  • I have a DTS package that bulk inserts flat file records into a table (tblStage1), then uses a data pump with transformations to split the data into columns (tblStage2). The problem is that the data is still flat. I took the time to create a normalized structure because the data is extremely redundant, but now that I have the structure, I don't know how to import the data into it. I created a view that looks like the original data, and someone suggested using an "instead of" trigger to insert the data through the view, but I just get more confused when I look up documentation on that. I know this is something that is done every day, but I'm at a loss. Please help!

  • I would guess they meant to say a trigger instead of a view. Once one field in your tblStage2 is populated (obviously you would want to pick a field that is meaningful), the trigger can then populate your normalized table(s) with the fields from that record that you want.

    You probably want to wait until all of your necessary fields are populated before firing the trigger; in other words, put the trigger on the last field populated in tblStage2.

    I wasn't born stupid - I had to study.

  • Actually, there is such a thing as an "Instead Of" trigger. It supposedly can be used to insert data through a view (SQL2K and up). I think I can figure out the trigger part if I read a bit; from what I've read so far, the general shape is sketched after the table list below. My main question is: if I insert just one column, fire the trigger, and then insert the rest of the data into the respective tables, how do I make sure the data of each record stays intact as one record? I cheated on setting up my normalized table structure by creating the single table in Access and running the "Analyze Table" wizard to break it up and create the relationships, then used the Upsizing Wizard to get it into SQL. Here's how my tables are set up:

    tblStage1
    --1 record to one column

    tblStage2
    --10 columns

    tblMAIN
    --1 data column
    --1 ID column
    --9 ID columns, one linked to each of the other tables

    tbl1, tbl2, tbl3, etc.
    --1 data column
    --1 ID column
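
    From what I've read so far, an INSTEAD OF INSERT trigger on a view has roughly this shape (a minimal sketch only; the view name vwFlat and the column names are made up, since I haven't settled on them):

    CREATE TRIGGER trg_vwFlat_Insert ON vwFlat
    INSTEAD OF INSERT
    AS
    BEGIN
        -- "inserted" holds the rows the caller tried to push through the view;
        -- the trigger body decides what actually happens to them
        INSERT INTO tbl1 ([Data])
        SELECT DISTINCT i.[Data1]
        FROM inserted i
        WHERE NOT EXISTS (SELECT * FROM tbl1 t WHERE t.[Data] = i.[Data1])
        -- ...then the same for tbl2, tbl3, etc., and finally tblMAIN
    END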

  • I haven't heard of anything that automatically does that type of transformation. Forget the trigger; it just makes the work more complicated. Add some additional steps to copy the data in tblStage2 to the normalized tables. Unfortunately, you have to write those queries yourself, and they have to handle the IDs that link the tables. A rough sketch of the pattern is below.
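
    Using your tblMAIN/tbl1 layout (the column names here are made up, since the posts don't give them):

    -- Step 1: load each lookup table with the distinct values from staging
    INSERT INTO tbl1 ([Data1])
    SELECT DISTINCT s.[Col1]
    FROM tblStage2 s
    WHERE NOT EXISTS (SELECT * FROM tbl1 t WHERE t.[Data1] = s.[Col1])

    -- Step 2: build tblMAIN by joining each staging row back to its lookups,
    -- so the generated IDs stay together per source record
    INSERT INTO tblMAIN ([MainData], [tbl1ID])
    SELECT s.[MainCol], t1.[tbl1ID]
    FROM tblStage2 s
    INNER JOIN tbl1 t1 ON t1.[Data1] = s.[Col1]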

  • Your "normalized" table structure looks whacky.

  • I would just like to point out that I am posting in the newbie section because I am pretty clueless when it comes to this stuff. I don't see anything wrong with my table structure, but I would welcome any reasoning as to why it is whacky and how it should be done correctly. Also, if you could reply with some examples to send me in the right direction, I would appreciate it. I need a little hand-holding here, sorry.

  • OK, let me start over for clarification:

    I am attempting to create a database to hold proxy log data for dynamic reporting.

    I have imported the flat files into a single-field table using BULK INSERT; a rough sketch of the statement follows the table definition.

    tblStage1
    [rawdata] varchar (8000) NULL
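
    (The path and row terminator here are placeholders, not my real ones:)

    BULK INSERT tblStage1
    FROM 'C:\logs\proxy.log'
    WITH (ROWTERMINATOR = '\n')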

    I then parse the data from tblStage1 into tblStage2 (a parsing sketch follows the column list):

    tblStage2
    [IP] varchar (15) NULL,
    [Username] varchar (20) NULL,
    [Division] varchar (20) NULL,
    [Date] datetime NULL,
    [Time] datetime NULL,
    [URL] varchar (1000) NULL,
    [Code] int NULL,
    [bytes] int NULL
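
    (The splitting is actually done with DTS transformations, but in plain T-SQL, pulling the first space-delimited field out of rawdata looks something like this; the space delimiter is just an example of the log format:)

    -- Grab everything up to the first space (e.g., the IP address);
    -- appending a space guards against rows with no delimiter at all
    SELECT LEFT(rawdata, CHARINDEX(' ', rawdata + ' ') - 1) AS [IP]
    FROM tblStage1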

    I'm not exactly sure what I'm doing at this point but I know I would like to remove the duplication that exists in all fields. This is how I was thinking I should split up the table structure:

    tblURLS
    [URLID] int NOT NULL IDENTITY PRIMARY KEY,
    [URL] varchar (1000) NOT NULL,
    [IPID] int NULL,
    [UID] int NULL,
    [DIVID] int NULL,
    [DID] int NULL,
    [TID] int NULL,
    [CID] int NULL,
    [BID] int NULL

    tblIP
    [IPID] int NOT NULL IDENTITY PRIMARY KEY,
    [IP] varchar (15) NOT NULL

    tblUSER
    [UID] int NOT NULL IDENTITY PRIMARY KEY,
    [Username] varchar (20) NOT NULL

    tblDIV
    [DIVID] int NOT NULL IDENTITY PRIMARY KEY,
    [Division] varchar (20) NOT NULL

    tblDATE
    [DID] int NOT NULL IDENTITY PRIMARY KEY,
    [Date] datetime NOT NULL

    tblTIME
    [TID] int NOT NULL IDENTITY PRIMARY KEY,
    [Time] datetime NOT NULL

    tblCODE
    [CID] int NOT NULL IDENTITY PRIMARY KEY,
    [Code] int NOT NULL

    tblBYTES
    [BID] int NOT NULL IDENTITY PRIMARY KEY,
    [bytes] int NOT NULL

    My major problem here is that I don't know how to get the data from tblStage2 into the normalized structure without screwing up my data. I'm a beginner at SQL, so please provide examples in your post. Any and all suggestions are welcome.
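
    Based on the earlier advice about writing copy queries, I'm guessing they would look something like this against the structure above (am I on the right track? only three of the lookups are shown):

    -- 1) Fill each lookup table with the distinct values from staging
    INSERT INTO tblDIV ([Division])
    SELECT DISTINCT [Division]
    FROM tblStage2
    WHERE [Division] IS NOT NULL

    -- ...same pattern for tblIP, tblUSER, tblCODE, and so on

    -- 2) Build tblURLS by joining each staging row back to its lookup rows;
    -- every join uses that row's own values, so the IDs stay together
    -- as one record
    INSERT INTO tblURLS ([URL], [IPID], [UID], [DIVID])
    SELECT s.[URL], ip.[IPID], u.[UID], d.[DIVID]
    FROM tblStage2 s
    LEFT JOIN tblIP ip ON ip.[IP] = s.[IP]
    LEFT JOIN tblUSER u ON u.[Username] = s.[Username]
    LEFT JOIN tblDIV d ON d.[Division] = s.[Division]
    WHERE s.[URL] IS NOT NULL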

  • The actual data and how it is used really determines how to structure it. Unfortunately, I'm familiar with neither.

    Just looking at tblStage2, I see no reason to split it into multiple tables. The only column of significant size is URL, with its 1000 characters, but that is apparently unique (based on tblURLS) and would not benefit from being in a separate table. The other columns look like attributes of the URL, and they're small enough byte-wise that they act like codes already; they don't need to be represented by yet another code (ID). On the surface, I don't see any relation between the columns that warrants pulling them out into a separate table.

    Perhaps all you need are some good indexes to improve your reporting.
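
    For example (the index names and column picks are just illustrative):

    CREATE INDEX IX_Stage2_Division ON tblStage2 ([Division])
    CREATE INDEX IX_Stage2_Username ON tblStage2 ([Username])
    CREATE INDEX IX_Stage2_Date ON tblStage2 ([Date])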

  • I see what you're saying, but I don't really know how indexes work. I figured I'd split up the data to make queries faster. For example, if I wanted all divisions I could just SELECT Division FROM tblDIV instead of SELECT DISTINCT Division FROM tblStage2 (which would have to look through all the records, and I'm looking at several million).

  • I see your point and it's a good one. Do you need that functionality for date, time, and bytes?

    If you are just reporting, I'd say just use the "select distinct". But maybe you're putting a UI in front of this reporting and letting users select division from a drop-down list. Generally, we have the UI (webserver) just cache the "select distinct" results.

    Is the data always loaded in bulk? Or is the data being continually changed with row-by-row inserts/updates/deletes? Or will the data be bulk loaded initially, with further changes done row-by-row?

  • No, I'm guessing date, time, and bytes wouldn't need that... I was splitting them out just to reduce redundancy, but I guess that's kind of pointless seeing that I would just be replacing them with redundant key values. I would like that functionality with all the other fields, though.

    The reports are dynamic and will have a UI; which UI, I have yet to determine (I'll deal with that later).

    Data is bulk inserted on a weekly basis and records are never changed from their original data.

  • Keep tblStage2. Add an identity column and make that the primary key. Add indexes on the columns you will need for querying - probably ip, username, division, code, date/time.

    If a list of divisions is needed for performance, use "select distinct" to fill a table of divisions: after the weekly bulk insert, truncate that table and refill it. Make division the primary key of that table. Do the same for similar data (ip, username, code). A sketch of the whole setup follows this post.

    I understand the desire to reduce redundancy, but keep in mind complexity. This is a reporting database, so make it work best for reporting. Having all that "normalization" would make the weekly bulk insert difficult and hurt most queries by forcing joins across many tables.
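
    Sketched out, assuming a plain single-column division table (the table, column, and constraint names here are made up):

    -- One-time: give the staging table an identity primary key
    ALTER TABLE tblStage2
        ADD [RowID] int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_tblStage2 PRIMARY KEY

    -- One-time: a lookup table keyed on the division value itself
    CREATE TABLE tblDivision ([Division] varchar(20) NOT NULL PRIMARY KEY)

    -- Weekly, after the bulk insert: rebuild the small lookup for the UI
    TRUNCATE TABLE tblDivision
    INSERT INTO tblDivision ([Division])
    SELECT DISTINCT [Division]
    FROM tblStage2
    WHERE [Division] IS NOT NULL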

  • Thanks for the suggestions. I think I jumped into the normalization just for the sake of it.
