One table or two?

  • I have an outside company that will be sending me 2 files each month.  File 1 has approx. 280 columns, File 2 has around 330.  The first 150 column names on each file are exactly the same.  The remaining columns on each file are unique to that file (no overlapping column names).  There is no one-to-one match of data between the 2 files, so I can't combine records.

    My initial thought is to combine these 2 files into 1 SQL table.

    150 columns shared by both files.

    130 columns only holding File 1 data.

    180 columns only holding File 2 data.

    10 additional columns I would add that would be filled in by each load.

    I would let the columns not being filled in for that load to default to NULL.

    Just curious how others would handle this situation.  Downsides of such a large table?  Approx. 470 columns wide.  This is a new company, so I don't know how many rows of data we will be getting each month.  I'd guess around a thousand on each file, at least initially.  Any thoughts or issues you've run into are appreciated.
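
    Roughly, a stripped-down sketch of the one-table layout I have in mind (column names and datatypes here are just placeholders, not the real 470-column layout):

    CREATE TABLE dbo.Source_Combined
    (
        RecordId      INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        -- ~150 columns shared by both files
        SharedCol001  VARCHAR(50)  NULL,
        -- ~130 columns populated only on File 1 loads (NULL on File 2 loads)
        File1Col001   VARCHAR(50)  NULL,
        -- ~180 columns populated only on File 2 loads (NULL on File 1 loads)
        File2Col001   VARCHAR(50)  NULL,
        -- ~10 columns I would add, filled in on every load
        SourceFileNo  TINYINT      NOT NULL,   -- 1 or 2
        LoadDate      DATETIME     NOT NULL DEFAULT GETDATE()
    );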

    -------------------------------------------------------------
    we travel not to escape life but for life not to escape us
    Don't fear failure, fear regret.

  • Even if a table is fairly wide, applications can always run against covering indexes and indexed views.  The key question is whether the data from the first file has to be used in conjunction with data from the second file.  If not, then just keep two separate tables.
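
    For example, a report that only needs a handful of columns never has to touch the wide row at all if a covering index exists for it (the table and column names below are made up for illustration):

    -- A nonclustered index that "covers" the query: the key columns plus the
    -- INCLUDEd columns are everything the report asks for, so the wide
    -- base row never has to be read.
    CREATE NONCLUSTERED INDEX IX_File1_Covering
        ON dbo.File1_Data (PolicyNumber, EffectiveDate)
        INCLUDE (PremiumAmount);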

    You said "There is no one-to-one match of data between the 2 files, so I can't combine records."   If you can't combine two rows into one, there is no reason to combine the data into a single table.   That  both tables contain similar information (same column names) is NOT significant.

     

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • I would never combine these tables.

    I'd be much more concerned with determining the best clustered keys for each table.  Or whether, perhaps, they would be best as heaps (individual rows are almost never read; the whole table is pretty much read in full every time).  It doesn't sound like that from your text so far, but there aren't really a lot of details yet.
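
    Just to illustrate the two directions I mean (the names are hypothetical, since we don't know your access paths yet):

    -- If rows are mostly fetched back by the load they came in on, something
    -- like this might make a reasonable clustered key:
    CREATE CLUSTERED INDEX CIX_File1_History
        ON dbo.File1_History (LoadDate, RecordId);

    -- If the table is only ever read in full, skip the clustered index
    -- entirely and leave it as a heap (a table with no clustered index).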

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

    I'll be using this data to add entries to our data warehouse tables.  After it has been used to fill in the data warehouse tables, most of the time it will not be touched again.  We are keeping this data just in case someone needs it down the road.  In my transaction table I have a field that captures the RecordId (identity column) from the 'source' table.  This is there so we can easily join back to the 'source' table record to get any additional data not carried in the data warehouse tables.  I guess I was thinking of one table so anyone that does need to join back to the 'source' only has to worry about one table to join to.

    -------------------------------------------------------------
    we travel not to escape life but for life not to escape us
    Don't fear failure, fear regret.

  • below86 wrote:

    I'll be using this data to add entries to our data warehouse tables.  After it has been used to fill in the data warehouse tables, most of the time it will not be touched again.  We are keeping this data just in case someone needs it down the road.  In my transaction table I have a field that captures the RecordId (identity column) from the 'source' table.  This is there so we can easily join back to the 'source' table record to get any additional data not carried in the data warehouse tables.  I guess I was thinking of one table so anyone that does need to join back to the 'source' only has to worry about one table to join to.

    I'm thinking that you probably don't store the data derived from these two files in a single DW table.

    I guess it all boils down to what the datatypes of the nullable columns would be and how much space you can afford to waste.  Also keep in mind that wider tables will always be slower because you have to read more pages to get the same number of rows.  That's part of why people don't store everything in a single table in a DW.  Personally, I have a bit of a hate problem with wide tables/rows and partially NULL columns, but they're frequently unavoidable (like with Call Detail Records and similar).

    As for the "Just in Case" aspect of all of this, I'd definitely create a partitioned table (1 FileGroup/File per month, for example) and store that in a separate database.  It would be a huge waste of time and resources to back up the whole thing each month, so I'd make the older months' FileGroups READ-ONLY so that I could get away with backing them up "one final time" at the start of a new month and never have to back up again that which will never change.  Of course, you should remove any free space from the FileGroups before you set them to READ-ONLY.  That takes a bit of a trick, especially since you also want to make sure the indexes on the FileGroup are totally defragged with a FillFactor of 100.
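
    Something along these lines at each month-end turn-over (the database, FileGroup, table, and path names below are made up for illustration):

    -- Defrag the indexes that live on the FileGroup being closed out, with no
    -- free space left in the pages, BEFORE the FileGroup goes READ-ONLY.
    ALTER INDEX ALL ON dbo.File1_History REBUILD WITH (FILLFACTOR = 100);

    -- Mark the prior month's FileGroup READ-ONLY so it can never change again.
    ALTER DATABASE SourceHistory MODIFY FILEGROUP FG_2023_01 READ_ONLY;

    -- Back that FileGroup up "one final time"; future backups can skip it.
    BACKUP DATABASE SourceHistory
        FILEGROUP = 'FG_2023_01'
        TO DISK = N'D:\Backups\SourceHistory_FG_2023_01_Final.bak';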

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks for the feedback.  Like I said, combining them into one was my initial thought, but the more I thought about it, the more I leaned toward 2.  That's why I posted the question here; I wanted a second opinion.

    I will go with the 2-table approach.  Maybe I will load the 2 files as separate batches (so they each get a unique BatchId).  This way, when someone does need to join back from the transaction table to this 'History' table, they will get the right records.  The users will then need to set up the joins to both tables matching on the RecordId and the BatchId so it is a one-to-one match.
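
    So the join back would look something like this (the table and column names here are placeholders, not our actual warehouse names):

    SELECT  t.TransactionId,
            h.*                          -- the extra columns not carried in the DW
    FROM    dbo.DW_Transaction AS t
    JOIN    dbo.File1_History  AS h
              ON  h.RecordId = t.SourceRecordId
              AND h.BatchId  = t.SourceBatchId;   -- RecordId + BatchId = one-to-one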

    Certain users will need to join back to these 'history' tables to grab data not stored in our warehouse in order to do regulatory reporting on these lines of business.

    Thanks again

    -------------------------------------------------------------
    we travel not to escape life but for life not to escape us
    Don't fear failure, fear regret.
