Flat files - any advantages?

  • We have a number of tables that are not normalized.  The person in charge of the database believes that having flat files will increase the efficiency of queries, since no joins need to be performed. 

     

    I am writing the code (in C#) to generate the queries, so having flat files somewhat simplifies my job.  However, to me, it seems that in the long run it is better to have things fully normalized.  It also seems to me that SQL Server, being a relational database engine, is optimized to do basic things like joins. 

     

    So…

    How much of a time hit are joins? 

    What are the advantages of flat files vs. the advantages of having a relational database?
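
    To make the comparison concrete, the two query shapes I'm weighing look roughly like this (the table and column names here are made up, not our real schema):

        -- Flat (denormalized): everything in one wide table
        SELECT OrderID, CustomerName, ItemDescription, Quantity
        FROM   dbo.OrdersFlat
        WHERE  CustomerName = 'Acme';

        -- Normalized: the same result assembled with joins
        SELECT o.OrderID, c.CustomerName, i.ItemDescription, d.Quantity
        FROM   dbo.Orders        o
        JOIN   dbo.Customers     c ON c.CustomerID = o.CustomerID
        JOIN   dbo.OrderDetails  d ON d.OrderID    = o.OrderID
        JOIN   dbo.Items         i ON i.ItemID     = d.ItemID
        WHERE  c.CustomerName = 'Acme';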

     

    Thanks.

     

    David

  • Depends on how much data is required to be kept in the db.  If all you need to save is the address of the company, its phone numbers, etc., I see no point in creating and deploying a full db.  However, if you have any needs beyond those, I'd suggest you use a normalized system right away.  It'll save you a lot of headaches in the long run.

    For reporting or tracking/trending, denormalized is supposed to be the best bet for performance.

    For an application like an Order Fulfillment system, it is a poor fit, and each insert may cause a lot of index movement.

    In the latter scenario, I would suggest he either doesn't understand what he is doing or is being lazy. (Opinion only - please don't repeat that statement as-is; it is very non-PC.)

    Making the data flat pushes a lot more onto the application design requirements and onto the cost of long-term data storage.

    For example, how would you handle an Orders system on a flat table?

    1) You will have multiple rows with many columns holding the same data.

     But how will you ensure the order is properly identified as a unit?

     What steps will be needed to alter the Shipping Address so that no errors occur?

     How will you manage an item being added to the order, and ensure it remains properly related to the existing items?

     Are you really going to store the entire item description in each row of the order?

     How will your application on the fulfillment side decide which address to use if there is a difference where there shouldn't be one (and what if there should be)?

    These questions can be addressed far more fluidly in a normalized system (especially the last one, if multiple shipping addresses are allowed), and with a lot less work than it will take in the denormalized version. Just consider how you would address them in a normalized state.
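
    For contrast, here is a minimal normalized sketch (the table and column names are purely illustrative) that answers most of those questions by construction:

        CREATE TABLE dbo.Orders (
            OrderID        INT IDENTITY PRIMARY KEY,   -- the order is identified as a unit
            CustomerID     INT NOT NULL,
            ShippingAddrID INT NOT NULL                -- change the shipping address in one place
        );

        CREATE TABLE dbo.Items (
            ItemID      INT IDENTITY PRIMARY KEY,
            Description VARCHAR(200) NOT NULL          -- stored once, not repeated per order row
        );

        CREATE TABLE dbo.OrderDetails (
            OrderID  INT NOT NULL REFERENCES dbo.Orders(OrderID),
            ItemID   INT NOT NULL REFERENCES dbo.Items(ItemID),
            Quantity INT NOT NULL,
            PRIMARY KEY (OrderID, ItemID)              -- a new item stays tied to its order
        );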

  • IMHO... if he wants flat files, why bother with the expense of an RDBMS???  Store it in a FILE, maintain it with a good word processor, and access it with binary FSOs.  Or a simple spreadsheet may be all that's necessary.

    If you want to maintain orderly data, then normalize it and put it in a database.  In a properly formed database, joins are not all that big a performance hit and the structure of the database makes the data a heck of a lot easier to maintain... just not in flat files.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I recently read somewhere that if your DB is used primarily for analysis and reporting and you frequently have to join more than 4 tables, then you should consider denormalization to improve performance.  However, the "penalty" for going down this path is additional work to maintain data consistency.  If Cust_Name appears in 4 tables, you have to have some way of keeping it the same in all tables if the data is updateable.  If it's just a read-only snapshot, then it's probably not such a big deal.

    If your DB is primarily used to record transactions, i.e., many inserts/updates, then fully normalized is the way to go.

    Then again... size matters.  If it's not a very large DB to begin with, then normalize the tables.  The performance impact of multiple joins would be negligible and you avoid the pitfalls of data inconsistency.  It's why we have relational DBs.
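
    To make the maintenance penalty concrete (hypothetical table names): if Cust_Name is repeated across tables, a name change has to touch every copy, whereas the normalized design stores it exactly once.

        -- Denormalized: the same change must be applied to every table that repeats the name
        UPDATE dbo.SalesFlat     SET Cust_Name = 'Acme Ltd' WHERE Cust_ID = 42;
        UPDATE dbo.ShipmentsFlat SET Cust_Name = 'Acme Ltd' WHERE Cust_ID = 42;
        -- ...and so on for the other tables holding Cust_Name

        -- Normalized: one row, one update
        UPDATE dbo.Customers SET Cust_Name = 'Acme Ltd' WHERE Cust_ID = 42;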

  • At a very high level, OLTP applications benefit from normalized data, while OLAP applications benefit from denormalized data.

    I am fortunate to have had the experience of moving from the flat-file COBOL mainframe era to the DBMS/RDBMS/ODBMS era.

    Flat files are definitely faster - not just because they are not normalized, but because they do not carry the overhead of an RDBMS engine.

    Flat files do not mean a Notepad, Word, or Excel file, by the way. Flat files also need indexing, and in the system I worked on in the 80s, the mainframe provided that support.

    Though every one of us knew that an RDBMS is slower than flat files, we adopted Codd's relational theory because it eliminated one of the major, intolerable business problems: the curse of data inconsistency. The next thing the RDBMS tried to eliminate, and has been reasonably successful at, is data dependency. Remember how data structures were embedded into COBOL programs? The same data could be represented as different data types in different programs.

    I am not sure whether the world is moving towards XML databases, but at least for data transfer we are already there. And what is XML? In my opinion, it is nothing but a glorified hierarchical flat file.

    I have cut a long story short. Just to make it very clear: data consistency and integrity are more important than speed, and that is the reason why we should be, and still are, using normalized databases. There is no viable alternative to that technology yet. But where speed is the concern, as in today's high-volume Data Warehouse applications, a diluted version of the normalized database is used. It is still not completely denormalized.

    Hope this helps.

    jambu

  • Our database is used primarily for querying.  There is no updating within the application that accesses the database.  We get the data from varied sources.  Unfortunately, there is no control over the format in which we receive it.  Much of it is in dbf format, and that file structure can change on the whim of the provider.

    Queries can take hours to run, and performance is the major issue.  To be honest, there is not a lot of relational data.  The joins would not be very deep (not 4 tables), but they probably would impact performance.  And there is the maintenance issue - we would have to convert varied non-relational data to relational form each time we get it.

    I am not a DBA, but my first thought was that normalized is always better. We do have normalized data within the database and I thought that we should normalize more.  Now, I am not so sure.  Data consistency is not too much of a problem for us.  Values that are out of range are much more of a problem and checking would not be helped by normalization.

    Thanks for all your comments.

    David

  • Values that are out of range are much more of a problem and checking would not be helped by normalization.

    It would help, rosedd. Values that should be within a specific range should be stored in a normalized table and looked up. Values not being within a range are a form of data inconsistency, if I am not wrong.

     

  • Yes, it is a form of data inconsistency.  But how would normalization help check that a float field contains values between 10.5 and 20.5?

  • Got you. If that is the case, normalization is not the answer. A control file (table) is the answer.

  • Jambu,

    What exactly is a control file?  Is it just a file with the value ranges that can be used in a query to pull out rows with values that are not within range?

    David

    Yes, of course.

    I agree this could just as easily be handled in your programming logic as well.
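
    A rough sketch of the control-table approach (table and column names are made up):

        -- Control table holding the allowed range for each checked field
        CREATE TABLE dbo.RangeRules (
            FieldName VARCHAR(50) PRIMARY KEY,
            MinValue  FLOAT NOT NULL,
            MaxValue  FLOAT NOT NULL
        );

        INSERT INTO dbo.RangeRules VALUES ('Reading', 10.5, 20.5);

        -- Pull out the rows whose value falls outside the allowed range
        SELECT d.*
        FROM   dbo.ImportedData d
        JOIN   dbo.RangeRules   r ON r.FieldName = 'Reading'
        WHERE  d.Reading NOT BETWEEN r.MinValue AND r.MaxValue;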

  • If you use BCP and a format file to import to a table, you don't need the "control file"... just add the necessary constraints to the table and set up BCP to write exceptions to an errata file.

    You are all correct... if the data being imported is just going to be used for reporting, then denormalization will certainly work fine... so would a good spreadsheet.
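
    Roughly like this (database, file, and constraint names are made up; the 10.5-20.5 range comes from the earlier post).  Note that bulk loads skip CHECK constraints unless you ask for them with the CHECK_CONSTRAINTS hint:

        -- Constraint on the target table enforces the valid range
        ALTER TABLE dbo.ImportedData
            ADD CONSTRAINT CK_ImportedData_Reading CHECK (Reading BETWEEN 10.5 AND 20.5);

        -- Import with a format file; exceptions are routed to the errata file
        bcp MyDb.dbo.ImportedData in data.txt -f data.fmt -e errata.txt -m 100 -S MyServer -T -h "CHECK_CONSTRAINTS"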

    --Jeff Moden



  • If join performance is a concern for a normalized design, there are a lot of things to check before going to a denormalized structure.

    Do the indexes allow efficient joins?  Does the execution plan show merge joins or hash joins?
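
    For example (hypothetical names), a join on OrderDetails.OrderID is only likely to be efficient if there is a supporting index on the join key:

        -- Index the column used as the join key
        CREATE NONCLUSTERED INDEX IX_OrderDetails_OrderID
            ON dbo.OrderDetails (OrderID);

        SET STATISTICS IO ON;   -- compare logical reads before and after adding the index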

    Does the server have multiple data drives?  Can you use filegroups to place tables, indexes, and tempdb on separate drives for max I/O performance?  Does the server have enough RAM?  Try moving to a 64-bit system with total RAM greater than the combined size of the tables and see what happens.

    If the flat file represents some kind of master/detail relationship, the master fields make up a large portion of the total record, and there are a lot of detail records for each unique master record, then the normalized structure will be a lot smaller than the flat structure.  Fewer data pages usually translate into faster queries.
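
    As a made-up illustration: with 200 bytes of master fields, 20 bytes of detail fields, and 100 detail rows per master record, the flat layout stores 100 x 220 = 22,000 bytes per master, while the normalized layout stores 200 + (100 x 20) = 2,200 bytes - roughly a tenth of the pages to read.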

    You could say that a denormalized structure is sometimes faster.  You could say that denormalized is often faster for reporting applications.  But someone who says denormalized is always faster is not analyzing the problem at hand, they're just mindlessly reciting a rule of thumb they heard somewhere.

    If you think a denormalized structure will be faster for your reporting queries but you don't want to give up the data integrity of a normalized design, you might also consider using indexed views.  Note that this will require extra disk space and will slow down updates and inserts.
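
    A rough sketch of an indexed view (the names are placeholders, and it assumes the detail columns are declared NOT NULL):

        CREATE VIEW dbo.vOrderTotals
        WITH SCHEMABINDING
        AS
        SELECT  o.OrderID,
                SUM(d.Quantity) AS TotalQty,
                COUNT_BIG(*)    AS NumRows      -- required when an indexed view uses GROUP BY
        FROM    dbo.Orders o
                INNER JOIN dbo.OrderDetails d ON d.OrderID = o.OrderID
        GROUP BY o.OrderID;
        GO

        -- Materializes the view; queries can then read the pre-joined, pre-aggregated rows
        CREATE UNIQUE CLUSTERED INDEX IX_vOrderTotals ON dbo.vOrderTotals (OrderID);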

  • Ok, I'm a bit confused... Rose posted "Queries can take hours to run, and performance is the major issue.  To be honest, there is not a lot of relational data."

    I'm thinking that normalization/denormalization is NOT the real issue here... I'm thinking that there's code with a lot of RBAR in it (pronounced "ree-bar", a "Modenism" for "Row By Agonizing Row").

    Rose, it might help if you posted one of your more troublesome queries... post the CREATE TABLE code for the tables that are used, as well.  Indexes will be of prime concern as well... you may want to post that code, too.

    --Jeff Moden


