Lookup Table Madness

  • "It depends on the information you are trying to store. If the clinical trial data has any meaning at all, it has to be defined, and if it is defined data, then it must be modled if it is to have any useful persistence."

    Clinical trial data is extensively defined, with very detailed, meticulously maintained metadata repositories. In fact, what you optimize for first in clinical trial apps is the referential integrity of the 30 to 40 entities they generally contain. The business rules that are usually called the "middle tier" are also stored in the database.

    It's just that, in the real world, the price of maintaining those thousands of distinct tables, as well as the dynamic SQL they would require, isn't worth the purity.

  • I can see a few issues with MOCK tables:

    1. Extensibility:  If different information is kept in one table, what happens when, after a few years, you find that you have to add an additional attribute to one set of data? What if you then find that a second set needs another attribute? What if tomorrow you find that one set has to be reorganized as hierarchical data? Unless you separate the data into different tables, you are unnecessarily losing space and creating unwanted fields, which may lead to bugs.

    If the database grows, you can't move the related tables into one file (e.g. employee and employee status together) unless you again partition the table (which goes back to the same point). Files are not part of the relational concept, but they are part of any database.

    2. Availability:  You are creating a potential hot table. All the details (not literally all, but in this context) live in one table; whatever details you add, you will basically be updating the same table. At times, when you update one type of information, the DBMS may decide to use a table lock instead of row or page locks. That means other users who want to access the same table have to wait, simply because the other information, even though it is different, is placed in the same table.

    3. Manageability:  Each time you create a join, you also need to add the type information. That means your queries become bigger and more complex, which leads to more chances for errors.

    4. Security:  If you want to give different rights to different users based on the type, what will you do? Create views to horizontally partition the data? (Most of the time we won't have such strict security requirements; that's why I am placing this last.)

    The only possible use I can see for a MUCK table is this: storing dynamic classifications (call them dimensions if you like) of a SINGLE entity. All such classifications could be stored in a single table. But for different entities, it is always better to use different tables.
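    The single-entity exception above can be sketched concretely. This is a minimal, hypothetical example (table and column names are my own, not from the thread) showing that when every classification row points at the same entity, a real foreign key can still be declared and enforced:

```python
import sqlite3

# Hypothetical sketch: one table holds every dynamic classification of
# Product rows, instead of one table per classification.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
CREATE TABLE Product (
    ProductID INTEGER PRIMARY KEY,
    Name      TEXT NOT NULL
);
CREATE TABLE ProductClassification (
    ProductID      INTEGER NOT NULL REFERENCES Product(ProductID),
    Classification TEXT NOT NULL,   -- the 'dimension', e.g. 'Color'
    Value          TEXT NOT NULL,
    PRIMARY KEY (ProductID, Classification)
);
""")
conn.execute("INSERT INTO Product VALUES (1, 'Widget')")
conn.executemany(
    "INSERT INTO ProductClassification VALUES (?, ?, ?)",
    [(1, "Color", "Red"), (1, "Size", "Large")],
)

# Because every row classifies the same entity, referential integrity
# survives the consolidation -- the property a multi-entity MUCK loses.
rows = conn.execute(
    "SELECT Classification, Value FROM ProductClassification "
    "WHERE ProductID = 1 ORDER BY Classification"
).fetchall()
print(rows)  # [('Color', 'Red'), ('Size', 'Large')]
```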

     

    Cheers,
    Prithiviraj Kulasingham

    http://preethiviraj.blogspot.com/

  • Saying that the database organizes data that reflects some part of the "real world" in no way implies that the things that the database manages information about have to be physical "entities."  More often they are conceptual "things" that can't be seen or touched, but they must be defined.  Furthermore, databases reflect a highly subjective slice of reality in that it is impossible for there to ever exist a universal definition of "Customer," to use your example.  Every organization will choose to view things in a particular manner, and THAT is what the database should faithfully represent.

    You are also confusing the Relational Model of Data Management with our currently available tools that happen to be SQL DBMSs.  These two are not the same thing at all.  SQL (SQL Server included, but not exclusively) is a very poor implementation of the RM as proposed by Codd.  So don't blame the limitations of SQL on the RM.

    Having said that, I understand that we have to live in the real world and that there are some very real limitations imposed by the tools we have at our disposal.  My primary concern is that when compromises have to be made by knowledgeable people, there are at least 10 others who see those compromises and think that it is an "elegant solution" as opposed to the ugly kludge that it really is.  Those are the types of folks who will run off and eventually create the "Thing" table or XML table that I sardonically mentioned in the article.

    I also understand that the RM, even if fully implemented, would not be a panacea.  The real world is often very complex and difficult to model, but don't confuse the difficulty of modeling that which is inherently complex with some weakness in the RM.  Ultimately the RM (and to a lesser extent SQL) is about making it easier to model reality by declaring the relationships and constraints that exist.  Modeling these constraints in the database helps to preserve the meaning of the data; the further removed those constraints are from the data, the more chance there is for misuse and ambiguity.

    /*****************

    If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek

    *****************/

  • You should never store different entities as the same entity, no matter how you store them: XML, different relations, an E-A-V table, a flat file, whatever.

    If performance is important, then you should probably store them as separate tables. If, however, performance isn't so important (I realize that that may be a foreign concept to lots of you, but it does happen), then an E-A-V table may make sense.

    While MUCKing can happen however you store the data, it is more likely to happen with lookups stored as E-A-V pairs; therefore, if you need to do that, then it pays to be extra-vigilant in order to avoid that situation.

  • Extensibility can easily be addressed by factoring those particular "lookup" values out of the main table when they need more than just an ID and a name. Presuming you have an abstraction layer between frontend clients and the database, I have found this not to be a problem. Just because you use a consolidated table does not mean you have to use it for everything.

    I have never found availability, or more specifically contention, to be an issue, as the structure of such a table is narrow (generally only three columns) and short in the grander scheme of things. If you have 100 categories with 20 items each, that is only 2,000 rows, but it would require 100 tables in a relational design.

    In terms of manageability, I find queries to be fractionally more verbose than would normally be required and really not any more complex. Again, if you are using stored procedures for all database requests and a middle tier, this is very simple. You can even get away with not checking the category of the "lookup" value in your queries if you can rely on the quality of your middle tier.

    Admittedly, if security is an issue, a consolidated table makes this a bit more difficult. Generally, the consolidated "lookup" table approach works when there are no security restrictions on the lookup values themselves. My inclination would be to factor out those lookup values that require separate security into separate relational tables as necessary.

    Consolidated tables work well when there are numerous, short, identically structured, somewhat nonvolatile lookup items used for various purposes by various GUIs such as categorization or drop-down list type markers (used in lieu of the user typing anything at random). If a given category has additional properties or security requirements, then it makes sense in those cases to break them out into their own table.
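    The narrow three-column shape described above can be sketched as follows. This is an illustrative example only (the table and category names are mine), showing why such a table stays small and why a drop-down queries one category at a time:

```python
import sqlite3

# A minimal sketch of a consolidated three-column "lookup" table;
# names are illustrative, not from any particular system.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Lookup (
    Category TEXT NOT NULL,   -- e.g. 'OrderStatus', 'Priority'
    Code     INTEGER NOT NULL,
    Name     TEXT NOT NULL,
    PRIMARY KEY (Category, Code)
)
""")
conn.executemany("INSERT INTO Lookup VALUES (?, ?, ?)", [
    ("OrderStatus", 1, "Open"),
    ("OrderStatus", 2, "Closed"),
    ("Priority",    1, "Low"),
    ("Priority",    2, "High"),
])

# A GUI drop-down asks for one category at a time, so the consolidated
# table serves many screens without 100 separate tables.
statuses = [name for (name,) in conn.execute(
    "SELECT Name FROM Lookup WHERE Category = ? ORDER BY Code",
    ("OrderStatus",),
)]
print(statuses)  # ['Open', 'Closed']
```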

  • I suppose I do not see the world of solutions as black-and-white. I make a concerted effort to structure my database using normalization; however, if an EAV solution (of which MUCK tables are a type) makes sense in a narrow aspect of the system, then I'll use it (albeit with trepidation).

    For example, I worked on a system five or six years ago for a company that made software that manages film distribution contracts. The database was a standard RM model and well normalized. A common occurrence was that customers would ask to store some value or set of values merely for tracking purposes. They wanted to see the value on a report and have it associated with a contract, but that was the extent of it. Adding a column or a new table for every customer, every time they wanted to store a new widget, would have been a maintenance headache. Furthermore, since this was common, we wanted our support people to be able to add storage for these types of things in the field without having to change the structure of the system.

    So, I created a small EAV aspect to the system called "other data" in which the user could store whatever they wanted. It was structured such that they had to formally add the "type" of thing and its data type (albeit string was one of the types) to a maintenance grid, but other than that it was wide open. When a given customer wanted to do something with a given value, like a calculation, or to see it in a special way on a report, or anything beyond just a grid of values, we would then formally evaluate its specification and incorporate it into the design as a column or table. 95% of the data that was stored in this "other data" section never left there; only 5% or less of those values were "upgraded" into the formal RM database design. If I had looked at the world as black-and-white (either all RM or all EAV), it would have never worked.
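    The "other data" arrangement described above can be sketched roughly like this. The original system's schema isn't given, so every name here is hypothetical; the point is only that values must reference a type registered first in a maintenance table:

```python
import sqlite3

# Rough, hypothetical sketch of an "other data" EAV corner: a type
# registry (the "maintenance grid") plus a value table keyed to it.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE OtherDataType (
    TypeID   INTEGER PRIMARY KEY,
    Name     TEXT NOT NULL UNIQUE,
    DataType TEXT NOT NULL CHECK (DataType IN ('string','number','date'))
);
CREATE TABLE OtherData (
    ContractID INTEGER NOT NULL,
    TypeID     INTEGER NOT NULL REFERENCES OtherDataType(TypeID),
    Value      TEXT NOT NULL
);
""")
# The maintenance-grid step: the "type" must be registered before any
# values of that type can be stored.
conn.execute("INSERT INTO OtherDataType VALUES (1, 'Print Count', 'number')")
conn.execute("INSERT INTO OtherData VALUES (42, 1, '350')")

# The report side is just a dump: a grid of name/value pairs per contract.
report = conn.execute("""
    SELECT t.Name, d.Value
    FROM OtherData d JOIN OtherDataType t ON t.TypeID = d.TypeID
    WHERE d.ContractID = 42
""").fetchall()
print(report)  # [('Print Count', '350')]
```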

    I suppose my recommendation to developers would be to prefer normalization unless you have a very good reason to deviate from it. "MUCK" tables can sometimes (note: not always, not even regularly...just sometimes) be included in those good reasons for deviating from the standard RM model.

  • Thomas--my point exactly.

  • Your example is almost exactly the same as several that I have seen, and in cases with those very narrowly defined criteria, I have used MUCK tables.  They work for that purpose, but outside of those bounds, they quickly fall apart.  Like I said in an earlier post, this kind of data is what I refer to as "shoe size" data: data that you acquire and want to keep, but don't use to make any real decisions on; it's strictly incidental information.

    /*****************

    If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek

    *****************/

  • I see both sides of the argument, except where people claim a MUCK table will make their GUI easier.  I tend to dislike direct mapping of controls such as grids to the database tables - it makes things difficult to change, locks the app into a 2-tier solution at best, and doesn't allow the GUI screens to be customised for editing particular entities.  In any case, I doubt you would have every type of entity listed on the screen at once, so why not point your "one screen fits all" model at different tables as required?  I fall into the anti-MUCK camp except when, as mentioned previously, there is a narrow requirement for "user defined fields" such as additional customer values.  I see no reason why the DB schema should change to fit the requirements of a single customer.

  • After having performed some data extraction work yesterday on a Paradox DB that made use of MUCK tables, I definitely am not in favour of them - at least not to the extent that this programmer used them.  It was a healthcare application from 7-8 years ago.  Data such as a patient's Medicare number (which is like an SSN, but for Australian health authorities), a definite attribute of the patient that you would always capture and even search on, was stored as some kind of name + value pair - I had to join 4 tables to get the info!  Yuk.  When it's a definite attribute that all users of the system should capture, I don't see why a MUCK table is appropriate.  For customer-defined fields, MUCK is probably the way to go.

  • Yeah, Ian, that vendor's database is a perfect example of how NOT to use MUCK tables.

    Some people would agree with you that "For customer-defined fields - MUCK is probably the way to go." But I ask: if you need users to be able to create new fields, then why not create new tables for them instead?

    My point being that you could provide the same functionality to the user with a MUCK table OR with a dynamically generated table for each PK/FK reference that would otherwise have been stored in the MUCK. But with dynamically generated tables you could actually enforce referential integrity. You could also supply the user with information about their data model as well as their data, which is important.
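    The referential-integrity argument can be made concrete with a small sketch. This is an illustrative toy (table names and the helper function are invented for this example, not from the thread) showing that a generated real table gets engine-enforced foreign keys, which a MUCK row cannot have:

```python
import sqlite3

# Sketch of generating a real table per user-defined lookup so the
# DBMS enforces referential integrity; all names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY)")

def create_user_lookup(conn, name):
    # In a real system the name must be validated and quoted carefully,
    # since DDL built from user input is an injection risk.
    if not name.isidentifier():
        raise ValueError("unsafe table name")
    conn.execute(f"""
        CREATE TABLE {name} (
            CustomerID INTEGER NOT NULL REFERENCES Customer(CustomerID),
            Value      TEXT NOT NULL
        )""")

create_user_lookup(conn, "CustomerRegion")
conn.execute("INSERT INTO Customer VALUES (7)")
conn.execute("INSERT INTO CustomerRegion VALUES (7, 'APAC')")

# The engine rejects an orphan row -- the integrity a MUCK table loses:
try:
    conn.execute("INSERT INTO CustomerRegion VALUES (99, 'EMEA')")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
print(fk_enforced)  # True
```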

    There are many good reasons for doing it this way; the only reason not to do it is that it's more complicated and takes more development work.

    Just a thought.

    cl

    Hey, Post 500! Sweet!

    Signature is NULL

  • The simple answer to your question, Calvin, is that tables are far less amenable to change than data. In most cases, that lack of flexibility works in your favor, but there are times when it does not.

    Let's suppose for a moment that you needed to allow 20 customers of your product an area where they can store data, with each customer wanting to store a dozen or two different categories of data. The **only** use of the data would be for tracking and to appear on reports in a simple grid. This area is really a categorized bucket of crap that only has meaning to each customer. If for some reason a particular piece of data needed to be used in any way beyond a simple dump in a report, it would be taken out of the bucket and integrated into the rest of the formal design.

    In order to do this with tables, you would have to ensure that you *never* refer to a column by name, lest it change, which it has to be assumed to do often. Getting all the columns would require either a SELECT * type query or a query against the INFORMATION_SCHEMA views, as you would have to assume that you do not know all the columns at runtime. Making a change to the table would require a table lock instead of a simple row lock, which could be problematic if the table is hit often or if the table structure is protected. (Of course, if there is that much data in the thing, the use of an EAV table should probably be re-evaluated.)
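    The discover-columns-at-runtime point can be sketched as follows. In SQL Server this would be a query against the INFORMATION_SCHEMA.COLUMNS view; the sketch below uses sqlite3 (for a self-contained, runnable example) where the equivalent catalog lookup is PRAGMA table_info. Table and column names are invented:

```python
import sqlite3

# If the table's shape is user-mutable, code must read the column list
# from the catalog at runtime instead of hard-coding names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CustomerExtra (CustomerID INTEGER, Widget TEXT)")

# Later, someone alters the table in the field:
conn.execute("ALTER TABLE CustomerExtra ADD COLUMN Gadget TEXT")

# Code that survives the change discovers the columns dynamically
# (row[1] is the column name in PRAGMA table_info output):
cols = [row[1] for row in conn.execute("PRAGMA table_info(CustomerExtra)")]
print(cols)  # ['CustomerID', 'Widget', 'Gadget']
```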

    No question that dynamic table assembly is another possible solution, but I have found that it was more costly in development time to use dynamic tables than the value of the data being stored. If the data was that important and needed structure, then it shouldn't be in an EAV bucket in the first place.

  • Thomas, I certainly can't disagree that it is much harder. Yes, there are problems when you let users modify schema; and there are definitely at least a couple of instances when name/value pairs can be stored in heaps (like the example you mentioned).

    If your application is throw away, or the data is throw away, or there's nothing you want to do except store name/value pairs, and if you don't care about data integrity, then the MUCK approach works just fine. I hate applications like that, though; they grow up to develop bad tempers.

    If you want an application that is data flexible and still maintains referential integrity, then you have to crack the dynamic schema nut for both tables and columns. This is required if you're building enterprise-level applications where the user needs to be able to extend the data structure. Creating a new column and creating a new table are both schema modifications, after all; just one is more complicated than the other.

    I like your comment on how, if user-defined data found a specific use, it would be "integrated into the rest of the formal design." This feedback loop between the developer and the user is key to any good application; evolve or die! Eventually this loop will be automatically closed, which will bring its own set of challenges.

    This article got more posts than any since "No DBAs allowed access to Production DB Servers", so obviously there are a lot of different opinions out there. I've been exploring the wonderful world of dynamic schema metadata for a while, so I find this encouraging.

    cl

    Signature is NULL

  • I hate to re-open this issue, but want to make sure I understand the trade-offs exactly.

     

    I intend to use a unique number for all CodeIDs whether they are in multiple tables or a MUCK table. One “master” CodeID might be used for multiple descriptions.

     

    For example:

     

    CodeID    MasterCodeID    Description
    5         5               abnormal chest x-ray
    6         5               chest x-ray, abnormal
    7         7               chest x-ray, calcified granuloma
    8         7               granuloma, calcified, pulmonary

     

    The multiple listings are required so the user can find what they're looking for any number of ways. The MasterCodeID is the one committed to the database. I don't care which of the list selections the user picked, only what the MasterCodeID was for his selection.
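    The scheme above can be sketched directly from the example table: many synonym rows resolve to one master code, and only the master is committed. This is a minimal illustration using the poster's own sample data:

```python
import sqlite3

# Sketch of the CodeID / MasterCodeID synonym scheme described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Code (
    CodeID       INTEGER PRIMARY KEY,
    MasterCodeID INTEGER NOT NULL REFERENCES Code(CodeID),
    Description  TEXT NOT NULL
)""")
conn.executemany("INSERT INTO Code VALUES (?, ?, ?)", [
    (5, 5, "abnormal chest x-ray"),
    (6, 5, "chest x-ray, abnormal"),
    (7, 7, "chest x-ray, calcified granuloma"),
    (8, 7, "granuloma, calcified, pulmonary"),
])

def resolve_master(conn, code_id):
    # Whichever synonym the user picked, commit only its master code.
    return conn.execute(
        "SELECT MasterCodeID FROM Code WHERE CodeID = ?", (code_id,)
    ).fetchone()[0]

print(resolve_master(conn, 6))  # 5 -- both x-ray synonyms map to 5
print(resolve_master(conn, 8))  # 7 -- both granuloma synonyms map to 7
```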

     

    The other issue is reuse of the same code in multiple lists.

     

    Example: The following list may be used for both medications and allergies:

     

    Penicillin

    Bactrim

    Tylenol

    Aspirin

     

    But there are some drugs on only one of those lists. It seems important to me that Penicillin is coded consistently (I have already waded out of some messes when that wasn't the case). To meet these requirements, a table holding all drugs seems most useful, with selected rows used in various pick-lists. This is essentially a MUCK table, but it makes the most sense to me.

     

    Separate tables would mean redundant data, and I'd have to be careful that the coding was consistent. In the complexity of medicine, there are numerous instances of this type of thing. Of highest importance to me is that coding is consistent, whether in one table or 15. I'd appreciate any input here.
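    One common way to get consistent coding without duplicating rows (an assumption on my part, not something proposed in the thread) is a single drug table plus a membership table naming which pick-lists each drug appears on. A hypothetical sketch:

```python
import sqlite3

# Sketch: one Drug table gives each drug exactly one code; a membership
# table says which pick-lists show it. All names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Drug (
    DrugID INTEGER PRIMARY KEY,
    Name   TEXT NOT NULL UNIQUE
);
CREATE TABLE DrugListMember (
    ListName TEXT NOT NULL,            -- 'Medication' or 'Allergy'
    DrugID   INTEGER NOT NULL REFERENCES Drug(DrugID),
    PRIMARY KEY (ListName, DrugID)
);
""")
conn.executemany("INSERT INTO Drug VALUES (?, ?)",
                 [(1, "Penicillin"), (2, "Bactrim"), (3, "Tylenol")])
conn.executemany("INSERT INTO DrugListMember VALUES (?, ?)", [
    ("Medication", 1), ("Medication", 3),
    ("Allergy",    1), ("Allergy",    2),
])

# Penicillin is coded once (DrugID 1) no matter which list displays it:
allergy_list = [n for (n,) in conn.execute("""
    SELECT d.Name FROM DrugListMember m
    JOIN Drug d ON d.DrugID = m.DrugID
    WHERE m.ListName = 'Allergy' ORDER BY d.Name
""")]
print(allergy_list)  # ['Bactrim', 'Penicillin']
```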

  • The definition of Bucket of Crap: a file; more formally, Unstructured Data.  The main point being that it has no place inside a "so called" Relational Database system.

    My recommendation: find a different solution or tool.

    It's not mine, but I love it nevertheless: "to a hammer, everything looks like a nail."  Unfortunately, one size does not fit all, and you cannot use an RDBMS to store everything, especially if you don't know what it is.

    Sincerely,

    Anthony Thomas

     
