October 5, 2012 at 2:24 pm
I'm in the process of designing some new tables. The decision has been made to always create a new row whenever the data for that table differs from every row that already exists. Currently the table is about 8 columns wide and contains personal identification information.
I'm concerned about building the indexes for this type of table. I'll have a search query with eight different predicates in the WHERE clause.
For example:
where firstname = @firstname and middlename = @middlename and lastname = @lastname and phonenumber = @phonenumber and zipcode = @zipcode ... etc.
First I'd like your thoughts on how to get the best performance in this type of situation. My first thought is to generate an MD5-type hash based on all of the columns and then search on that.
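Something like this is what I have in mind (a rough sketch, assuming SQL Server 2012+ for CONCAT; the table name Person is made up, and the remaining columns are elided as in the example above):

    alter table Person add RowHash as
        cast(hashbytes('MD5',
            concat(firstname, '|', middlename, '|', lastname, '|',
                   phonenumber, '|', zipcode)) as binary(16)) persisted;

    create index IX_Person_RowHash on Person (RowHash);

    -- seek on the 16-byte hash first, then re-check the actual
    -- columns because MD5 can collide
    select *
    from Person
    where RowHash = cast(hashbytes('MD5',
            concat(@firstname, '|', @middlename, '|', @lastname, '|',
                   @phonenumber, '|', @zipcode)) as binary(16))
      and firstname = @firstname and middlename = @middlename
      and lastname = @lastname and phonenumber = @phonenumber
      and zipcode = @zipcode;

CONCAT treats NULLs as empty strings, and the '|' delimiter keeps different column splits from producing the same concatenated value.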
October 5, 2012 at 2:48 pm
I'm curious why a search query would have so many criteria. Usually a search query is by a primary key or other unique identifier, by one to three columns, preferably indexed (firstname, lastname, city), or with a join to another table.
October 5, 2012 at 2:52 pm
The customer requirement is to maintain each and every combination of data exactly as entered. I'm sure there are different approaches to handle this type of requirement; however, we've been directed to maintain each unique combination in the manner described above. As a result we need an efficient method of determining whether the newly supplied data matches an existing row or needs to be created as a new one.
October 5, 2012 at 3:12 pm
The table you describe violates normalization rules and will be inefficient just by the nature of the design. It is my opinion that business/user requirements can state WHAT needs to be stored but should never dictate HOW something is stored in the database. That falls to the professional data modeler to implement correctly, in a way that provides the business/user with the data they require.
October 5, 2012 at 3:18 pm
The key to indexing and performance is going to be how the rows are searched in your WHERE clause.
If all eight values are absolutely required for every search, an index with all eight columns would be very selective.
If you know that most of the time users search on just lastname, or lastname + firstname, then indexes to cover those queries would be good.
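For example (just a sketch; the Person table name is assumed, and the trailing columns are left as a comment since the original post only named five of the eight):

    -- one wide, very selective index for the full eight-column search
    create index IX_Person_AllCols
        on Person (lastname, firstname, middlename, phonenumber, zipcode);
        -- ...plus the remaining columns from the table

    -- a narrower index; because lastname is the leading column, it
    -- serves both the lastname-only and lastname + firstname searches
    create index IX_Person_LastFirst
        on Person (lastname, firstname);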
You'll probably want the GUI to build your query with WHERE predicates only for the non-empty values (a sketch follows the link below); you want to avoid trying to create a catch-all query.
Check out the pitfalls in this article:
http://sqlinthewild.co.za/index.php/2009/03/19/catch-all-queries/
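For the non-empty-values approach, a parameterized dynamic query along these lines is one option (a sketch only; the parameter types are assumptions):

    declare @sql nvarchar(max) = N'select * from Person where 1 = 1';

    -- append a predicate only for the values the user actually supplied
    if @lastname is not null
        set @sql += N' and lastname = @lastname';
    if @firstname is not null
        set @sql += N' and firstname = @firstname';
    -- ...repeat for the other searchable columns...

    -- sp_executesql keeps the query parameterized; unused parameters
    -- may be declared and passed without being referenced
    exec sys.sp_executesql @sql,
        N'@lastname varchar(50), @firstname varchar(50)',
        @lastname = @lastname, @firstname = @firstname;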
Give us more details and we can offer more than general advice.
Lowell
October 6, 2012 at 1:38 am
You could create a unique index or constraint that has all eight columns to guarantee that you cannot have duplicates.
There can be a lot of overhead with such a wide index, so that could be a disadvantage.
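The constraint itself would look something like this (a sketch; note that SQL Server treats NULLs as equal for uniqueness, so only one row per NULL-containing combination is allowed):

    alter table MyTable
        add constraint UQ_MyTable_AllCols
        unique (Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8);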
In general, code like this should work well with smaller indexes that only have the more selective, non-nullable columns. If the columns are nullable, you will have to modify the query to allow for that (see the note after the code).
insert into MyTable
    ( Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8 )
select
    @Col1, @Col2, @Col3, @Col4, @Col5, @Col6, @Col7, @Col8
where
    not exists (
        select *
        from MyTable
        where
            Col1 = @Col1 and
            Col2 = @Col2 and
            Col3 = @Col3 and
            Col4 = @Col4 and
            Col5 = @Col5 and
            Col6 = @Col6 and
            Col7 = @Col7 and
            Col8 = @Col8 )
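If some of the columns allow NULLs, the straight equality tests above will never match a stored NULL. One way to make the existence check NULL-safe, sketched as a variant of the subquery above:

    not exists (
        select *
        from MyTable
        where
            -- each predicate matches either equal values or a pair of NULLs
            ( Col1 = @Col1 or ( Col1 is null and @Col1 is null ) ) and
            ( Col2 = @Col2 or ( Col2 is null and @Col2 is null ) ) and
            -- ...and so on through Col8
            ( Col8 = @Col8 or ( Col8 is null and @Col8 is null ) ) )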
October 6, 2012 at 10:04 am
thisted (10/5/2012)
I'm in the process of designing some new tables. The decision has been made to always create a new row whenever the data for that table differs from every row that already exists. Currently the table is about 8 columns wide and contains personal identification information.
I'm concerned about building the indexes for this type of table. I'll have a search query with eight different predicates in the WHERE clause.
For example:
where firstname = @firstname and middlename = @middlename and lastname = @lastname and phonenumber = @phonenumber and zipcode = @zipcode ... etc.
First I'd like your thoughts on how to get the best performance in this type of situation. My first thought is to generate an MD5-type hash based on all of the columns and then search on that.
This type of table is affectionately known as a "Slowly Changing Dimension", or "SCD" for short. It's a time-honored method for auditing all actions taken against the table in the same table instead of having a separate audit or history table (which has its own set of problems, as well).
One of the most effective "SCD" types of tables is the "Type 2 Slowly Changing Dimension". Please see the following URL for the different types of "SCDs" available.
http://en.wikipedia.org/wiki/Slowly_changing_dimension
One of the purposes of a Type 2 SCD is that you can actually determine what the status of all of the rows in a table was/is for any given point in time, and it's very simple and effective at doing so. Don't let talk of the table being "denormalized" or "inefficient" deter you, because the denormalization is necessary and it's far from inefficient if done properly.
To make a longer story much shorter, Type 2 SCDs are generally maintained by a well-written, high-performance trigger that end-dates the old row and creates a new row with a new start date and an "open" end date of either NULL or a very large date (I use 9999-12-30, which leaves one day of headroom for some critical criteria calculations). On my systems, I usually also add a quick check in the trigger to prevent any direct manipulation of the start and end date columns, both to make the table more tamper-proof and to ensure that the dates never overlap from row to row for any given unique set of data.
One of the keys to success and performance is how you index for the PK. Whatever you currently use for the PK needs to be modified so that the start-date column is the first column of the PK (and of the clustered index, BTW) and what you would normally have as the PK column(s) comes after that.
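A minimal sketch of that layout (all names here are illustrative, not from the original post, and the maintenance trigger is omitted):

    create table CustomerHistory
    (
        StartDate   datetime    not null,
        CustomerID  int         not null,
        -- ...the eight attribute columns go here...
        EndDate     datetime    not null default '9999-12-30',
        -- start date leads the clustered PK, as described above
        constraint PK_CustomerHistory
            primary key clustered ( StartDate, CustomerID )
    );

    -- the row in effect for a customer at a given point in time
    select *
    from CustomerHistory
    where CustomerID = @CustomerID
      and @AsOf >= StartDate
      and @AsOf <  EndDate;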
--Jeff Moden
Change is inevitable... Change for the better is not.