June 25, 2007 at 6:42 am
Our development team is entering a big project. Among other requirements, it must be possible to add entities and attributes easily, without any maintenance changes to the database. The idea is that a system administrator simply changes the metadata via a system UI, and the new attribute immediately appears in all the UIs. The system has data definition tables storing entities and their attributes (we assume entities would have no more than 50 attributes). Assume we have these tables for the metadata:
1. Data Type definitions (same as SQL Server types in this sample)
CREATE TABLE DataType (
DataTypeId [int] NOT NULL, -- PK
[Name] [nvarchar](255) NOT NULL
)
Sample row values for DataType:
DataTypeId Name
1 int
2 varchar
3 datetime
4 bit
5 money
2. Entities (this one certainly will have more columns)
CREATE TABLE Entity (
EntityId [int] NOT NULL, -- PK
[Name] [nvarchar](255) NOT NULL
)
Sample row values for Entity:
EntityId Name
1 Bank
2 Customer
3. Attributes (virtually those are columns for entities)
CREATE TABLE Attribute (
AttributeId [int] NOT NULL, -- PK
EntityId [int] NOT NULL, -- FK to Entity
[Name] [nvarchar](255) NOT NULL,
DataTypeId [int] NOT NULL, -- FK to DataType
Length [int] NULL
)
Sample row values of Attribute:
AttributeId EntityId Name DataTypeId Length
1 1 Name 2 100
2 1 Address 2 100
3 1 Status 1
4 2 FirstName 2 20
5 2 LastName 2 20
6 2 DateOfBirth 3
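Given this metadata, listing the columns of an entity is a simple join. A minimal sketch (this query is just an illustration, not part of the system design):
-- List the attributes (i.e. the prospective columns) of the Bank entity:
SELECT e.[Name] AS EntityName, a.[Name] AS AttributeName, dt.[Name] AS DataType, a.Length
FROM Entity e
INNER JOIN Attribute a ON a.EntityId = e.EntityId
INNER JOIN DataType dt ON dt.DataTypeId = a.DataTypeId
WHERE e.[Name] = N'Bank'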
So far so good. But further we see the two possible approaches:
1. Flat table model. Create standard flat tables based on the metadata. That means that the system would create these tables automatically:
CREATE TABLE BankFlat (
BankId [int] NOT NULL, -- PK
Name [varchar](100) NULL,
Address [varchar](100) NULL,
Status [int] NULL
)
CREATE TABLE CustomerFlat (
CustomerId [int] NOT NULL, -- PK
FirstName [varchar](20) NULL,
LastName [varchar](20) NULL,
DateOfBirth [datetime] NULL
)
So when the system administrator adds another attribute to the Bank entity, the system would execute the corresponding DDL against that entity's table.
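For example, a minimal sketch of what the system might generate (the CreditLimit attribute, its AttributeId and the exact generated statements are hypothetical):
-- The administrator adds a "CreditLimit" attribute (DataTypeId = 5, money) to Bank:
INSERT INTO Attribute (AttributeId, EntityId, [Name], DataTypeId, Length)
VALUES (7, 1, N'CreditLimit', 5, NULL)
-- ...and the system applies the matching DDL to the flat table:
ALTER TABLE BankFlat ADD CreditLimit MONEY NULL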
Pros:
- It's a natural way to store entities and attributes as tables and columns
Contras:
- DDL to support the table changes may end up being very complicated
- DDL on large tables will take too much time and block other users
- In a distributed environment with replication subscribers table structure changes should be avoided
2. Detached attribute storage model. Create tables with permanent structure as entity containers this way:
CREATE TABLE Bank (
BankId [int] NOT NULL -- PK
)
CREATE TABLE BankAttributeVarchar (
RowId [int] NOT NULL, -- PK
BankId [int] NOT NULL,
AttributeId [int] NOT NULL,
Value [varchar](max) NULL -- index on this column would be created
)
CREATE TABLE BankAttributeInt (
RowId [int] NOT NULL, -- PK
BankId [int] NOT NULL,
AttributeId [int] NOT NULL,
Value [int] NULL -- index on this column would be created
)
So the idea is to create a simple master table per entity plus typed detail tables that store the entity's attribute values. The schema stays stable (only new tables are created; existing tables never change). In the sample above, for Bank the attributes Name and Address are stored in BankAttributeVarchar and Status is stored in BankAttributeInt.
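For illustration, storing a single Bank row under this model means one master insert plus one insert per attribute value into the matching typed table (a sketch; the RowId values and the literal data are made up):
INSERT INTO Bank (BankId) VALUES (1)
-- Name (AttributeId = 1) and Address (AttributeId = 2) go to the varchar table:
INSERT INTO BankAttributeVarchar (RowId, BankId, AttributeId, Value) VALUES (1, 1, 1, 'First National')
INSERT INTO BankAttributeVarchar (RowId, BankId, AttributeId, Value) VALUES (2, 1, 2, '1 Main Street')
-- Status (AttributeId = 3) goes to the int table:
INSERT INTO BankAttributeInt (RowId, BankId, AttributeId, Value) VALUES (1, 1, 3, 1)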
It's clear that to get complete data rows we need views like:
CREATE VIEW BankView AS
SELECT
Bank.BankId,
Name = Attr1.Value,
Address = Attr2.Value,
Status = Attr3.Value
FROM Bank
INNER JOIN BankAttributeVarchar Attr1 ON (Attr1.BankId = Bank.BankId AND Attr1.AttributeId = 1)
INNER JOIN BankAttributeVarchar Attr2 ON (Attr2.BankId = Bank.BankId AND Attr2.AttributeId = 2)
INNER JOIN BankAttributeInt Attr3 ON (Attr3.BankId = Bank.BankId AND Attr3.AttributeId = 3)
Pros:
- The schema remains unchanged and replication subscribers won't need schema replication
- No heavy DDL scripts when adding new attributes to entities
Contras:
- It's very likely that this unusual data model will lead to performance problems at large volumes, both for inserting and for selecting data
We developed some tests emulating an entity with 50 attributes of different types for both models above (note that BankView has 50 joins in this case!) and generated about 100 000 rows. We created indexes on (BankId, AttributeId) in all BankAttributeXxx tables to improve JOIN performance. We tested these two simple queries:
1)
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SELECT * FROM BankFlat
WHERE BankId = 12781
2)
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SELECT * FROM BankView
WHERE BankId = 12781
The results are:
1) Usually takes <1 second to return the row
2) Usually takes 9-16 seconds to return the row
That means that on this simple selection of a single record by its PK value, the second model performs more than 10x worse. Of course, if we don't clear the cache and buffers it's much faster on subsequent runs. So these are the questions we have to consider:
1) Did we try all means of optimization for the detached attribute storage model? Can it be optimized with another set of indexes or hints? (Our current indexes are sketched right after these questions.)
2) Is there any chance that the detached attribute storage is more efficient for value-based searches?
3) What is likely to happen with millions of records per table?
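For reference, the indexes we created for the tests look roughly like this (index names are made up for this post; there is one such pair per BankAttributeXxx table). One option we have not tried yet for question 1 is making them covering with INCLUDE(Value):
-- Existing test indexes:
CREATE INDEX IX_BankAttributeInt_BankAttr ON BankAttributeInt (BankId, AttributeId)
CREATE INDEX IX_BankAttributeInt_Value ON BankAttributeInt (Value)
-- A covering variant to try: Value rides along in the index leaf level, so the view's
-- joins could be answered from the index alone. Note that varchar(max) cannot be an
-- index key column, only an INCLUDEd column:
CREATE INDEX IX_BankAttributeVarchar_Covering ON BankAttributeVarchar (BankId, AttributeId) INCLUDE (Value)
CREATE INDEX IX_BankAttributeInt_Covering ON BankAttributeInt (BankId, AttributeId) INCLUDE (Value)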
There are some political issues around choosing one of the approaches. I think the detached attribute storage model will be a huge failure performance-wise, but some believe it could be compensated for with more powerful hardware. I don't believe a 10x gap (and I expect it to grow much worse at millions of records) can be compensated for by hardware. The flat model has only the two weak points I listed in its contras (slow DDL blocking users, and schema replication). I read that SQL Server 2005 supports schema change replication, but is it reliable? I have read posts from people who had trouble with it.
I would appreciate any expert advice. Thank you.
June 25, 2007 at 8:12 am
If this is for a bank and I had an account with it I'd probably want to move to another bank!! Hope this isn't a UK bank - although I'm not sure we have many left now.
[font="Comic Sans MS"]The GrumpyOldDBA[/font]
www.grumpyolddba.co.uk
http://sqlblogcasts.com/blogs/grumpyolddba/
June 25, 2007 at 8:33 am
It's not for a bank, it's for a retail company.
BTW, after we moved the testing environment to a more powerful server, the time for the detached attribute model queries dropped from 9-16 seconds to 3-4 seconds. Is there any way to improve performance further? We thought about indexed views but failed to create them because of the "self-joins" in the view, and there is no way to avoid self-joins here...
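Roughly what we attempted, abridged to two attributes (the view and index names here are made up); the CREATE INDEX step is where it fails, since indexed views cannot contain self-joins:
CREATE VIEW dbo.BankViewIndexed WITH SCHEMABINDING AS
SELECT b.BankId,
Attr1.Value AS Name,
Attr2.Value AS Address
FROM dbo.Bank b
INNER JOIN dbo.BankAttributeVarchar Attr1 ON Attr1.BankId = b.BankId AND Attr1.AttributeId = 1
INNER JOIN dbo.BankAttributeVarchar Attr2 ON Attr2.BankId = b.BankId AND Attr2.AttributeId = 2
GO
-- This is the statement SQL Server rejects because the view self-joins BankAttributeVarchar:
CREATE UNIQUE CLUSTERED INDEX IX_BankViewIndexed ON dbo.BankViewIndexed (BankId)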
June 25, 2007 at 9:26 am
I must be missing something but it seems that having to manually adjust the DDL of the views (which would be needed EVERY time an attribute was added) defeats the original purpose of not having to alter the database to add attributes.
If someone deletes an attribute then you are going to break a lot of views, and render most applications (any that rely on it anyway) useless.
Just my 2 cents, but sticking with tried-and-true database design methods would be easier in the long run. I think you will run into more headaches than I can even think of at this time. Just writing the data insert/update statements would be a nightmare.
Does your database design fluctuate so much and so often that something this radical is needed? I would stick with stored procedures to shield any application from underlying changes, but make the changes as needed.
Sorry I really wasn't more helpful.
James.
June 25, 2007 at 9:52 am
You might also want to consider a "hybrid" design. Build your tables with all the attributes (elements) that you know and understand now. Then provide an extra column of datatype XML and store additional new attributes there. You could keep a table (or a list in the application) of what elements are in the XML column, and format your queries to extract that data as columns (if needed). This way, you can develop your "Version 1.0" application as a standard, traditional client/server system with normal columns, and still have the flexibility for the user community to add attributes later.
If you adopt your 2nd approach, where all elements are to be defined, you are handicapping probably 90% of your work and queries to try to handle a small percentage of future needs. Let's break the data elements your system will have into two categories: the basic data is what you know you need now and can easily define; the enhanced data is what you don't know yet, or is not necessary for the basic functionality of your system (your variable elements). By putting the basic elements into your flexible 2nd option, you would be handicapping your basic system performance. By defining the basic data as you normally would, in tables with distinct columns and datatypes, your basic system will perform well. And you still have the flexibility of adding the new enhanced elements to the XML column and querying them there.
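A minimal sketch of what I mean (the table, column, and element names here are purely for illustration):
-- Known attributes stay as normal columns; ExtraData holds administrator-added ones.
CREATE TABLE Customer (
CustomerId INT NOT NULL PRIMARY KEY,
FirstName VARCHAR(20) NULL,
LastName VARCHAR(20) NULL,
DateOfBirth DATETIME NULL,
ExtraData XML NULL
)
-- A later, user-defined attribute such as "LoyaltyLevel" lives inside the XML:
UPDATE Customer
SET ExtraData = N'<attrs><LoyaltyLevel>Gold</LoyaltyLevel></attrs>'
WHERE CustomerId = 1
-- ...and can be pulled back out as a column when needed:
SELECT CustomerId,
ExtraData.value('(/attrs/LoyaltyLevel)[1]', 'varchar(20)') AS LoyaltyLevel
FROM Customer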
Hope this helps
Mark
June 25, 2007 at 9:55 am
Thank you. The view DDL is supposed to be generated automatically from the metadata too. Deleting columns would be disallowed, since they may still be used by applications of prior versions for some time; instead of being deleted, columns will be marked stale.
It's likely that the master database will have 10+ replication subscribers, and it's difficult to deploy a new schema to all subscribers simultaneously. If a new schema is deployed at the master database but not at the subscribers, the subscribers will lose the new column values during replication. Some tables with new attributes may have millions of records, so a full table synchronization at deployment would be a very heavy operation.
June 25, 2007 at 10:48 am
Thank you. This sounds like a very good approach. Let's think.
Pros:
- Normal way of table design
- Generally better performance
Contras:
- We need an XML column in every table (this may be considered overhead; if we used a TEXT column it would be 16 bytes)
- Searches on XML attributes may be less efficient (I know XML columns can be indexed, though; a rough sketch is below)
But if at some point during development we find XML columns inefficient, we can switch to the typed detail tables to store attributes.
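The XML indexing I mentioned would look roughly like this (a sketch that borrows the ExtraData column from the hybrid example above; the index names are made up):
-- A primary XML index is required first; secondary XML indexes can then target
-- specific access patterns (PATH is the usual choice for .value()/.exist() queries).
CREATE PRIMARY XML INDEX PXML_Customer_ExtraData ON Customer (ExtraData)
CREATE XML INDEX IXML_Customer_ExtraData_Path ON Customer (ExtraData)
USING XML INDEX PXML_Customer_ExtraData FOR PATH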
June 26, 2007 at 7:48 am
1) I am at a loss to make a comment on the overall architecture here!! Good luck with it . . . 🙂
2) As for XML being overhead in a table and TEXT only being 16 bytes: that 16 bytes is just a pointer to another location on disk where the actual text is stored. Per BOL: large object data types are text, ntext, image, varchar(max), nvarchar(max), varbinary(max), and xml. This at least implies that xml is stored similarly to text. I can't find my SQL 2005 internals book to verify this, however.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
June 26, 2007 at 7:58 am
1) As I noted, there are some political issues in this respect, and we have to prove conclusively that it will perform poorly. So far we may not have tried every means of improving performance 🙂
2) Sorry, I meant that an XML column is stored the same way as a TEXT column, which takes 16 bytes even for empty values. Not a big deal, of course.
June 26, 2007 at 8:12 am
Personally I think the worst thing from a performance standpoint is that if you routinely add columns to an existing table, it will fragment the hell out of the storage of that data. In my experience as a consultant, I have found that I/O is responsible for 80+% of performance problems. Anything you can do to eliminate/reduce/speed up I/O is good.
Another thing is how you will handle getting the new data back to the client or using it in data processing. Will you use SELECT * all over the place (again, poor practice), or modify/create new sprocs/code every time something new gets added/built?
I could go on . . .
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
June 26, 2007 at 8:20 am
So do you mean that our 2nd option will, by design, suffer from fragmentation that hurts performance?
June 26, 2007 at 8:38 am
...I meant that an XML column is stored the same way as a TEXT column, which takes 16 bytes even for empty values...
I do not think this is true anymore. I believe that in SQL Server 2005, large datatypes (text, ntext, image, nvarchar(max), varchar(max), varbinary(max), and xml) are stored in the row until the row size itself exceeds 8,060 bytes. When that happens, the SQL engine decides to move one or more columns to a row-overflow page and replaces the column data with a 24-byte pointer to the overflow (not 16 bytes). But if a large datatype column is empty, it will not be moved to overflow and will not have the pointer to it. So I believe that any large datatype with an empty value will not even have a pointer taking up any space.
Mark