May 26, 2010 at 7:56 am
Nice (lengthy) article with great examples!
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
May 26, 2010 at 8:13 am
Sometimes people make poor data type choices or implement redundant columns due to misguided notions about best practices. For example, I was once hired as a contractor to develop reports for a data warehousing project. The first thing that really jumped out at me was that every integer column in the fact tables was defined as BigInt.
[sex] bigint not null,
[marital_status] bigint not null,
[center_id] bigint not null,
-- ... about 30 other bigint indicator or foreign key columns ...
Rather than 8-byte BigInts, a 1-byte TinyInt or 2-byte SmallInt would have been more than enough for most of these columns. When I asked the DBA / data modeler why he had chosen BigInt for all these columns, his reply was that they were using the 64-bit edition of SQL Server Enterprise and therefore 64-bit integers would perform better. He could offer no performance test results, or even a link to a single white paper, to back that assertion. However, he did have bragging rights for owning the largest data warehouse in the agency's history.
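For illustration, a minimal sketch of those columns right-sized (the table name is invented, and the savings estimate assumes roughly 30 such columns):

-- Right-sized versions of the columns above:
-- tinyint = 1 byte (0-255), smallint = 2 bytes, bigint = 8 bytes.
CREATE TABLE dbo.fact_example
(
    [sex]            tinyint  NOT NULL,  -- 1 byte instead of 8
    [marital_status] tinyint  NOT NULL,
    [center_id]      smallint NOT NULL   -- 2 bytes instead of 8
    -- ... remaining indicator / foreign key columns sized to their actual domains
);
-- Across ~30 columns that is roughly 180-200 bytes saved per row, multiplied
-- across every row, every index carrying those columns, and the buffer pool.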
Another example is creating a clustered index on something like LastName, FirstName, PhoneNumber. This not only leads to page splits, fragmentation, and bloated indexes, but also excessive I/O and blocking.
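To make that concrete, a hypothetical sketch (table and index names are invented, and the two statements are alternatives, since a table can have only one clustered index). The key point is that the clustered key is duplicated into every nonclustered index row:

-- Wide, non-sequential clustered key: inserts land mid-index, forcing page
-- splits, and the whole key is copied into every nonclustered index row.
CREATE CLUSTERED INDEX CIX_Customer_Wide
    ON dbo.Customer (LastName, FirstName, PhoneNumber);

-- Narrow, ever-increasing alternative: new rows append at the end of the index.
CREATE CLUSTERED INDEX CIX_Customer_Narrow
    ON dbo.Customer (CustomerID);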
Other things I've seen are GUIDs used as a unique surrogate key when a 4-byte integer would have been more appropriate, or even a GUID column, which seems to serve no purpose, in addition to an integer surrogate key.
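A side-by-side sketch of the two choices (hypothetical tables):

-- 16-byte key, and NEWID() values arrive in random order, so inserts
-- scatter across the clustered index and fragment it.
CREATE TABLE dbo.order_guid
(
    order_id uniqueidentifier NOT NULL DEFAULT NEWID() PRIMARY KEY
);

-- 4-byte ascending key: new rows always append at the logical end.
CREATE TABLE dbo.order_int
(
    order_id int IDENTITY(1,1) NOT NULL PRIMARY KEY
);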
Perhaps my biggest data modeling pet peeve is when date/time values are stored in a VarChar column. Not only does this consume 20+ bytes, compared to only 4 bytes for SmallDateTime, but oftentimes the values are sporadically miscoded with invalid dates, or multiple applications code the dates in multiple formats, resulting in data type mismatch errors and convoluted case expressions in the SQL statements to decode the data. The reason typically provided for this is that different applications need to display the date in different formats. Of course, this is better achieved by storing dates as Date in the database and using a formatting mask on the application form controls.
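A sketch of the cleanup this forces (table and column names are hypothetical, and ISDATE depends on session language settings, so treat it as a rough filter only):

-- Find the sporadically miscoded values in a varchar "date" column.
SELECT order_date_text
FROM dbo.orders
WHERE ISDATE(order_date_text) = 0;

-- Store the value as a real date instead...
ALTER TABLE dbo.orders ADD order_date date NULL;

-- ...and let each application format it at display time, for example:
SELECT CONVERT(varchar(10), order_date, 103) AS uk_format,  -- dd/mm/yyyy
       CONVERT(varchar(10), order_date, 101) AS us_format   -- mm/dd/yyyy
FROM dbo.orders;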
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
May 26, 2010 at 8:23 am
Sometime last year, in one of the LinkedIn groups I'm in, there was a huge religious battle between GUID PK supporters and the rest of us. After a while people got tired of it, because the GUID side liked GUIDs just because.
MS is guilty of some of this stuff as well. This year we set up perfmon to run and collect data from numerous servers and log to a database. The schema is horrendous, and I had to do a lot of casting and converting to be able to read the data.
May 26, 2010 at 9:01 am
David.Poole (5/26/2010)
What am I doing wrong, Paul? I don't seem to get any space back (in SQL 2005) other than what I would get by defragging the index.
What are you using to measure the space you get back, David?
Do you see any space being reclaimed? Does the large table shrink to the size of the small table?
If you can give me enough detail so I can replicate what you are seeing, I'll happily investigate! 🙂
When you say 'defragging the index', do you mean you are running ALTER INDEX REORGANIZE / DBCC INDEXDEFRAG?
The article only mentioned a full rebuild of the index, so I concentrated on that (the 'defrag' routines work a bit differently).
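For reference, the two operations being compared (index and table names here are placeholders):

-- Full rebuild: builds a fresh copy of the index, then deallocates the old one.
ALTER INDEX IX_Example ON dbo.BigTable REBUILD;

-- Reorganize: compacts and shuffles leaf-level pages in place; this is the
-- modern equivalent of DBCC INDEXDEFRAG, and it releases space differently.
ALTER INDEX IX_Example ON dbo.BigTable REORGANIZE;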
Anyhow, if you have a few minutes to confirm the details, that'll help a lot.
Paul
May 26, 2010 at 9:18 am
See if executing a CHECKPOINT updates your allocation stats.
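For example (sp_spaceused against whichever table you're measuring; the name below is a placeholder):

CHECKPOINT;  -- flush dirty pages and persist allocation metadata
EXEC sp_spaceused N'dbo.BigTable';  -- then re-check reserved/used space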
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
May 26, 2010 at 9:20 am
Excellent series of articles. Sums up a LOT of what I have seen and been yelling at people about for years...
May 26, 2010 at 9:25 am
When you use an ORM tool, it effectively hides the schema, so you can get away with calling the fields almost anything.
When you are reading down a large block of SQL, it is easy to grab the wrong ID, name, description, etc., particularly in the FROM clause.
The fundamental point is that the name of something should be clear in its intent and meaning. Aliasing deals with it to some extent, but underneath it all it becomes confusing as to whether ID means CustomerID, ProductID, OrderID, SalesmanID, RegionID, etc.
Mind you, even ID is preferable to the DB I've inherited that has a primary key called "NoUseWhatSoEver".
May 26, 2010 at 9:45 am
We can debate [customer_id] versus [CustomerNumber] versus [Cust_Num], but at a minimum the foreign key names should match the primary key names, and to do that, the primary key names must be something unique and preferably descriptive, not just [ID]. To keep things simple, I just name the identifier column after the table name + _id. So the identifier for the [customer] table would be [customer_id], and for the [customer_contact] table, it would be [customer_contact_id].
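A quick sketch of that convention (hypothetical tables), which lets every foreign key carry the exact name of the primary key it references:

CREATE TABLE dbo.customer
(
    customer_id   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    customer_name varchar(100)      NOT NULL
);

CREATE TABLE dbo.customer_contact
(
    customer_contact_id int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    customer_id         int NOT NULL  -- same name as the key it references
        REFERENCES dbo.customer (customer_id),
    contact_name        varchar(100) NOT NULL
);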
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
May 26, 2010 at 10:33 am
I use the same naming convention as Russell for ID fields, and for much the same reasons, it sounds like. From a maintenance standpoint, it's much easier for the average human to understand that customer_id in the customer table and customer_id in the order table refer to the same value and are related to each other. Especially if you're looking at the table, and not a query!
My personal pet peeve was when a contractor converted one of our databases, and renamed every single primary key field in every table from my carefully named version to "id". So none of my older queries worked - I had to modify every single one of them. And the queries THEY wrote were "less than understandable", aliases or not.
I'm a fan of naming conventions. Loved all the examples of time bombs; I've experienced a few of them, although most of our databases aren't large enough to show some of the symptoms (no terabytes here).
Looking forward to David's next article!
Steph Brown
May 26, 2010 at 11:23 am
Actually, the author's guess that inflation will not exceed 3 digits is rather similar to what 20th-century developers did by using 2 digits for the year, guessing that their apps would not live to see Y2K. In the inflation scenario, what is ignored is the possibility of hyperinflation: a small chance, but a chance nonetheless. We wouldn't be the first, and the history of just the past 100 years shows several precedents. It's just an example of guessing about maximum data size when designing the data model. BTW, in the half-page/full-page example, just use varchar and be done with it, while taking up minimal storage space at the same time.
May 26, 2010 at 11:45 am
lastory2006 (5/26/2010)
BTW, in the half-page/full-page example, just use varchar and be done with it, while taking up minimal storage space at the same time.
The point of the char was to demonstrate what happens when people are sloppy with inappropriate data types, i.e. 2 bytes can make a lot of difference.
Anyone who uses CHAR(4040) should be locked up until they can explain the plot of "Lost" to my mum.
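The 2-byte cliff in question, sketched with hypothetical tables (an 8KB data page holds roughly 8,060 bytes of row data, so the sizes below sit just either side of the two-rows-per-page boundary):

-- Two rows of char(4000) fit together on one 8KB data page...
CREATE TABLE dbo.two_per_page (filler char(4000) NOT NULL);

-- ...but char(4040) rows no longer pair up, so each row gets its own page,
-- nearly doubling the pages (and I/O) needed for the same row count.
CREATE TABLE dbo.one_per_page (filler char(4040) NOT NULL);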
May 26, 2010 at 11:56 am
Nice article. Thanks.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
May 26, 2010 at 12:13 pm
Agreed! :))
May 26, 2010 at 1:41 pm
Outstanding article. Thanks.
Lots of good practical ideas, well justified by example and logic.
I am going to file this one away in the Best Practices folder and refer back to it often.
May 26, 2010 at 2:03 pm
eric_russell 13013 (5/26/2010)
Perhaps my biggest data modeling pet peeve is when date/time values are stored in a VarChar column. Not only does this consume 20+ bytes, compared to only 4 bytes for SmallDateTime, but oftentimes the values are sporadically miscoded with invalid dates, or multiple applications code the dates in multiple formats, resulting in data type mismatch errors and convoluted case expressions in the SQL statements to decode the data. The reason typically provided for this is that different applications need to display the date in different formats. Of course, this is better achieved by storing dates as Date in the database and using a formatting mask on the application form controls.
There is a valid reason to use varchar to store dates, though it doesn't come up often, especially in businesses in America. Before SQL Server 2008, the earliest date you could store in a datetime column was January 1, 1753. If you were creating a database that included dates older than that, your only recourse was a varchar column. I've created databases for historical events and fictional events that couldn't use a datetime column for this very reason. And there are businesses in other parts of the world that have accounts dating back before 1753, so it could theoretically come up there too.
With 2008's date and datetime2 types, you can store any date from 1 AD through December 31, 9999. Even most SF won't give you a problem there. I suppose for BC dates you can add a flag to avoid needing to store them in varchar.
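The ranges for reference (datetime is the pre-2008 limit described above; date and datetime2 arrived in SQL Server 2008):

-- Pre-2008: datetime bottoms out at 1753-01-01.
DECLARE @old datetime  = '17530101';

-- 2008+: date and datetime2 cover 0001-01-01 through 9999-12-31.
DECLARE @d  date      = '00010101';
DECLARE @d2 datetime2 = '99991231 23:59:59.9999999';

SELECT @old AS datetime_min, @d AS date_min, @d2 AS datetime2_max;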
--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams