June 9, 2009 at 6:48 pm
I have a puzzling experience I would dearly like to understand.
Theoretically, storing a value in CHAR(35) instead of VARCHAR(35) when the data is only 6 bytes long should mean a wastage of 29 bytes per record (a minimum of 27 bytes, if the 2 bytes VARCHAR uses for the length are counted against it).
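As a quick sanity check of that expectation, here is a minimal sketch (the value '123456' is just a stand-in for the 6-character identifier) showing how DATALENGTH reports storage for the two types:
-- Illustrative only: CHAR pads to the declared width, VARCHAR stores the actual data
DECLARE @fixed char(35), @varied varchar(35);
SET @fixed  = '123456';
SET @varied = '123456';
SELECT DATALENGTH(@fixed)  AS char_bytes,     -- 35: padded with trailing spaces
       DATALENGTH(@varied) AS varchar_bytes;  -- 6: plus a 2-byte length entry in the row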
My initial experiment was to refactor an existing table containing 8,766,652 rows where a 6-byte identifier (numeric at that!) was stored as CHAR(35). I refactored it in place to VARCHAR(35), expecting to save roughly 225 MB on the size of the table.
Instead, the size of the table's data space grew - from 951.250 MB to 1,347.594 MB - an increase of 396.344 MB.
The index space grew from 0.023 MB to 0.031 MB (the table is a heap - no key and no index). Further refactoring a TINYINT into an INT increased the data space to 1,367.836 MB - a further increase of 20.242 MB.
Running a TRUNCATE TABLE, repopulating from the ETL (a new dataset), and then reindexing the database improved the situation slightly:
8,767,909 rows - data space 1,133.156 MB, index space 0.023 MB.
Next, I repopulated the two versions of the table (both were truncated, then repopulated). Table A is the unrefactored one (CHAR(35) and TINYINT); Table B is the refactored one (VARCHAR(35) and INT).
Rows in both: 8,768,530; both are heaps - no index, no key.
Table A : (TrimTrailingBlanks=NO; Nullable=NO)
Data space: 951.453 MB; Index space: 0.039 MB
sp_spaceused:
reserved      data          index_size  unused
975,336 KB    974,288 KB    40 KB       1,008 KB
Table B : (TrimTrailingBlanks=NO; Nullable=NO)
Data space: 1,083.359 MB; Index space: 0.016 MB
sp_spaceused:
reserved      data          index_size  unused
1,109,520 KB  1,109,360 KB  16 KB       144 KB
Database: Collation SQL_Latin1_General_CP1_CI_AS
ANSI Padding - disabled
Can anyone explain how it is possible for the VARCHAR(35) version to use approximately 100 MB more space than the CHAR(35) version, instead of using less, when the data in the field is only 6 bytes?
(I calculate the extra ~100 MB as the ~132 MB difference minus ~32 MB for INT vs. TINYINT, the other refactored field that differs between the two versions.)
Unless there is something that escapes me utterly, this appears to contradict common sense/wisdom and Technet...
http://technet.microsoft.com/en-us/library/bb508963(SQL.90).aspx
TableB is populated with RTRIM(sourceFld) and there are no trailing blanks (I also did an RTRIM(varcharFld) for good measure).
Comparing the 2 tables after a week of truncate then repopulate:
Table A : (TrimTrailingBlanks=NO; Nullable=NO)
Rows: 8,779,273
Data space: 952.617 MB; Index space: 0.031 MB
sp_spaceused:
reserved      data          index_size  unused
975,968 KB    975,480 KB    32 KB       456 KB
Table B : (TrimTrailingBlanks=NO; Nullable=NO)
Rows: 8,779,324
Data space: 1,084.703 MB; Index space: 0.023 MB
sp_spaceused:
reserved        data            index_size  unused
1,111,320 KB    1,110,736 KB    24 KB       560 KB
June 16, 2009 at 8:17 am
Hi Ol'SureHand,
This is an intriguing problem indeed, and I can't say from just this description what the cause might be. So let me ask you to follow some troubleshooting steps.
First, in SSMS, choose the Tools / Options menu item, go to the Scripting options, and verify that "Script CHECK constraints", "Script defaults", "Script foreign keys", "Script indexes", "Script primary keys", and "Script unique keys" are all set to True. Then find both versions of the table in Object Explorer, right-click them, choose "Script table as ..." / "CREATE to ..." / "Clipboard", and paste the results in a reply to this message. In many cases I advise people to remove irrelevant columns, but in this case it is imperative that you change NOTHING (except table and column names, if that is required for confidentiality reasons).
Next, post a copy of the code you run every night to (re)populate the table. Again, leave out nothing.
Lastly, just to double-check that all values are indeed 6 characters long (not because I mistrust you, but because to err is human), run this query, with the relevant table and column names, for both versions of the table:
SELECT MIN(LEN(YourColumn)), MAX(LEN(YourColumn)),
       MIN(DATALENGTH(YourColumn)), MAX(DATALENGTH(YourColumn))
FROM YourTable;
I can't guarantee that the requested information will suffice to explain your issue, but it will at least give me a bit more information to work on.
June 16, 2009 at 10:46 pm
Very happy to help solve the puzzle.
Of course, an in-depth analysis as you suggested led to finding interesting things - such as the existence of 330,000 records with more than 6 bytes in the field I mentioned, and my forgetting to RTRIM about 300,000 fields during the import into TableB (although its VARCHAR definition should have taken care of that...).
All things considered, I still don't see why the data space is roughly 100 MB larger where the field is VARCHAR(35), instead of showing savings of roughly 245 MB!
I've included all the data in the attached text file to avoid clutter.
Suffice it to say the two tables are identical except for two fields, one being VARCHAR instead of CHAR. Each sits in its own database on two separate SQL Server 2005 instances on two separate servers - almost identical in most respects.
Each is repopulated nightly using DTS running in legacy mode from the same source tables on an iSeries.
Let's see if we can throw more light on this !
Thanks a lot for volunteering!
All the best
June 17, 2009 at 12:14 pm
Hi,
What ANSI_PADDING setting do you have when you script the tables? It is interesting that MaxLen differs from MaxDataLen in your file. According to BOL, LEN returns the length of the data without any trailing spaces. If you have ANSI_PADDING ON, the trailing spaces will not be trimmed - and your second select doesn't trim the field.
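To illustrate the point, here is a small sketch (hypothetical table names) of how the ANSI_PADDING setting in effect when the column is created changes what gets stored, and how LEN and DATALENGTH report it:
SET ANSI_PADDING ON;
CREATE TABLE dbo.PadOn  (c varchar(35) NOT NULL);
SET ANSI_PADDING OFF;
CREATE TABLE dbo.PadOff (c varchar(35) NOT NULL);

INSERT INTO dbo.PadOn  (c) VALUES ('123456   ');  -- trailing spaces are kept
INSERT INTO dbo.PadOff (c) VALUES ('123456   ');  -- trailing spaces are trimmed

SELECT 'PadOn'  AS tbl, LEN(c) AS len_result, DATALENGTH(c) AS bytes_stored FROM dbo.PadOn
UNION ALL
SELECT 'PadOff',        LEN(c),               DATALENGTH(c)                 FROM dbo.PadOff;
-- Expected (assumption): PadOn  -> LEN 6, DATALENGTH 9
--                        PadOff -> LEN 6, DATALENGTH 6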
Regards
Piotr
...and your only reply is slàinte mhath
June 17, 2009 at 3:01 pm
If the table didn't have varchar columns before the refactoring, then SQL Server would have to include some more bytes in each row. Each row would be expanded to include the count of variable-length fields in the row (2 bytes) and a variable column offset array (2 bytes * VarCount).
This often takes people by surprise, and it could account for some of the extra bytes.
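A quick way to see that overhead is to compare the record sizes of two single-column tables (a sketch with hypothetical names; the byte counts in the comments are assumptions based on the usual row format):
CREATE TABLE dbo.FixedOnly (c char(6)    NOT NULL);
CREATE TABLE dbo.WithVar   (c varchar(6) NOT NULL);

INSERT INTO dbo.FixedOnly (c) VALUES ('123456');
INSERT INTO dbo.WithVar   (c) VALUES ('123456');

SELECT 'FixedOnly' AS tbl, min_record_size_in_bytes
FROM sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID('dbo.FixedOnly'), NULL, NULL, 'DETAILED')
UNION ALL
SELECT 'WithVar',          min_record_size_in_bytes
FROM sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID('dbo.WithVar'), NULL, NULL, 'DETAILED');
-- Expected (assumption): roughly 13 bytes vs. 17 bytes - the extra 4 bytes being
-- the variable-column count (2) plus the offset-array entry (2)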
Kalen Delaney also writes the following about heap tables in her latest book:
"Deleted rows are not physically removed but marked for deletion. In addition to space on pages not being reclaimed, empty pages in heaps frequently cannot be reclaimed. Even if you delete all the rows from a heap, SQL Server does not mark the pages as unallocated, so the space is not available for other objects to use. The dynamic management view sys.dm_db_partition_stats still shows the space as belonging to the heap table."
June 17, 2009 at 4:20 pm
(censored)! I hate it when that happens. I spent a long time crafting a lengthy reply, then I get an error message when posting and I can't get my text back. AAAARRRGGHHH!!!!
Ol'SureHand (6/16/2009)
Very happy to help solve the puzzle. Of course, an in-depth analysis as you suggested led to finding interesting things - such as the existence of 330,000 records with more than 6 bytes in the field I mentioned, and my forgetting to RTRIM about 300,000 fields during the import into TableB (although its VARCHAR definition should have taken care of that...).
Actually, varchar does not take care of that; varchar preserves trailing spaces. But that is not enough to account for the difference in storage size, and neither is the single extra row in TableB.
All things considered, I still don't see why the data space is roughly 100 MB larger where the field is VARCHAR(35), instead of showing savings of roughly 245 MB!
I've included all the data in the attached text file to avoid clutter.
I had hoped that scripting the tables would reveal some forgotten index with a different fill factor on the two tables. Alas, no such luck!
Anyway, I used the calculations described in Books Online (chapter "Estimating the Size of a Heap") with the tables you posted to find that TableA stores 72 rows per page and TableB (based on an average length of 7 bytes for the varchar column) 89 rows per page. That results in a space estimate of 1,036,408 kB for TableA (as reported by sp_spaceused) and 838,440 kB for TableB (which is far less than sp_spaceused reports).
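For reference, the Books Online estimate boils down to two steps. Here is a sketch of the arithmetic, with the row size and row count as assumptions plugged in for illustration:
DECLARE @num_rows bigint, @row_size int, @rows_per_page int, @num_pages bigint;
SET @num_rows = 9329712;                      -- assumed row count
SET @row_size = 109;                          -- assumed bytes per row (TableA)
SET @rows_per_page = 8096 / (@row_size + 2);  -- + 2 bytes per row for the slot array
SET @num_pages = CEILING(1.0 * @num_rows / @rows_per_page);
SELECT @rows_per_page AS rows_per_page,       -- 72 for a 109-byte row
       @num_pages     AS pages,
       @num_pages * 8 AS estimated_kb;        -- one page = 8 KB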
However, the fact that these tables are heaps and not clustered might shed some light on this, due to the way empty space is reused (or rather, NOT reused) in heaps. There are two issues that might be involved.
1) If a row with varying length data is updated, the new data takes more space than the old data, and the extra space required is not available on the current page, the row is moved to a new page. But the old location is not freed up for reuse; instead, a "forwarding pointer" is left, pointing to the new location. Effectively, this means that the row is now taking up twice as much space as you'd expect!
I must say that I know nothing about DTS. I think it's a bit far-fetched to assume that DTS loads data by first loading the rows without the data and then updating them with the actual data - but if DTS is indeed that backwards, then it would offer a perfect explanation of what happens here, as the varchar column would start out at length zero and then be updated to a longer length.
2) If a row is removed, the space it took is available for reuse. Likewise if an already forwarded row has to be moved again, for yet more data growth. However, this empty space is not taken into consideration when new rows have to be stored - only when rows on the same page need extra room due to varchar data growing in length. In practice, this means that space taken by rows that have been deleted will hardly ever be reused. So that will cause heaps to take much more room than expected.
I think that TRUNCATE TABLE TheTable should really free up space previously allotted. I also think that DELETE FROM TheTable will usually not do this. And deleting part but not all of the data will probably not free up any space. So depending on exactly how DTS replaces the data, this issue might be at play - though I am at a loss why it would affect only TableB but not TableA.
Let's see if we can throw more light on this !
Thanks a lot for volunteering!
All the best
As you see, I have some vague ideas but no real killer explanation yet. I'm sorry.
Here are some things you can try. First, run the query below (substituting the actual database and table name) for both tables and post the results:
SELECT *
FROM sys.dm_db_index_physical_stats (DB_ID('MyDB'), OBJECT_ID('MyTable'), NULL, NULL, 'detailed')
Second, drop and recreate (just in case truncating does not suffice) the tables before the next DTS data load and check the data size afterwards.
Third, add a clustered index to the tables. This will not help you find an explanation, but it will almost certainly solve the issue. Remember that there are only very few good reasons to use a heap instead of a clustered table, and unless you know exactly what you are doing, you should cluster all your tables.
Please let me know if any of this rambling helps you find the solution!
(And now I'll first copy the message to the clipboard before trying to post it!)
June 17, 2009 at 7:05 pm
Hugo Kornelis (6/17/2009)
..Books Online (chapter "Estimating the Size of a Heap") :TableA stores 72 rows per page and TableB (based on an average length of 7 bytes for the varchar column) 89 rows per page. That results in a space estimate of 1,036,408 kB for TableA (as reported by sp_spaceused) and 838,440 kB for TableB (which is far less than sp_spaceused reports).
I think that TRUNCATE TABLE TheTable should really free up space previously allotted. I also think that DELETE FROM TheTable will usually not do this. And deleting part but not all of the data will probably not free up any space. So depending on exactly how DTS replaces the data, this issue might be at play - though I am at a loss why it would affect only TableB but not TableA.
SELECT *
FROM sys.dm_db_index_physical_stats (DB_ID('MyDB'), OBJECT_ID('MyTable'), NULL, NULL, 'detailed')
Curiouser and curiouser! This revealed even more fascinating things. Keep in mind TableA has the char(35) column and has been TRUNCATED then repopulated every night for many years ... with an ever-increasing number of records. TableB is the newly refactored varchar(35) running on a newer database restored from a backup of the old one. Both are repopulated by TRUNCATE then DTS doing a full SELECT on the mainframe source table.
In both, ANSI_PADDING is not enabled (disabled).
-- next-day DMV stats ... more records
TableA (char)
rows         reserved       data           index_size  unused
9,329,712    1,037,016 KB   1,036,640 KB   24 KB       352 KB
TableB (varchar)
rows         reserved       data           index_size  unused
9,329,712    1,166,360 KB   1,166,216 KB   24 KB       120 KB
--------
index_type_desc: HEAP (both)
alloc_unit_type_desc: IN_ROW_DATA
index_depth: 1
index_level: 0
avg_fragmentation_in_percent: TableA: 4.87970388648982 / TableB: 0.274258131753606
fragment_count: TableA: 809 / TableB: 65
avg_fragment_size_in_pages: TableA: 160.17305315204 / TableB: 2242.72307692308
page_count: TableA: 129,580 / TableB: 145,777
avg_page_space_used_in_percent: TableA: 98.7145910551025 / TableB: 98.8395107487027
record_count: TableA: 9,329,712 / TableB: 9,329,712
ghost_record_count: 0
version_ghost_record_count: 0
min_record_size_in_bytes: TableA: 109 / TableB: 113
max_record_size_in_bytes: TableA: 109 / TableB: 131
avg_record_size_in_bytes: TableA: 109 / TableB: 123.032
forwarded_record_count: 0
there are only very few good reasons to use a heap instead of a clustered table, and unless you know exactly what you are doing, you should cluster all your tables.
Agree! Just looking for some idea as to which columns to index/cluster on... it takes more than 4 months to get information from the mainframe people as to what exactly those column names mean...
I'll try it on the dev server anyway and keep this posted!
June 18, 2009 at 2:22 am
AAARRGGHHH!! I did it again! Clicked "Post" after crafting a lengthy reply without first copying it to a safe place, and got a site error. I *so* hate this interface!!
Here is a shorter version of the reply I just typed.
Ol'SureHand (6/17/2009): In both, ANSI_PADDING is not enabled (disabled).
I missed that before; please disregard my remark about preserving trailing spaces, as that is ANSI_PADDING ON behaviour.
Note though that this option is deprecated!
The DMV figures you posted are quite interesting:
avg_page_space_used_in_percent: TableA: 98.7145910551025 / TableB: 98.8395107487027
This shows that there's not much empty space on the pages. In other words, the problem is not caused by deleted rows leaving empty space that is not reused. That's one of my two theories out of the window!
forwarded_record_count 0
And this shows that there are no forwarding pointers (caused by growth of varying length data) in the tables. So that's my second theory disproved.
But I also saw this:
min_record_size_in_bytes: TableA: 109 / TableB: 113
For TableA, this is exactly as I calculated. For TableB, this isn't; I expected a minimum row size of 77 bytes. That's a difference of 36 bytes.
And that finally caused me to see what happened. When you used ALTER TABLE to refactor TableB, the char column became varchar, which is stored in a different portion of the row (the varying length part) - so the space previously used for this column is now unused. Similar for the tinyint to int change - the new column takes more space, so it is added after the existing columns and the single byte previously left for it is kept unused. That explains why the rows take 36 bytes more than expected.
To confirm, I tested this by creating a table with TableA's layout, loading some rows, then using ALTER TABLE to change the columns. I then used DBCC PAGE to examine the pages allocated to the table. It turns out that:
*) In pages already allocated to the table, the bytes originally assigned to the changed column are unchanged (still containing the data) but known to be unused ("dropped") in the metadata.
*) In pages allocated after the ALTER TABLE, the rows are laid out the same, so these 36 bytes are still marked as "dropped", though they are not initialised with any proper data.
*) Truncating the table does not affect the metadata, so the layout of the row storage remains unchanged and the 36 bytes are still wasted.
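For anyone who wants to reproduce the effect, a sketch along these lines should show the same symptom (object names are hypothetical; the DBCC PAGE inspection is omitted - the record sizes from the DMV already expose the wasted bytes):
CREATE TABLE dbo.ReproHeap
(
    id_col char(35) NOT NULL,   -- stand-in for the 6-character identifier
    flag   tinyint  NOT NULL
);

INSERT INTO dbo.ReproHeap (id_col, flag)
SELECT '123456', 1
FROM sys.all_objects;           -- just to get a few thousand rows

-- The in-place refactoring, as done on TableB:
ALTER TABLE dbo.ReproHeap ALTER COLUMN id_col varchar(35) NOT NULL;
ALTER TABLE dbo.ReproHeap ALTER COLUMN flag   int         NOT NULL;

-- Even after a TRUNCATE and reload, the old fixed-length slots stay "dropped"
-- in the metadata, and the record sizes show the extra bytes:
TRUNCATE TABLE dbo.ReproHeap;
INSERT INTO dbo.ReproHeap (id_col, flag)
SELECT '123456', 1
FROM sys.all_objects;

SELECT min_record_size_in_bytes, max_record_size_in_bytes, avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats
     (DB_ID(), OBJECT_ID('dbo.ReproHeap'), NULL, NULL, 'DETAILED');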
All this applies to clustered tables as well as heaps. But clustered tables have one advantage - they are affected by ALTER INDEX REBUILD (and the deprecated DBCC DBREINDEX), which does cause the metadata to change (and hence get rid of the 36 unused bytes) and all pages to be rebuilt. So after rebuilding the index, the 36 unused bytes are finally reclaimed and the table finally shrinks. Not so for a heap, of course: since a heap has no index, rebuilding indexes will not affect it. As far as I know, there is no command that is the counterpart of ALTER INDEX REBUILD for a heap.
However, that does not mean that all is lost. You still have two ways to reclaim the unused space:
1) Dropping and recreating the table. It's the easiest option, but can of course not be used if the table is required 24x7.
2) Adding a clustered index, as this will effectively rebuild the table. If you really don't want it, you can drop it afterwards. Note that if you create a UNIQUE clustered index, you'll reclaim all lost space. For a NONUNIQUE clustered index, SQL Server has to add 4 bytes per row to make them unique (the so-called uniquifier); these will remain in place after you drop the index. So this will not get rid of all lost space, but it will reduce the wasted space from 36 bytes per row to 4 bytes per row.
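A sketch of option 2, with hypothetical names (the clustered key has to be some column that actually exists in the table):
-- Build a clustered index to force the heap's pages to be rebuilt,
-- then drop it again if the table is supposed to remain a heap.
CREATE UNIQUE CLUSTERED INDEX CIX_TableB_Rebuild
    ON dbo.TableB (SomeUniqueColumn);   -- assumption: a suitable unique column exists

DROP INDEX CIX_TableB_Rebuild ON dbo.TableB;

-- Check the reclaimed space afterwards:
EXEC sp_spaceused 'dbo.TableB';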
I hope this helps. Please let me know how it goes.
And sorry for being a bit short. I hope that next time I post, I will remember to save my reply before hitting the post button.
June 18, 2009 at 2:57 am
Thanks Hugo, I was thinking about something along these lines. Somehow I missed the fact that the table was altered, not recreated. I once read a post by Tony Rogerson that tackles a similar topic, and it was lingering in my head.
Regards
Piotr
...and your only reply is slàinte mhath
June 18, 2009 at 5:37 am
Hugo Kornelis (6/18/2009)
min_record_size_in_bytes: TableA: 109 / TableB: 113
For TableA, this is exactly as I calculated. For TableB, this isn't; I expected a minimum row size of 77 bytes. That's a difference of 36 bytes.
And that finally caused me to see what happened. When you used ALTER TABLE to refactor TableB, the char column became varchar, which is stored in a different portion of the row (the varying length part) - so the space previously used for this column is now unused. Similar for the tinyint to int change - the new column takes more space, so it is added after the existing columns and the single byte previously left for it is kept unused. That explains why the rows take 36 bytes more than expected.
[..]
*) In pages allocated after the ALTER TABLE, the rows are laid out the same, so these 36 bytes are still marked as "dropped", though they are not initialised with any proper data.
*) Truncating the table does not affect the metadata, so the layout of the row storage remains unchanged and the 36 bytes are still wasted.
All this applies to clustered tables as well as heaps. But clustered tables have one advantage - they are affected by ALTER INDEX REBUILD (and the deprecated DBCC DBREINDEX), which does cause the metadata to change (and hence get rid of the 36 unused bytes) and all pages to be rebuilt. So after rebuilding the index, the 36 unused bytes are finally reclaimed and the table finally shrinks. Not so for a heap, of course: since a heap has no index, rebuilding indexes will not affect it. As far as I know, there is no command that is the counterpart of ALTER INDEX REBUILD for a heap.
-- recovering:
1) Dropping and recreating the table. It's the easiest option, but can of course not be used if the table is required 24x7.
2) Adding a clustered index, as this will effectively rebuild the table. If you really don't want it, you can drop it afterwards. Note that if you create a UNIQUE clustered index, you'll reclaim all lost space. For a NONUNIQUE clustered index, SQL Server has to add 4 bytes per row to make them unique (the so-called uniquifier); these will remain in place after you drop the index. So this will not get rid of all lost space, but it will reduce the wasted space from 36 bytes per row to 4 bytes per row.
Hugo, you're a genius!
This was easy to test, and we did get the 77-byte minimum record size.
---wastage was caused by ALTER TABLE from CHAR to VARCHAR !
Dropping and recreating gives us the expected savings on TableB:
rows        reserved     data         index_size  unused
9,330,603   829,840 KB   829,664 KB   16 KB       160 KB
index_type_desc: HEAP
alloc_unit_type_desc: IN_ROW_DATA
index_depth: 1
index_level: 0
avg_fragmentation_in_percent: TableB: 1.04062283203577
fragment_count: TableB: 150
avg_fragment_size_in_pages: TableB: 691.386666666667
page_count: TableB: 103,708
avg_page_space_used_in_percent: TableB: 98.9406226834692
record_count: TableB: 9,330,603
ghost_record_count: 0
version_ghost_record_count: 0
min_record_size_in_bytes: TableA: 109 / TableB: 77
max_record_size_in_bytes: TableA: 109 / TableB: 95
avg_record_size_in_bytes: TableA: 109 / TableB: 87.032
forwarded_record_count: 0
---
Since this is still research, I am using a dev server so it was easy to drop, recreate the table, then repopulate with DTS.
The mystery is now explained, and someone should add a big red warning sign to BOL: if you're trying to save storage in a HEAP by refactoring a char column to varchar, make sure you RE-CREATE the table - otherwise you'll keep carrying that wasted space forever!
I am humbled by your knowledge, and awed by your perseverance in researching this phenomenon.
SQLServerCentral ROCKS, too.
Now I do not feel alone any more!
This one should go into some Hall Of Fame and you should make it into a QOTD as well since yours is the glory of giving the exact explanation.
Many thanks to Piotrek who helped too!
June 18, 2009 at 6:38 am
Thanks for the kind words. It was a pleasure to help, and a fun challenge to figure out! 🙂
I like the suggestion to make this into a QotD, and I'll do just that. (That'll be some easy points for everyone who reads this topic!)
June 19, 2009 at 9:30 am
It's a shame that DBCC CLEANTABLE only operates on variable-length columns, but I guess it makes sense, given the row layout.
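For reference, a minimal invocation looks like this (placeholder names) - though it would not have helped here, because the wasted bytes came from a dropped fixed-length char slot:
DBCC CLEANTABLE ('MyDB', 'dbo.TableB');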
Books Online is very brief concerning all this. After a good ten minutes searching through the less-well-trodden parts, the only really relevant statement was found under ALTER TABLE! It says:
Dropping a column does not reclaim the disk space of the column. You may have to reclaim the disk space of a dropped column when the row size of a table is near, or has exceeded, its limit. Reclaim space by creating a clustered index on the table or rebuilding an existing clustered index by using ALTER INDEX.
...which isn't exactly comprehensive coverage 😉
Nice diagnosis Hugo.
Paul