July 26, 2011 at 9:10 pm
Comments posted to this topic are about the item Time for a ROWID?
July 26, 2011 at 11:52 pm
In my database designs my default choice for all tables is for them to have a surrogate key, a TIMESTAMP togethe with Last Modified UserID. For the surrogate key I mostly use IDENTITY, but in one project, with offline datasets, I have used GUIDs.
This surrogate key is used as the Clustered Primary Key. I name those fields ID (identity), TS (time stamp) and US (user stamp) in all tables. The data access layer of my application uses this for some low-level plumbing in managing the data on behalf of the business layer. Besides those fields tables containing top-level business objects (customer, order, etc) will have a unique key (mostly a human-friendly code, either textual or numeric, with appropriate index and constraints).
I don't even think about primary/foreign key methods when adding new functionality to the business layer: the underlying tables will get the above mentioned fields. The users will never see these values and their unique keys could be updated without breaking any relations in the schema (I learned the hard way that the cost to build functionality (coding a cascading update over the foreign keys, having to maintain this code when foreign keys are added, performance if a very large set is updated) for the user being able to occasionally make changes to their customer codes etc, is far outweighed by the small overhead of having a fairly static separate index on this separate key-field.
I consider identifying and tracking (auditing) as part of the data-storage solution and any implementation of features I no longer have to worry about in my application developments are welcome to me (in the end I am an application architect/developer).
So, if the database-server provides these functions (ROWID provides just one of them, but rudimentary auditing using timestamps and last-modified-by-user-account as well) I would seriously considering updating my schema and data-access layer of the application to make use of them sooner rather than later.
Only very occasionally I create a table that deviates from this pattern, mostly when it doesn't contain business-data, but some other arbitrary data and only when those fields would actually hamper some other application-requirement (performance, space-requirments, etc). So it would be nice if these featured could be turned off when I wanted to, but would be there by default.
July 27, 2011 at 12:32 am
Little remark... All non-unique non-clustered indexes are created with a sysem internal unique id, 8 bytes in size, so that sql could differentiate each row. These id "columns" aren't visible or accessible.
If they would have to do anything they could just make this column available for use without modifying. Whatever the column's value, it is good enough for any task (it doesn't have to be "pretty" like a IDENTITY(1,1) column is).
July 27, 2011 at 1:25 am
I'm not sure increasing the row length in SQL would help much--after all, the largest data types are stored in off-row data pages anyway, and a properly normalised database surely shouldn't need so many of the smaller data types that you'd need to increase the row length?
July 27, 2011 at 1:56 am
Larger pages sound kind of inevitable at *some* point in the future, and I can't imagine they'd change the basic concepts we deal with.
However, ROWIDs would be a different kettle of fish. For one thing, introducing a user-visible "pointer" might destroy the distinction between heaps and clustered indexes. SQL Server RIDs are hidden for a reason: they're an implementation detail and exposing them would severely limit the amount of data reorganisation that can be done under the covers; for clustered indexes I don't believe they even get used much internally (CheckDB is probably an exception).
One could turn the argument sideways and propose that all tables, whether heaps or clustered indexes, should always get an automatically-generated bigint identity column, whether you want it or not. It would certainly force a more consistent design vocabularly among users if that were the case, but it would be sure to have many detractors. Ultimately I guess we all just want the freedom to shoot ourselves in the foot with our own choice of weapon instead of an officially mandated one!
July 27, 2011 at 1:58 am
paul.knibbs (7/27/2011)
I'm not sure increasing the row length in SQL would help much--after all, the largest data types are stored in off-row data pages anyway, and a properly normalised database surely shouldn't need so many of the smaller data types that you'd need to increase the row length?
I don't think Steve was thinking of allowing for larger records, but rather lowering overhead-costs associated with managing fewer pages in the page-pool if the I/O costs for loading a larger page is negligible on modern hardware.
Microsoft would have to do some bench-marking to see if the vast majority of applications running on typical hardware would not be negatively impacted by a larger page-size if top-end hardware may benefit. Maybe even allow the sysadmin to choose the optimum page-size for his hardware when installing the server (if Microsoft did a good job with their source-code, they may be able to compile the server-kernel for different page-sizes).
July 27, 2011 at 2:30 am
the.komplikator (7/27/2011)
Little remark... All non-unique non-clustered indexes are created with a sysem internal unique id, 8 bytes in size, so that sql could differentiate each row. These id "columns" aren't visible or accessible.
Errr, no...
Non-unique clustered indexes get a uniquifier, which is 4 bytes (an int) added to rows that have duplicate values for the clustering key.
Nonclustered indexes get either the clustering key (complete with uniquifier) if the base table has a clustered index, or the 8-byte RID (file_id, page_id, slot_index) if the base table is a heap.
If the nonclustered index is unique, that's present only in the leaf level, if the nonclustered index is not unique that is present in the leaf and non-leaf levels.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 27, 2011 at 3:13 am
GilaMonster (7/27/2011)
Errr, no...Non-unique clustered indexes get a uniquifier, which is 4 bytes (an int) added to rows that have duplicate values for the clustering key.
Nonclustered indexes get either the clustering key (complete with uniquifier) if the base table has a clustered index, or the 8-byte RID (file_id, page_id, slot_index) if the base table is a heap.
If the nonclustered index is unique, that's present only in the leaf level, if the nonclustered index is not unique that is present in the leaf and non-leaf levels.
Right. No more early-in-the-morning posts for me 🙂
July 27, 2011 at 6:37 am
Back on the secondary subject of larger pages, I guess the obvious argument against them would be buffer cache wastage. If you have hundreds of often-used lookup/dimension tables with only a few narrow rows in each, quite a lot of buffer cache will essentially be useless, with the wasted amount being proportional to the page size. A similar argument will apply where there have been lots of page splits.
July 27, 2011 at 7:08 am
We also use a unique record ID field in many of our tables. Very handy for updates and confidence that you are updating the exact record you just selected. Does this not serve the same purpose as a 'ROWID'?
Concerning increasing the row length, I'm reminded of the flat files on the mini-computer back in 1982 which I started on. I thought part of the idea of a "relational" database was to decrease row length for faster access and flexibility. However, I admit I am not an expert on the guts of SQL.
Wallace Houston
Sunnyland Farms, Inc.
"We must endeavor to persevere."
July 27, 2011 at 7:11 am
Steve, maybe it's because I haven't had enough caffeine this morning, but I have a question: How does the ROWID column that you propose differ from an identity column?
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
July 27, 2011 at 7:45 am
mar10br0 (7/27/2011)
paul.knibbs (7/27/2011)
I'm not sure increasing the row length in SQL would help much--after all, the largest data types are stored in off-row data pages anyway, and a properly normalised database surely shouldn't need so many of the smaller data types that you'd need to increase the row length?I don't think Steve was thinking of allowing for larger records, but rather lowering overhead-costs associated with managing fewer pages in the page-pool if the I/O costs for loading a larger page is negligible on modern hardware.
Microsoft would have to do some bench-marking to see if the vast majority of applications running on typical hardware would not be negatively impacted by a larger page-size if top-end hardware may benefit. Maybe even allow the sysadmin to choose the optimum page-size for his hardware when installing the server (if Microsoft did a good job with their source-code, they may be able to compile the server-kernel for different page-sizes).
It seems to me that there are performance advantages to having a smaller more granular page size, at least when querying non-clustered rows. If just one row is read from a page, then the entire page is copied to the buffer. So, if SQL Server's page size was increased from 8K to something like 64K, then the same query would require more storage in the buffer pool and thus more total memory. Of course you could always install more memory, but a larger page size would in many cases use the same memory less efficiently than would a smaller page size.
Does that sound right?
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
July 27, 2011 at 7:45 am
the.komplikator (7/27/2011)
GilaMonster (7/27/2011)
Errr, no...Non-unique clustered indexes get a uniquifier, which is 4 bytes (an int) added to rows that have duplicate values for the clustering key.
Nonclustered indexes get either the clustering key (complete with uniquifier) if the base table has a clustered index, or the 8-byte RID (file_id, page_id, slot_index) if the base table is a heap.
If the nonclustered index is unique, that's present only in the leaf level, if the nonclustered index is not unique that is present in the leaf and non-leaf levels.
Right. No more early-in-the-morning posts for me 🙂
😀
I try not to post until I've had at least one cup of coffee. Otherwise really strange things get posted
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 27, 2011 at 7:47 am
I think the reason you can't access the internal ID is that it can change when a reorganization takes place. So you'd break any integrity that relied on the key. This is why we have identity fields that don't change by themselves.
July 27, 2011 at 7:58 am
stephen.lear (7/27/2011)
I think the reason you can't access the internal ID is that it can change when a reorganization takes place. So you'd break any integrity that relied on the key. This is why we have identity fields that don't change by themselves.
Oracle exposes the internal address of the row as a pseudocolumn called ROWID. However, I've never found a practical use for it. When it comes to de-duplicating, it's more appropriate to rank() the rows using a combination of columns. Even when dealing with a table that has true duplicates, the same column values straight across, the function row_number() can be used instead of a ROWID, and it's more useful because it's an integer. In Oracle, ROWID is represented as a 16 byte hex string.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
Viewing 15 posts - 1 through 15 (of 67 total)
You must be logged in to reply to this topic. Login to reply