July 29, 2011 at 2:28 am
David Walker-278941 (7/28/2011)
SQL data is allocated on disk in 8 kilobyte chunks. The 512 byte sector size of the underlying physical disk does not affect this *for writing SQL data*.
I am pretty sure *file* allocation happens in 64KB extents, which contain multiple pages for *objects*, each page being 8KB in size.
This is from Books Online:
The fundamental unit of data storage in SQL Server is the page. The disk space allocated to a data file (.mdf or .ndf) in a database is logically divided into pages numbered contiguously from 0 to n. Disk I/O operations are performed at the page level. That is, SQL Server reads or writes whole data pages.
Extents are a collection of eight physically contiguous pages and are used to efficiently manage the pages. All pages are stored in extents.
The fact that all pages are stored in extents means that your file size grows by a minimum of 64KB, or 8 pages, at a time. I stand corrected on the 512-byte sector size, however: it seems that although SQL Server technically writes data in sector-size chunks, it does not attempt to write partial pages. When only part of a page is written and the checksum is broken, you have a torn/corrupt page, a result of writing chunks smaller than the page size.
Coming to torn pages: the larger the page size, the more data is lost when a page is corrupted due to a partial write.
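As an aside, you can watch this growth happen. A minimal sketch (dbo.Demo is a placeholder table name, not anything from this thread); once a table grows past its first few mixed-extent pages, the reserved page count climbs in 8-page (64KB) extents:

-- Pages reserved for a table, in 8KB pages and in KB.
SELECT OBJECT_NAME(object_id) AS table_name,
       reserved_page_count,              -- grows in extents of 8 pages
       reserved_page_count * 8 AS reserved_kb
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID(N'dbo.Demo');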
July 29, 2011 at 2:57 am
peter-757102 (7/29/2011)
... I stand corrected on the 512-byte sector size, however: it seems that although SQL Server technically writes data in sector-size chunks, it does not attempt to write partial pages. ...
SQL Server doesn't care about the sector size of the physical disk; it sends whole 8KB buffers (or a whole extent, if it deems that appropriate) to the OS for writing to disk. It is the OS that deals in the first instance with disk allocation units (NTFS uses 4KB clusters), and the disk's driver in turn deals with hardware-level sectors. Classic hard drives still use sector sizes of 512 bytes, but SSDs commonly have 4KB sectors.
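You can see that page-centric view from inside SQL Server itself; a small illustrative query (run in any database, no assumptions beyond the catalog view):

-- sys.database_files reports file sizes in 8KB pages, with no reference
-- to the sector or cluster size of the volume underneath.
SELECT name,
       physical_name,
       size     AS size_in_pages,  -- unit is 8KB pages
       size * 8 AS size_kb
FROM sys.database_files;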
July 29, 2011 at 6:24 am
mar10br0 (7/28/2011)
For deduping duplicates in a heap table, I would just delete all offending rows, then re-insert one record. Since a heap cannot have any foreign keys pointing to it, this will succeed, and when done within a transaction the temporary non-existence of the row is never noticed by the application. If you have a bad design containing heaps and an even worse-designed application inserting duplicates, well, I feel sorry for you having to do this dirty cleanup technique, but it works. (It is the very reason why I NEVER create heaps and ALL my tables have at least a surrogate primary key; having the engine force the existence of a ROWID as PK is fine with me.)
A heap can still have a nonclustered primary key and foreign key relationships. The term heap just means that the table has no clustered index.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
July 29, 2011 at 9:24 pm
Eric M Russell (7/29/2011)
mar10br0 (7/28/2011)
For deduping duplicates in a heap table, I would just delete all offending rows, then re-insert one record. Since a heap cannot have any foreign keys pointing to it, this will succeed, and when done within a transaction the temporary non-existence of the row is never noticed by the application. If you have a bad design containing heaps and an even worse-designed application inserting duplicates, well, I feel sorry for you having to do this dirty cleanup technique, but it works. (It is the very reason why I NEVER create heaps and ALL my tables have at least a surrogate primary key; having the engine force the existence of a ROWID as PK is fine with me.)
A heap can still have a nonclustered primary key and foreign key relationships. The term heap just means that the table has no clustered index.
Ah, then I've misunderstood the term heap. I was responding here to the problem of deduping records in a table that has no primary key, where duplicate records cannot be identified individually (so there is no criterion by which to tell the server to delete one record while keeping the other). If there is a primary key (clustered or not), deduping is not a difficult issue.
August 1, 2011 at 6:33 am
mar10br0 (7/29/2011)
Eric M Russell (7/29/2011)
mar10br0 (7/28/2011)
For deduping duplicates in a heap table, I would just delete all offending rows, then re-insert one record. Since a heap cannot have any foreign keys pointing to it, this will succeed, and when done within a transaction the temporary non-existence of the row is never noticed by the application. If you have a bad design containing heaps and an even worse-designed application inserting duplicates, well, I feel sorry for you having to do this dirty cleanup technique, but it works. (It is the very reason why I NEVER create heaps and ALL my tables have at least a surrogate primary key; having the engine force the existence of a ROWID as PK is fine with me.)
A heap can still have a nonclustered primary key and foreign key relationships. The term heap just means that the table has no clustered index.
Ah, then I've misunderstood the term heap. I was responding here to the problem of deduping records in a table that has no primary key, where duplicate records cannot be identified individually (so there is no criterion by which to tell the server to delete one record while keeping the other). If there is a primary key (clustered or not), deduping is not a difficult issue.
In one situation, I have a primary key, but multiple records can still be considered "duplicates". For example, a patient cannot be actively enrolled in more than one disease management program at the same time. This could technically be implemented as a hard check constraint, but not practically, because there are some exceptions. Also, if the application workflow (rightly or wrongly) allows a patient to dual-enroll, then that event must still be recorded in the table. Whatever the case, we only bill the client for one active program enrollment per patient, so some logic must be applied to bill for the appropriate program.
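To sketch what that billing logic might look like (the table and column names here are invented for illustration, not our actual schema):

-- Hypothetical schema: dbo.Enrollment(PatientId, ProgramId, EnrollDate, IsActive).
-- Pick exactly one active enrollment per patient; the ORDER BY encodes
-- whatever tie-break rule the business dictates (earliest enrollment here).
SELECT PatientId, ProgramId
FROM (SELECT PatientId,
             ProgramId,
             ROW_NUMBER() OVER (PARTITION BY PatientId
                                ORDER BY EnrollDate) AS rn
      FROM dbo.Enrollment
      WHERE IsActive = 1) AS ranked
WHERE rn = 1;  -- bill only this one enrollment per patient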
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
August 1, 2011 at 5:11 pm
Eric M Russell (8/1/2011)
mar10br0 (7/29/2011)
Eric M Russell (7/29/2011)
mar10br0 (7/28/2011)
For deduping duplicates in a heap table, I would just delete all offending rows, then re-insert one record. Since a heap cannot have any foreign keys pointing to it, this will succeed, and when done within a transaction the temporary non-existence of the row is never noticed by the application. If you have a bad design containing heaps and an even worse-designed application inserting duplicates, well, I feel sorry for you having to do this dirty cleanup technique, but it works. (It is the very reason why I NEVER create heaps and ALL my tables have at least a surrogate primary key; having the engine force the existence of a ROWID as PK is fine with me.)
A heap can still have a nonclustered primary key and foreign key relationships. The term heap just means that the table has no clustered index.
Ah, then I've misunderstood the term heap. I was responding here to the problem of deduping records in a table that has no primary key, where duplicate records cannot be identified individually (so there is no criterion by which to tell the server to delete one record while keeping the other). If there is a primary key (clustered or not), deduping is not a difficult issue.
In one situation, I have a primary key, but multiple records can still be considered "duplicates". For example, a patient cannot be actively enrolled in more than one disease management program at the same time. This could technically be implemented as a hard check constraint, but not practically, because there are some exceptions. Also, if the application workflow (rightly or wrongly) allows a patient to dual-enroll, then that event must still be recorded in the table. Whatever the case, we only bill the client for one active program enrollment per patient, so some logic must be applied to bill for the appropriate program.
In your case (having a primary key) you can easily delete erroneous duplicates (if the user decides to correct the duplicate enrollments) via their primary key values and leave one record in place, as in the sketch below. It's when true duplicates occur in a badly designed table without a primary key (every column having the same value) that it is technically impossible to delete one and leave the other (what WHERE clause would you use?). The article proposes to solve this issue by guaranteeing the existence of a surrogate key, in the form of a ROWID used as the primary key.
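A minimal sketch of the keyed case (again with an invented schema matching your example, dbo.Enrollment with EnrollmentId as the primary key): keep the lowest key in each logical duplicate group and delete the rest via the primary key.

-- Hypothetical: dbo.Enrollment(EnrollmentId PRIMARY KEY, PatientId, ProgramId, ...).
DELETE e
FROM dbo.Enrollment AS e
WHERE e.EnrollmentId NOT IN (SELECT MIN(EnrollmentId)
                             FROM dbo.Enrollment
                             GROUP BY PatientId, ProgramId);
-- Rows that are not duplicated are each the MIN of their own group, so they survive.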
Your case is a perfect example of why "real-world" data (visible to the user) should never be used as a primary key: in the real world there is a likelihood that, in exceptional circumstances, the uniqueness rule needs to be violated. Another example is when the original input is later discovered to contain a typo, or is otherwise incorrect. Updating the columns comprising a primary key is challenging (causing a costly re-shuffle of clustered records and/or cascading updates of foreign keys, risking an escalation of data changes that could potentially run wild and have unwanted side effects, etc.).
August 1, 2011 at 7:34 pm
mar10br0 (8/1/2011)
It's when true duplicates occur in a badly designed table without a primary key (every column having the same value) that it is technically impossible to delete one and leave the other (what WHERE clause would you use?). The article proposes to solve this issue by guaranteeing the existence of a surrogate key, in the form of a ROWID used as the primary key.
Use ROW_NUMBER() to uniquely identify the row. TOP (1) is also valid in a DELETE statement.
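For example, against a hypothetical keyless table dbo.Dupes(a, b) holding exact duplicates:

-- 1) Number the copies within each duplicate group, then delete all but the first.
WITH numbered AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY a, b
                              ORDER BY (SELECT NULL)) AS rn
    FROM dbo.Dupes
)
DELETE FROM numbered
WHERE rn > 1;

-- 2) Or remove one arbitrary row from a specific duplicate group with TOP.
DELETE TOP (1)
FROM dbo.Dupes
WHERE a = 1 AND b = 2;  -- literal values identify the group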
mar10br0 (8/1/2011)
Your case is a perfect example of why "real-world" data (visible to the user) should never be used as a primary key: in the real world there is a likelihood that, in exceptional circumstances, the uniqueness rule needs to be violated. Another example is when the original input is later discovered to contain a typo, or is otherwise incorrect. Updating the columns comprising a primary key is challenging (causing a costly re-shuffle of clustered records and/or cascading updates of foreign keys, risking an escalation of data changes that could potentially run wild and have unwanted side effects, etc.).
What if the overhead incurred from using a surrogate key is greater than the overhead from correcting invalid columns? Or if it's a fairly static table? There are certainly cases where natural keys make sense. And there are times when uniqueness can be enforced. Sure, there are cases like SSN where you think it can be enforced but that causes problems. But what about tables containing configuration names, where it's programmatically required that each name be unique? In that case, if a user tries entering a non-unique name, an error should be thrown so they can pick a new name or change the existing one. And if it's a small table, that won't make a difference with fragmentation.
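For what it's worth, that configuration-name case is a one-liner to enforce (names here are invented):

-- A duplicate name fails the unique constraint (error 2627), which the
-- application can catch and turn into a "pick another name" prompt.
CREATE TABLE dbo.Config
(
    ConfigName  varchar(100) NOT NULL CONSTRAINT UQ_Config_Name UNIQUE,
    ConfigValue varchar(400) NOT NULL
);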
August 2, 2011 at 8:21 am
In both SQL Server and Oracle, the ROWID is a physical marker that changes when the physical position of the row in the page or table changes. For that reason, it can't be relied upon to identify which row was inserted first, in the same way (for example) that a street address number can't identify which resident was living in the neighborhood first, especially if they swap houses. If you're looking for a column to use as a tie-breaker in a ranking function, then an identity column would be a better choice, because it's sequential and, once inserted, its value follows the row as the row's position moves during table reorganization. If you're confronted with the issue routinely, then I think the best solution would be to add an identity column or, even better, a date/time column populated by a default at the time of insert. Exposing the ROWID to the SQL developer just doesn't seem to have any practical application when there are better, more deterministic choices.
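A sketch of the kind of table I mean (names illustrative): both the sequential identity and the insert timestamp follow the row through physical reorganization, unlike a row's physical locator.

CREATE TABLE dbo.Event
(
    EventId    int IDENTITY(1, 1) NOT NULL PRIMARY KEY,  -- sequential tie-breaker
    InsertedAt datetime NOT NULL
               CONSTRAINT DF_Event_InsertedAt DEFAULT (GETDATE()),  -- set at insert
    Payload    varchar(200) NOT NULL
);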
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho