May 5, 2020 at 12:00 am
Comments posted to this topic are about the item Choosing Sequences Over Identity
May 5, 2020 at 9:32 am
Another of the myriad things I've never used that I'll consider for the future. Thanks, I think. 🙂 It didn't quite make sense to me, being dense, until I read this quick primer; then I got it:
https://www.sqlservertutorial.net/sql-server-basics/sql-server-sequence/
May 5, 2020 at 12:29 pm
I have had a look at the blog you mentioned, and also Joe Celko's piece (https://www.red-gate.com/simple-talk/sql/learn-sql-server/sql-server-sequence-basics/). I can see the linked-server differences might be useful if you are in that sort of environment. I can also see that having something that is ANSI-2003-compliant might be something you want if you have multi-vendor considerations. Similarly, I can see that SEQUENCE has a use where you need one sequence of numbers, kept distinct across multiple tables.
Having said which, I can't think of any occasion in my 30 years of dealing with databases where any of the differences between sequences and identities would have been beneficial for me. Identity does a very specific job, and when SCOPE_IDENTITY and its ilk were introduced (and with the OUTPUT clause) it became easier and more accurate to deal with.
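For anyone who hasn't used them together, a minimal sketch of both approaches (table and column names are made up purely for illustration):

CREATE TABLE dbo.Widget
(
    WidgetID   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    WidgetName VARCHAR(50) NOT NULL
);

-- SCOPE_IDENTITY(): the last identity value generated in the current scope.
INSERT INTO dbo.Widget (WidgetName) VALUES ('Sprocket');
SELECT SCOPE_IDENTITY() AS NewWidgetID;

-- OUTPUT clause: captures every generated value, even for multi-row inserts.
DECLARE @NewIDs TABLE (WidgetID INT);
INSERT INTO dbo.Widget (WidgetName)
OUTPUT inserted.WidgetID INTO @NewIDs (WidgetID)
VALUES ('Gear'), ('Cog');
SELECT WidgetID FROM @NewIDs;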
May 5, 2020 at 3:06 pm
To be honest, and except for the use of getting/reserving a number prior to storing an item in a table, most of the reasons given for using sequences are a bad idea, IMHO. There may actually be a reason for more than one table to share a sequence, but I've not yet found one. Sharing sequences across a linked server makes me wonder about the overall design of the system and the autonomy of said system. IDENTITY also provides automatic protection against some idiot making up numbers and inserting them in willy-nilly fashion. It doesn't prevent someone from doing so, but it does throw a couple of extra steps in the way to help prevent accidents.
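For instance, inserting an explicit value into an IDENTITY column fails outright unless you take the very visible extra step of enabling IDENTITY_INSERT. A quick sketch (the table is hypothetical):

-- Hypothetical table, just to show the behaviour.
CREATE TABLE dbo.DemoIdentity (ID INT IDENTITY(1,1) PRIMARY KEY, Label VARCHAR(20));

-- Fails: explicit values aren't accepted while IDENTITY_INSERT is OFF.
INSERT INTO dbo.DemoIdentity (ID, Label) VALUES (9999, 'made up');

-- A deliberate override needs the extra step, which also makes it obvious in code reviews.
SET IDENTITY_INSERT dbo.DemoIdentity ON;
INSERT INTO dbo.DemoIdentity (ID, Label) VALUES (9999, 'made up');
SET IDENTITY_INSERT dbo.DemoIdentity OFF;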
Of course, "It Depends" seriously comes into play but, for most things, I see no great advantage to sequences and I see couple of disadvantages.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 6, 2020 at 7:27 am
I've had a couple of cases where I've wanted to share a sequence across two or more tables. Usually when I have several different sources for something like customer data, and I want to generate distinct IDs early on in the ETL process before merging the streams.
But the main thing sequences give me is extensibility. It means that although you can't think of a reason right now why you might want to generate, e.g., CustomerID in more than one place, you're allowing yourself the possibility of doing it in the future. Similarly, you make it easier for yourself if you ever want to insert values for these IDs yourself, maybe because you're putting some default values in place, or because you want to move the ID generation somewhere else in your process.
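A minimal sketch of that shared-sequence idea (all object names here are invented for illustration):

-- One sequence feeding surrogate keys for two differently sourced staging tables,
-- so the IDs never collide when the streams are merged later.
CREATE SEQUENCE dbo.CustomerIDSeq AS BIGINT START WITH 1 INCREMENT BY 1 CACHE 100;

CREATE TABLE dbo.Staging_WebCustomers
(
    CustomerID BIGINT NOT NULL
        CONSTRAINT DF_WebCust_ID DEFAULT (NEXT VALUE FOR dbo.CustomerIDSeq),
    Email      VARCHAR(255) NOT NULL
);

CREATE TABLE dbo.Staging_StoreCustomers
(
    CustomerID BIGINT NOT NULL
        CONSTRAINT DF_StoreCust_ID DEFAULT (NEXT VALUE FOR dbo.CustomerIDSeq),
    LoyaltyCardNo VARCHAR(20) NOT NULL
);

-- The same sequence can also be drawn on directly, e.g. to reserve an ID up front:
DECLARE @ReservedID BIGINT = NEXT VALUE FOR dbo.CustomerIDSeq;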
May 6, 2020 at 9:45 pm
A recent database I created involves tables where the natural unique keys are several columns wide. Rather than using an IDENTITY or other incremental ID, for the surrogate key column I actually used a computed column consisting of an MD5 hash of the natural keys, cast to BIGINT.
CREATE TABLE CustomerHistory
(
CustomerHistoryID AS
CAST(
HASHBYTES('MD5'
, CAST(StoreCode AS VARBINARY(100))
+ CAST(CustomerID AS VARBINARY(100))
+ CAST(StartTime AS VARBINARY(100))
) AS BIGINT) PERSISTED NOT NULL
CONSTRAINT PK_CustomerHistory PRIMARY KEY CLUSTERED,
StoreCode VARCHAR(5) NOT NULL,
CustomerID INT NOT NULL,
StartTime DATE NOT NULL,
EndTime DATE NULL,
IsCurrent BIT NOT NULL
);
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
May 7, 2020 at 7:50 am
I saw a UUID generator written in Assembler a while ago (which, unfortunately, I cannot find anymore), but an approach like this is what I'd go for if I had to design a DB with a lot of inserts happening constantly. The C# code in there is mostly the same as the one I've seen, but you can scrap going for something timestamp-based altogether if you assume your timestamp part starts at 1900-01-01 anyway, and swap those timestamp bytes around by patterns too. That way you can have UUIDs which appear random to you but which, looked at on a bit level, perform extremely well for sequential workloads, because, at least to SQL Server, those GUIDs wouldn't be completely random anymore if you're bit-shifting through essentially every portion of the UUID. Bit shifting on most modern CPUs costs next to nothing, unlike most other operations, which can take between one and many CPU cycles.
I'm still hoping to find that Assembler UUID generator again one day, because I smell a potential natively compiled stored procedure coming out of it.
May 7, 2020 at 2:50 pm
A recent database I created involves tables where the natural unique keys are several columns wide. Rather than using an IDENTITY or other incremental ID, for the surrogate key column I actually used a computed column consisting of an MD5 hash of the natural keys, cast to BIGINT.
CREATE TABLE CustomerHistory
(
CustomerHistoryID AS
CAST(
HASHBYTES('MD5'
, CAST(StoreCode AS VARBINARY(100))
+ CAST(CustomerID AS VARBINARY(100))
+ CAST(StartTime AS VARBINARY(100))
) AS BIGINT) PERSISTED NOT NULL
CONSTRAINT PK_CustomerHistory PRIMARY KEY CLUSTERED,
StoreCode VARCHAR(5) NOT NULL,
CustomerID INT NOT NULL,
StartTime DATE NOT NULL,
EndTime DATE NULL,
IsCurrent BIT NOT NULL
);
I like it.
I've used something similar in the process of setting up for "UPSERTS". MD5 is more prone to collisions than the SHA series but works great for determining whether two items are different. If MD5 produces a different result for two or more items, those items are guaranteed to be different; because of the possibility of collisions, though, two items that have the same MD5 result are NOT guaranteed to be the same.
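A rough sketch of that change-detection use (the table and column names are assumptions, not from any real system):

-- Only update rows whose hashes differ; differing hashes guarantee the rows really are different.
UPDATE  tgt
SET     tgt.Col1 = src.Col1,
        tgt.Col2 = src.Col2
FROM    dbo.Target  AS tgt
JOIN    dbo.Staging AS src
        ON src.BusinessKey = tgt.BusinessKey
WHERE   HASHBYTES('MD5', CONCAT(src.Col1, '|', src.Col2))
     <> HASHBYTES('MD5', CONCAT(tgt.Col1, '|', tgt.Col2));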
With that, I have to ask: since MD5 returns a 16-byte result and you convert that to BIGINT, which is an 8-byte result (a whopping 19 or so decimal orders of magnitude fewer possible values), have you had any problems with collisions between non-duplicate values?
--Jeff Moden
Change is inevitable... Change for the better is not.
May 7, 2020 at 2:52 pm
I saw a UUID generator written in Assembler a while ago (which, unfortunately, I cannot find anymore), but an approach like this is what I'd go for if I had to design a DB with a lot of inserts happening constantly. The C# code in there is mostly the same as the one I've seen, but you can scrap going for something timestamp-based altogether if you assume your timestamp part starts at 1900-01-01 anyway, and swap those timestamp bytes around by patterns too. That way you can have UUIDs which appear random to you but which, looked at on a bit level, perform extremely well for sequential workloads, because, at least to SQL Server, those GUIDs wouldn't be completely random anymore if you're bit-shifting through essentially every portion of the UUID. Bit shifting on most modern CPUs costs next to nothing, unlike most other operations, which can take between one and many CPU cycles.
I'm still hoping to find that Assembler UUID generator again one day, because I smell a potential natively compiled stored procedure coming out of it.
Good information, but I have to ask: how did we go from a discussion of incremental sequenced numbers to UUIDs?
--Jeff Moden
Change is inevitable... Change for the better is not.
May 7, 2020 at 9:13 pm
With that, I have to ask: since MD5 returns a 16-byte result and you convert that to BIGINT, which is an 8-byte result (a whopping 19 or so decimal orders of magnitude fewer possible values), have you had any problems with collisions between non-duplicate values?
You're right, Jeff, it is a truncated MD5 hash. Still, BIGINT is 64 bits, or -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807). In practice, I'm only using the MD5 surrogate keys on metadata tables, so while collisions are possible, they're not likely to occur across a recordset numbering in the few thousands.
What I'm actually doing is developing a CCPA update engine that's responsible for executing updates across 20+ servers and 1000+ columns we've identified as containing PII. The CPColumnID is hashed across ServerName, DatabaseName, SchemaName, TableName, and ColumnName. I needed some type of computed surrogate key, because CPColumnID is used as a foreign key in several different tables used for mapping, tasks, and auditing. I liked hashing because the ID values are portable and consistent across multiple development and QA environments.
I actually did encounter collisions initially when casting MD5 as INT and generating IDs for a table containing a million randomly generated records. However, it doesn't happen with BIGINT, even when repeating the test many times. But, yeah, I do sometimes think about whether I should have just gone with the ubiquitous INT IDENTITY(1,1).
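For a rough sense of scale, here's a birthday-problem back-of-the-envelope (an approximation, not a measurement):

-- Expected colliding pairs ~ n^2 / (2 * keyspace).
-- A million random keys truncated to INT (2^32 values): collisions are all but certain.
-- The same million truncated to BIGINT (2^64 values): odds of even one collision are tiny.
SELECT
    1e6 * 1e6 / (2 * POWER(2e0, 32)) AS ExpectedCollidingPairs_INT,          -- ~116
    1e6 * 1e6 / (2 * POWER(2e0, 64)) AS ApproxCollisionProbability_BIGINT;   -- ~2.7e-8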
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
May 8, 2020 at 6:47 am
DinoRS wrote: I saw a UUID generator written in Assembler a while ago (which, unfortunately, I cannot find anymore), but an approach like this is what I'd go for if I had to design a DB with a lot of inserts happening constantly. The C# code in there is mostly the same as the one I've seen, but you can scrap going for something timestamp-based altogether if you assume your timestamp part starts at 1900-01-01 anyway, and swap those timestamp bytes around by patterns too. That way you can have UUIDs which appear random to you but which, looked at on a bit level, perform extremely well for sequential workloads, because, at least to SQL Server, those GUIDs wouldn't be completely random anymore if you're bit-shifting through essentially every portion of the UUID. Bit shifting on most modern CPUs costs next to nothing, unlike most other operations, which can take between one and many CPU cycles.
I'm still hoping to find that Assembler UUID generator again one day, because I smell a potential natively compiled stored procedure coming out of it.
Good information, but I have to ask: how did we go from a discussion of incremental sequenced numbers to UUIDs?
Well, as I said, to us it isn't visibly an incremental sequenced number, but with the right bit-shift pattern it is an incremental sequenced number to SQL Server, and in this case it can be up to 128 bits long instead of "just" an INT or BIGINT.
I won't give you a full UUID, but from a bit perspective this is pretty much a sequence:
00010001 (17)
00100010 (34)
01000100 (68)
10001000 (136)
If you look closely, you could even identify the most likely sort order. Of course this is a very simplistic way of bit shifting, but it's apparent you have "some sort of" sequence going on, even if it's not that easy to spot once it's buried in a UUID. Bit shifting for sequences is much faster than "set last PK value (+1) as next PK", since addition eats CPU cycles, and that plays into your insert performance, which at some point is limited by the amount of free CPU cycles per second (the ones not being eaten up by other things).
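To make the arithmetic explicit, a tiny sketch of that pattern (a one-bit left shift is the same as doubling, so plain multiplication is used here to keep it portable):

-- Generates 17, 34, 68, 136: each step is the previous value shifted left by one bit.
WITH Shifts AS
(
    SELECT 0 AS Step, CAST(17 AS BIGINT) AS Value   -- 00010001
    UNION ALL
    SELECT Step + 1, Value * 2                      -- shift left by one bit
    FROM   Shifts
    WHERE  Step < 3
)
SELECT Step, Value FROM Shifts;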
May 8, 2020 at 2:30 pm
While I agree that a "Shift-Left" instruction at the register level uses fewer CPU cycles than an "Add" instruction, I don't see the advantage in large data sets that will quickly outstrip the size of even large registers (128 bits, for example). Numerically speaking, a "Shift-Left" of 1 bit is effectively a multiplication by 2 (of course, you know that but had to say it out loud).
I did a search for "Bit Shifting" and "Bit Shifting Counter" but haven't found a decent explanation of how it can be used as a high performance counter for large values. Do you have a link or two that you could point me to so that I can read up on what you're explaining?
--Jeff Moden
Change is inevitable... Change for the better is not.
May 8, 2020 at 2:35 pm
Jeff Moden wrote: With that, I have to ask: since MD5 returns a 16-byte result and you convert that to BIGINT, which is an 8-byte result (a whopping 19 or so decimal orders of magnitude fewer possible values), have you had any problems with collisions between non-duplicate values?
You're right, Jeff, it is a truncated MD5 hash. Still, BIGINT is 64 bits, or -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807). In practice, I'm only using the MD5 surrogate keys on metadata tables, so while collisions are possible, they're not likely to occur across a recordset numbering in the few thousands.
What I'm actually doing is developing a CCPA update engine that's responsible for executing updates across 20+ servers and 1000+ columns we've identified as containing PII. The CPColumnID is hashed across ServerName, DatabaseName, SchemaName, TableName, and ColumnName. I needed some type of computed surrogate key, because CPColumnID is used as a foreign key in several different tables used for mapping, tasks, and auditing. I liked hashing because the ID values are portable and consistent across multiple development and QA environments.
I actually did encounter collisions initially when casting MD5 as INT and generating IDs for a table containing a million randomly generated records. However, it doesn't happen with BIGINT, even when repeating the test many times. But, yeah, I do sometimes think about whether I should have just gone with the ubiquitous INT IDENTITY(1,1).
Thanks for the explanation, Eric. I really appreciate it.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 8, 2020 at 7:13 pm
SEQUENCE and IDENTITY are two very different things. The IDENTITY table property is a very proprietary, totally non-relational feature in SQL Server. By analogy, it is like a parking space in a garage: your car is the data element, while the parking space number, your assigned space, depends on when and where you arrived and has nothing whatsoever to do with a relational model of data.
A SEQUENCE is a particular kind of data structure, with particular mathematical properties. For example (I am not sure if this is still true), in Italy it is required by law that invoice numbers be sequential. If that invoice number appears as a data element in more than one place in your schema, then you have to draw it from the same SEQUENCE. This is one of the advantages of CREATE SEQUENCE (a schema-level data element value) over IDENTITY (a table property having nothing to do with a relational data model).
Please post DDL and follow ANSI/ISO standards when asking for help.