Need to insert a new integer that is pseudo-random and unique to a group (not unique to the entire table)

Question

Post reply

Need to insert a new integer that is pseudo-random and unique to a group (not unique to the entire table)

Maxer

SSCrazy Eights

Points: 8964
More actions
June 1, 2012 at 4:21 pm

#274978

We have a column in a table that needs to contain pseudo-random integer. This integer needs to be unique, but not to the entire table.
Only to a group of (for example) product category IDs.
So if we have 1,000 product categories then that integer can't repeat inside a category, but it could repeat more than once in the table (up to 1,000 times, once per category I suppose).
The integer has to be at least 6 digits (I think 6 to 10 would be fine)
I'm fine with creating integers that are "random enough" but how would I go about making sure they don't already exist in the table?
It would be nice to provide the integer during the record insert, but it would be possible to finish all the inserts and then loop back for a second pass to update the records after the fact (though that seems like it would be less ideal and create the need for an overall transaction to prevent records from being left without an integer value.

Viewing 15 posts - 1 through 15 (of 27 total)

You must be logged in to reply to this topic. Login to reply

ColdCoffee SSC-Dedicated Points: 39972 More actions · Answer 1

With the number of visits and the points against your name, i would think that you know the the forum etiquettes.

Anways, as this problem interested me, im preparing the sample data, sample table structure for you and others.

Setting up some sample data:

IF OBJECT_ID('TempDB..#Temp') IS NOT NULL

DROP TABLE #Temp;

CREATE TABLE #Temp

(

iD INT IDENTITY(1,1)

,CatID INT

);

INSERT INTO #Temp (CatID)

SELECT TOP 100 CatID = ( ROW_NUMBER() OVER (ORDER BY (SELECT 0)) % 5 ) + 1

FROM sys.columns S1

CROSS JOIN sys.columns S2

;

Now if you want a truely **random** number but distinct numbers within each group, you can use this

;

WITH DistCatID AS

(

SELECT DISTINCT T.CatID

FROM #Temp T

)

, RanNum As

(

SELECT CatID , RanCat = ( ABS( CHECKSUM ( NEWID() ) ) % 999999 )

FROM DistCatID

)

SELECT T.iD , T.CatID

,RealRandomNumber = ROW_NUMBER() OVER (PARTITION BY T.CatID ORDER BY T.iD) * CrsApp.RanCat

FROM #Temp T

CROSS APPLY ( SELECT RN.RanCat

FROM RanNum RN

WHERE RN.CatID = T.CatID

) CrsApp

;

If you are OK with distinct sequential numbers in each group, then this would do:

SELECT T.iD , T.CatID

,DistSeqNumber = ROW_NUMBER() OVER (PARTITION BY T.CatID ORDER BY T.iD)

FROM #Temp T

ORDER BY T.CatID , T.iD

;

Jeff Moden SSC Guru Points: 1004339 More actions · Answer 2

Gosh... I have to ask why you want to do this to yourself. What, if any, is the advantage to using random unique integers instead of nice, easy to create, unique, sequential integers? Having integers (for example) of 1 through 100 that were created randomly will be exactly the same as integers 1 through 100 created sequentially. Don't make the mistake of thinking about "order" in the tables because 1) it's no guarantee of order and 2) it shouldn't matter.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ColdCoffee SSC-Dedicated Points: 39972 More actions · Answer 3

Here is the article on the netiquettes to follow while posting, if you have not read it already

How to post questions[/url]

Maxer SSCrazy Eights Points: 8964 More actions · Answer 4

Apologies, I should have been more clear. I wasn't hoping for a tsql direct solution, more just a general discussion.

As for it being "random", I agree it makes no sense, but I didn't build it, I just have to live with it...

Paul White SSC Guru Points: 150467 More actions · Answer 5

Maxer (6/1/2012)
Apologies, I should have been more clear. I wasn't hoping for a tsql direct solution, more just a general discussion. As for it being "random", I agree it makes no sense, but I didn't build it, I just have to live with it...

No need to apologise, the previous respondents may simply have misunderstood you (these things will happen on a forum).

Can you describe the features any solution should ideally have and the design restrictions a little more precisely? My first thought is to populate the random values as part of a DEFAULT constraint or using an INSTEAD OF trigger, but more information would definitely be helpful at this point. I'm as interested as the next person to know what the purpose of the random numbers is, but more interested in how many rows there are, where they come from, whether the numbers ever need to be changed...that sort of thing.

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

Jeff Moden SSC Guru Points: 1004339 More actions · Answer 6

NM.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004339 More actions · Answer 7

Maxer (6/1/2012)
Apologies, I should have been more clear. I wasn't hoping for a tsql direct solution, more just a general discussion.
As for it being "random", I agree it makes no sense, but I didn't build it, I just have to live with it...

Thanks for the reply. I've had to live with these sorts of things myself.

As a general discussion on the subject, this type of thing can be a wee bit of a performance bottleneck. If you want to assign the numbers for more than 1 item at a time and owing to the nature of random numbers, one way to do this would be to calculate more random numbers that you need and then select the generated random numbers that weren't already "IN" the table. 6 to 10 digits will give you 9,999,999,999 - 99,999 or 9,999,900,000 (10 billion for all practical purposes) different numbers. If your table never ends up having more than, say, several million rows in it as time wears on, you probably won't have a problem finding unique numbers to fit although verifying that new random numbers don't already exist in the table will make the process run longer and longer as the table gets larger.

To make things a little faster, you could generate 2 or 3 times the random numbers you need for the number of rows you want to insert and then select the "top" number of rows that you need. Although it's not likely to happen, you'd have to add a check to ensure that you actually found the right amount of random numbers that weren't already "IN" the table. If the check failed, you'd have to try again. It is possible (although I think not likely given the number of rows stated) that process would have to "try" several times before it came up with enough numbers for the inserted rows during any insert (including just one row).

You could try to find random numbers for each row in a RBAR fashion but I think the "check" for uniqueness for each row could take quite some time unless you had a decent index on the column. But, there comes another bit of a problem. As with a GUID (which is nothing more than a random number, anymore), an index will quickly become fragmented during the insertion of random numbers. Of course, fragmented indexes will still operate very quickly on single lookups but you should probably make sure you have a decent maintenance plan for such indexes. By no means do I recommend that such an index be a Clustered index because the ensuing page splits will likely cause timeouts for an app that's driving all of the inserts.

As Paul said, an "Instead of" trigger may be just what the doctor ordered for this.

Another way to do this might be to precalculate several hundred thousand sequential numbers in a table along with a GUID column and a "used" column". Then, make sure you rebuild the indexes on that table and simply select the TOP number of random numbers you need from the table that haven't been marked as "used". The table wouldn't take up as much room as you might think and could eliminate the RBAR aspect of doing the check for uniqueness because the numbers would originally be generated as sequential numbers which would, of course, guarantee uniqueness. If you got close to running out of numbers, you could simply add more sequential numbers to the table along with their randomizing GUIDs, rebuild the indexes, and continue on.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Maxer SSCrazy Eights Points: 8964 More actions · Answer 8

Interesting point about pre-generating the values.

That does make sense about getting the job done.

I'll play around with that and post back a sample table structure and some data later.

As for the reason, it is for a sort of raffle system.

What happens is the user enters data and is given a completion code that is more or less random AND unique to that user's business.

So while that completion code may appear multiple times in the table, it will never appear more than once for the business or department or whatever level the user registered at.

The really odd part is that on that table there is no proper primary key linking completion of the raffle data entry with the user.

Instead you have the business or department ID and the completion code.

Without those two as a combined key you will not get a unique hit on the table.

It would have been far better if the completion code was never used as part of a combined key and that table just had a proper autonum ID that linked back to the user creating their account and the account of course linking back to the department/business ID.

Jeff Moden SSC Guru Points: 1004339 More actions · Answer 9

CELKO (6/3/2012)
This is called an additive congruency generator, as opposed to a random number generator. Here is the formula for a 31-bit additive congruency generator. It does not repeat until it has completed a full cycle.
UPDATE generator
SET keyval = keyval/2 + MOD(MOD(keyval, 2) + MOD(keyval/8, 2), 2) * 2^30;
Here is the same algorithm implemented in C.
int asequence()
{static int n = 1;
n = n>>1 | (( n^n>>3 ) & 1) << 30;
return n;}
There are other formulas for different length integers.

That's pretty cool, Joe. How do you control the domain of numbers returned for such a thing and still maintain the uniqueness of values?

I can see how to offset the domain starting value using simple addition but how would you constrain this to return unique values with a range of 6 to 10 digits?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004339 More actions · Answer 10

For those that want to play with Joe's formula, here's a bit of code with his formula converted to T-SQL. It also has the "proof" that it generated unique INTEGERS.

--===== Conditionally drop the "proof" table to make reruns in SSMS

IF OBJECT_ID('tempdb..#Proof1','U') IS NOT NULL DROP TABLE #Proof1;

GO

SET NOCOUNT ON;

--===== Declare some obviously named variables

DECLARE @Counter BIGINT,

@KeyVal BIGINT

;

--===== Preset the variables to a known condition

SELECT @Counter = 1,

@KeyVal = 1 -- This is the "seed". Changing this creates a different set of numbers.

;

--===== Create the first proof table for the first experiment

CREATE TABLE #Proof1 (SeqNum BIGINT, KeyVal BIGINT)

;

--===== Populate the first proof table using the "additive congruency generator".

-- We need a While Loop because the next value depends on the previous value.

WHILE @Counter < = 1000000

BEGIN

SELECT @KeyVal = @KeyVal/2 + (((@KeyVal%2)+(@KeyVal/8%2))*2) * POWER(2,30);

INSERT INTO #Proof1 (SeqNum, KeyVal)

VALUES (@Counter, @KeyVal);

SELECT @Counter = @Counter + 1;

END

;

--===== Proof that the generated "KeyVal" numbers are unique over the million rows.

SELECT COUNT(DISTINCT KeyVal)

FROM #Proof1

;

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Paul White SSC Guru Points: 150467 More actions · Answer 11

CELKO (6/3/2012)
This is called an additive congruency generator, as opposed to a random number generator.

There are also Linear Conguential Generators, described as "...one of the oldest and best-known pseudorandom number generator algorithms.".

Using the parameters used in the Numerical Recipes book:

IF OBJECT_ID(N'tempdb..#Proof1', N'U')

IS NOT NULL

DROP TABLE #Proof1;

GO

CREATE TABLE #Proof1

(

SeqNum integer NOT NULL,

KeyVal bigint NOT NULL);

WITH rCTE AS

(

SELECT

Sequence = 1,

KeyVal = (SELECT MAX(ms_ticks) FROM sys.dm_os_sys_info)

UNION ALL

SELECT

Sequence + 1,

((KeyVal * 1664525) + 1013904223) % CONVERT(bigint, 4294967296)

FROM rCTE

WHERE

Sequence < 1000000

)

INSERT #Proof1

(

SeqNum,

KeyVal

)

SELECT

Sequence,

KeyVal

FROM rCTE

OPTION (MAXRECURSION 0);

GO

SELECT TOP (100)

p.SeqNum,

p.KeyVal

FROM #Proof1 AS p;

GO

DROP TABLE #Proof1;

This appears to be a bit faster, and seems to generate numbers without the obvious progression seen in Joe's example. Anyway, the method used to generate the pseudo-random numbers isn't really the issue here 🙂

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

Paul White SSC Guru Points: 150467 More actions · Answer 12

...though it's not clear why we wouldn't just use the built-in RAND() instead:

IF OBJECT_ID(N'tempdb..#Proof1', N'U')

IS NOT NULL

DROP TABLE #Proof1;

GO

CREATE TABLE #Proof1

(

SeqNum integer NOT NULL,

KeyVal integer NOT NULL);

WITH rCTE AS

(

SELECT

Sequence = 1,

KeyVal = CONVERT(integer, 2147483647 * RAND((SELECT MAX(ms_ticks) FROM sys.dm_os_sys_info) % 2147483647))

UNION ALL

SELECT

Sequence + 1,

CONVERT(integer, RAND(KeyVal) * 2147483647)

FROM rCTE

WHERE

Sequence < 1000000

)

INSERT #Proof1

(

SeqNum,

KeyVal

)

SELECT

Sequence,

KeyVal

FROM rCTE

OPTION (MAXRECURSION 0);

GO

SELECT TOP (100)

p.SeqNum,

p.KeyVal

FROM #Proof1 AS p;

GO

SELECT COUNT(DISTINCT KeyVal) FROM #Proof1 AS p

GO

DROP TABLE #Proof1;

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

Jeff Moden SSC Guru Points: 1004339 More actions · Answer 13

SQL Kiwi (6/3/2012)
...though it's not clear why we wouldn't just use the built-in RAND() instead:

Agreed except for one thing... it's possible for RAND() and things like ABS(CHKSUM(NEWID())) to repeat when range limits and domain modifiers are applied. The various congruity generators don't appear to have that problem until every possible number in the sequence has been used (truely pseduo random?). Of course, I could be wrong because (drum roll please 😉 ).... I normally just use the built-in functions instead. 😀

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Paul White SSC Guru Points: 150467 More actions · Answer 14

Jeff Moden (6/4/2012)
Agreed except for one thing... it's possible for RAND() and things like ABS(CHKSUM(NEWID())) to repeat when range limits and domain modifiers are applied. The various congruity generators don't appear to have that problem until every possible number in the sequence has been used (truely pseduo random?).

That appears to be true of RAND too, when the previous value is used as the next seed and the whole range of an integer is used (see my demo code). I really don't think there's anything special about the congruity generators in that respect (and who knows, RAND may be implemented similarly internally). Squishing the output of the generator into a smaller space will generate collisions too. Oh, one more thing, there is another way to generate cryptographic pseudo-random numbers in SQL Server 2008 and later:

SELECT

spv.number,

CONVERT(bigint, CRYPT_GEN_RANDOM(8))

FROM master.dbo.spt_values AS spv

WHERE

spv.[type] = N'P';

You might like that because it can generate multiple values per query, unlike RAND.

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi