Gen Bogus SSNs?

Question

Post reply

Gen Bogus SSNs?

robert.szymanski

SSC-Addicted

Points: 484
More actions
May 14, 2009 at 1:07 pm

#132277

Greetings All,
I was hoping some smart person out there can help me out. I have to update a single field in a table containing over 5 million records. The field is to contain a fake Social Security numbers w/o dashes.
I created a simple algorithm to generate a 9 digit number based of the current time. I wish to simply blow this "unique" value into an already existing table. The idea is to make each row.ssn unique. Can someone help me out?
Below is my fake SSN generation code:
DECLARE @Now DATETIME
DECLARE @GenSSN char(9)
SET @Now = GETDATE()
SELECT @GenSSN = RIGHT(REPLICATE('0',2) + CONVERT(VARCHAR(2),DATEPART(hour,@Now)),2) +
RIGHT(REPLICATE('0',2) + CONVERT(VARCHAR(2),DATEPART(minute,@Now)),2) +
RIGHT(REPLICATE('0',2) + CONVERT(VARCHAR(2),DATEPART(second,@Now)),2) +
RIGHT(REPLICATE('0',3) + CONVERT(VARCHAR(3),DATEPART(millisecond,@Now)),3)
Which yields:14041336
The crux of the question is how do I get these generated values into the 5+ million record table in a reasonable amount of time?
If anyone knows of a better way to do this, I'm all ears.
Thanks in advance

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply

Michael Valentine Jones SSC Guru Points: 64818 More actions · Answer 1

This will update your SSN column with 9 digit SSN numbers.

update MyTable

set

ssn = right(convert(bigint,convert(varbinary(7),newid())),9)

GSquared SSC Guru Points: 260824 More actions · Answer 2

If you base it on time, if you do the query too fast, you'll end up with duplicate values.

On updating it into the primary table, does that have an ID column by any chance? If not, it's easy enough to work around it, but it would be convenient.

This will generate 5-million unique fake SSNs.

set nocount on;

create table #SSNs (

ID int identity,

SSN char(9) primary key nonclustered);

while

(select count(*)

from #SSNs) 5000000;

If you can do an update into the main table using an ID match, you'll get it done pretty efficiently.

The above took 3 minutes, 39 seconds to generate 5-million rows on my desktop machine.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

robert.szymanski SSC-Addicted Points: 484 More actions · Answer 3

Thanks for your response. I tried your suggestion and got a ton of duplicates.

Hmmmm.

robert.szymanski SSC-Addicted Points: 484 More actions · Answer 4

Hi, thanks for your response. Unfortunetly, I don't have an ID. Sux I know.

RBarryYoung SSC Guru Points: 143329 More actions · Answer 5

Michael Valentine Jones (5/14/2009)
This will update your SSN column with 9 digit SSN numbers.
update MyTable
set
ssn = right(convert(bigint,convert(varbinary(7),newid())),9)

Michael: because you are truncating the ID to 9 digits, I do not think that you are gauranteed that they will be unique. And if they are random (and not in some kind of subsequence), then they are almost always going to generate duplicates (I think, I haven't done the math yet).

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

RBarryYoung SSC Guru Points: 143329 More actions · Answer 6

GSquared (5/14/2009)
If you base it on time, if you do the query too fast, you'll end up with duplicate values.
...
This will generate 5-million unique fake SSNs.
set nocount on;
create table #SSNs (
ID int identity,
SSN char(9) primary key nonclustered);
while
(select count(*)
from #SSNs) 5000000;
...
The above took 3 minutes, 39 seconds to generate 5-million rows on my desktop machine.

And you didn't get any duplicates? That's very suprising. Hmm, maybe my geusstimation is wrong...

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

GSquared SSC Guru Points: 260824 More actions · Answer 7

RBarryYoung (5/14/2009)
GSquared (5/14/2009)
If you base it on time, if you do the query too fast, you'll end up with duplicate values.
...
This will generate 5-million unique fake SSNs.
set nocount on;
create table #SSNs (
ID int identity,
SSN char(9) primary key nonclustered);
while
(select count(*)
from #SSNs) 5000000;
...
The above took 3 minutes, 39 seconds to generate 5-million rows on my desktop machine.
And you didn't get any duplicates? That's very suprising. Hmm, maybe my geusstimation is wrong...

Barry, this version can't have duplicates. It's a primary key. It's prevented by the distinct and except clauses.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

GSquared SSC Guru Points: 260824 More actions · Answer 8

I created a test table:

set nocount on;

go

create table dbo.MyTable (

SSN char(9),

ID uniqueidentifier not null default(newsequentialid()) primary key);

go

insert into dbo.MyTable (SSN)

select top 5000000 null

from dbo.Numbers N1

cross join dbo.Numbers N2;

Then, using the temp table from the prior script, updated the SSNs in it:

;with CTE as

(select SSN,

row_number() over (order by ID) as Row

from dbo.MyTable)

update CTE

set SSN = SSNs.SSN

from #SSNs SSNs

where Row = ID;

As mentioned, without an ID column, it requires one more step. But it's not a big one. The CTE handles that. (Though my test table has a column called "ID", it's not an identity column, it's a GUID.)

Whatever PK you have on the table, the CTE with Row_Number() will allow a 1-to-1 match with the temp SSNs table. On my desktop machine, the update took 1 minute, 20 seconds. So that's about 5-6 minutes total to generate them and put them into the primary table. That's on a Core2 Duo with 2 Gig of RAM. Depending on your environment, it might be faster or slower.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

robert.szymanski SSC-Addicted Points: 484 More actions · Answer 9

Ok, so I execute the following on the table:

update dbo.masterenroll

SET SSN = (

select distinct right('000' + cast(abs(checksum(newid()))%1000 as varchar(3)), 3) +

right('00' + cast(abs(checksum(newid()))%100 as varchar(2)), 2) +

right('0000' + cast(abs(checksum(newid()))%10000 as varchar(4)), 4)

)

Then when I check for duplicates using the following query:

select SSN, COUNT(SSN) As NumOccurrences

FROM [dbo].masterenroll

GROUP BY SSN

HAVING (Count(SSN) > 1)

What's a bummer is I don't have a PK.

It returns 13089 rows with 2 or more occurances. WTF man? What am I doing wrong?

GSquared SSC Guru Points: 260824 More actions · Answer 10

That won't work. You're selecting a distinct value in an inline sub-query, that means that one row is distinct when compared to nothing at all. That won't prevent duplicates at all.

The reason I generate them into a temp table first is because that way you can make sure you have exactly the number you need, and that they are all unique.

Once you've generated them all, in the temp table, then you use that to update the main table.

That gives you unique values.

Of course, depending on what these numbers are being used for, you could just create an incrementing value and pad it with leading zeroes. You'd end up with unique values like "000000001", "000000002", and so on, so they wouldn't look like SSNs, but they'd be the right size, and they would be unique. That might be good enough for things like code tests, where the computer doesn't care if it really looks like a valid SSN.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

RBarryYoung SSC Guru Points: 143329 More actions · Answer 11

GSquared (5/14/2009)
RBarryYoung (5/14/2009)
...And you didn't get any duplicates? That's very suprising. Hmm, maybe my geusstimation is wrong...
Barry, this version can't have duplicates. It's a primary key. It's prevented by the distinct and except clauses.

Ah, I see. For some reason it wasn't all showing up on my screen before.

Never mind then... :blush:

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Lamprey13 SSCertifiable Points: 6210 More actions · Answer 12

Yeah, random data can be problimatic. One quick and dirty thing you coudl do is create a temp table with the PK on the SSN using the IGNORE_DUP_KEY And just do the inserts in batches until you have the number you want. Then insert into your actual table.

robert.szymanski SSC-Addicted Points: 484 More actions · Answer 13

GSquared (5/14/2009)
That won't work. You're selecting a distinct value in an inline sub-query, that means that one row is distinct when compared to nothing at all. That won't prevent duplicates at all.
The reason I generate them into a temp table first is because that way you can make sure you have exactly the number you need, and that they are all unique.
Once you've generated them all, in the temp table, then you use that to update the main table.
That gives you unique values.
Of course, depending on what these numbers are being used for, you could just create an incrementing value and pad it with leading zeroes. You'd end up with unique values like "000000001", "000000002", and so on, so they wouldn't look like SSNs, but they'd be the right size, and they would be unique. That might be good enough for things like code tests, where the computer doesn't care if it really looks like a valid SSN.

You know, this may not be a bad idea. I don't think it has to "look" like a real SSN just simply be unique. I will try this!

Thanks to all who chimed in. Y'all rock!

RBarryYoung SSC Guru Points: 143329 More actions · Answer 14

Actually, here's a method that will give you fairly "pseudorandom" SSNs, but always distinct:

Declare @P as Decimal(17)

Declare @S as Decimal(17)

Select @P = 613452691, @S = 1000000000

;WITH cteenroll as (

Select *

, ROW_NUMBER() Over(Order By YourClusteredIndex) as RecNo

From dbo.masterenroll

)

UPDATE cteEnroll

SET SSN = (@P * RecNo) % @S

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]