Gen Bogus SSNs?

  • Greetings All,

    I was hoping some smart person out there can help me out. I have to update a single field in a table containing over 5 million records. The field is to contain a fake Social Security numbers w/o dashes.

    I created a simple algorithm to generate a 9 digit number based of the current time. I wish to simply blow this "unique" value into an already existing table. The idea is to make each row.ssn unique. Can someone help me out?

    Below is my fake SSN generation code:

    DECLARE @Now DATETIME

    DECLARE @GenSSN char(9)

    SET @Now = GETDATE()

    SELECT @GenSSN = RIGHT(REPLICATE('0',2) + CONVERT(VARCHAR(2),DATEPART(hour,@Now)),2) +

    RIGHT(REPLICATE('0',2) + CONVERT(VARCHAR(2),DATEPART(minute,@Now)),2) +

    RIGHT(REPLICATE('0',2) + CONVERT(VARCHAR(2),DATEPART(second,@Now)),2) +

    RIGHT(REPLICATE('0',3) + CONVERT(VARCHAR(3),DATEPART(millisecond,@Now)),3)

    Which yields:14041336

    The crux of the question is how do I get these generated values into the 5+ million record table in a reasonable amount of time?

    If anyone knows of a better way to do this, I'm all ears.

    Thanks in advance

  • This will update your SSN column with 9 digit SSN numbers.

    update MyTable

    set

    ssn = right(convert(bigint,convert(varbinary(7),newid())),9)

  • If you base it on time, if you do the query too fast, you'll end up with duplicate values.

    On updating it into the primary table, does that have an ID column by any chance? If not, it's easy enough to work around it, but it would be convenient.

    This will generate 5-million unique fake SSNs.

    set nocount on;

    create table #SSNs (

    ID int identity,

    SSN char(9) primary key nonclustered);

    while

    (select count(*)

    from #SSNs) 5000000;

    If you can do an update into the main table using an ID match, you'll get it done pretty efficiently.

    The above took 3 minutes, 39 seconds to generate 5-million rows on my desktop machine.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Thanks for your response. I tried your suggestion and got a ton of duplicates.

    Hmmmm.

  • Hi, thanks for your response. Unfortunetly, I don't have an ID. Sux I know.

  • Michael Valentine Jones (5/14/2009)


    This will update your SSN column with 9 digit SSN numbers.

    update MyTable

    set

    ssn = right(convert(bigint,convert(varbinary(7),newid())),9)

    Michael: because you are truncating the ID to 9 digits, I do not think that you are gauranteed that they will be unique. And if they are random (and not in some kind of subsequence), then they are almost always going to generate duplicates (I think, I haven't done the math yet).

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • GSquared (5/14/2009)


    If you base it on time, if you do the query too fast, you'll end up with duplicate values.

    ...

    This will generate 5-million unique fake SSNs.

    set nocount on;

    create table #SSNs (

    ID int identity,

    SSN char(9) primary key nonclustered);

    while

    (select count(*)

    from #SSNs) 5000000;

    ...

    The above took 3 minutes, 39 seconds to generate 5-million rows on my desktop machine.

    And you didn't get any duplicates? That's very suprising. Hmm, maybe my geusstimation is wrong...

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • RBarryYoung (5/14/2009)


    GSquared (5/14/2009)


    If you base it on time, if you do the query too fast, you'll end up with duplicate values.

    ...

    This will generate 5-million unique fake SSNs.

    set nocount on;

    create table #SSNs (

    ID int identity,

    SSN char(9) primary key nonclustered);

    while

    (select count(*)

    from #SSNs) 5000000;

    ...

    The above took 3 minutes, 39 seconds to generate 5-million rows on my desktop machine.

    And you didn't get any duplicates? That's very suprising. Hmm, maybe my geusstimation is wrong...

    Barry, this version can't have duplicates. It's a primary key. It's prevented by the distinct and except clauses.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I created a test table:

    set nocount on;

    go

    create table dbo.MyTable (

    SSN char(9),

    ID uniqueidentifier not null default(newsequentialid()) primary key);

    go

    insert into dbo.MyTable (SSN)

    select top 5000000 null

    from dbo.Numbers N1

    cross join dbo.Numbers N2;

    Then, using the temp table from the prior script, updated the SSNs in it:

    ;with CTE as

    (select SSN,

    row_number() over (order by ID) as Row

    from dbo.MyTable)

    update CTE

    set SSN = SSNs.SSN

    from #SSNs SSNs

    where Row = ID;

    As mentioned, without an ID column, it requires one more step. But it's not a big one. The CTE handles that. (Though my test table has a column called "ID", it's not an identity column, it's a GUID.)

    Whatever PK you have on the table, the CTE with Row_Number() will allow a 1-to-1 match with the temp SSNs table. On my desktop machine, the update took 1 minute, 20 seconds. So that's about 5-6 minutes total to generate them and put them into the primary table. That's on a Core2 Duo with 2 Gig of RAM. Depending on your environment, it might be faster or slower.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Ok, so I execute the following on the table:

    update dbo.masterenroll

    SET SSN = (

    select distinct right('000' + cast(abs(checksum(newid()))%1000 as varchar(3)), 3) +

    right('00' + cast(abs(checksum(newid()))%100 as varchar(2)), 2) +

    right('0000' + cast(abs(checksum(newid()))%10000 as varchar(4)), 4)

    )

    Then when I check for duplicates using the following query:

    select SSN, COUNT(SSN) As NumOccurrences

    FROM [dbo].masterenroll

    GROUP BY SSN

    HAVING (Count(SSN) > 1)

    What's a bummer is I don't have a PK.

    It returns 13089 rows with 2 or more occurances. WTF man? What am I doing wrong?

  • That won't work. You're selecting a distinct value in an inline sub-query, that means that one row is distinct when compared to nothing at all. That won't prevent duplicates at all.

    The reason I generate them into a temp table first is because that way you can make sure you have exactly the number you need, and that they are all unique.

    Once you've generated them all, in the temp table, then you use that to update the main table.

    That gives you unique values.

    Of course, depending on what these numbers are being used for, you could just create an incrementing value and pad it with leading zeroes. You'd end up with unique values like "000000001", "000000002", and so on, so they wouldn't look like SSNs, but they'd be the right size, and they would be unique. That might be good enough for things like code tests, where the computer doesn't care if it really looks like a valid SSN.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (5/14/2009)


    RBarryYoung (5/14/2009)


    ...And you didn't get any duplicates? That's very suprising. Hmm, maybe my geusstimation is wrong...

    Barry, this version can't have duplicates. It's a primary key. It's prevented by the distinct and except clauses.

    Ah, I see. For some reason it wasn't all showing up on my screen before.

    Never mind then... :blush:

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Yeah, random data can be problimatic. One quick and dirty thing you coudl do is create a temp table with the PK on the SSN using the IGNORE_DUP_KEY And just do the inserts in batches until you have the number you want. Then insert into your actual table.

  • GSquared (5/14/2009)


    That won't work. You're selecting a distinct value in an inline sub-query, that means that one row is distinct when compared to nothing at all. That won't prevent duplicates at all.

    The reason I generate them into a temp table first is because that way you can make sure you have exactly the number you need, and that they are all unique.

    Once you've generated them all, in the temp table, then you use that to update the main table.

    That gives you unique values.

    Of course, depending on what these numbers are being used for, you could just create an incrementing value and pad it with leading zeroes. You'd end up with unique values like "000000001", "000000002", and so on, so they wouldn't look like SSNs, but they'd be the right size, and they would be unique. That might be good enough for things like code tests, where the computer doesn't care if it really looks like a valid SSN.

    You know, this may not be a bad idea. I don't think it has to "look" like a real SSN just simply be unique. I will try this!

    Thanks to all who chimed in. Y'all rock!

  • Actually, here's a method that will give you fairly "pseudorandom" SSNs, but always distinct:

    Declare @P as Decimal(17)

    Declare @s-2 as Decimal(17)

    Select @P = 613452691, @s-2 = 1000000000

    ;WITH cteenroll as (

    Select *

    , ROW_NUMBER() Over(Order By YourClusteredIndex) as RecNo

    From dbo.masterenroll

    )

    UPDATE cteEnroll

    SET SSN = (@P * RecNo) % @s-2

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply