May 14, 2009 at 1:07 pm
Greetings All,
I was hoping some smart person out there can help me out. I have to update a single field in a table containing over 5 million records. The field is to contain a fake Social Security numbers w/o dashes.
I created a simple algorithm to generate a 9 digit number based of the current time. I wish to simply blow this "unique" value into an already existing table. The idea is to make each row.ssn unique. Can someone help me out?
Below is my fake SSN generation code:
DECLARE @Now DATETIME
DECLARE @GenSSN char(9)
SET @Now = GETDATE()
SELECT @GenSSN = RIGHT(REPLICATE('0',2) + CONVERT(VARCHAR(2),DATEPART(hour,@Now)),2) +
RIGHT(REPLICATE('0',2) + CONVERT(VARCHAR(2),DATEPART(minute,@Now)),2) +
RIGHT(REPLICATE('0',2) + CONVERT(VARCHAR(2),DATEPART(second,@Now)),2) +
RIGHT(REPLICATE('0',3) + CONVERT(VARCHAR(3),DATEPART(millisecond,@Now)),3)
Which yields:14041336
The crux of the question is how do I get these generated values into the 5+ million record table in a reasonable amount of time?
If anyone knows of a better way to do this, I'm all ears.
Thanks in advance
May 14, 2009 at 1:31 pm
This will update your SSN column with 9 digit SSN numbers.
update MyTable
set
ssn = right(convert(bigint,convert(varbinary(7),newid())),9)
May 14, 2009 at 1:39 pm
If you base it on time, if you do the query too fast, you'll end up with duplicate values.
On updating it into the primary table, does that have an ID column by any chance? If not, it's easy enough to work around it, but it would be convenient.
This will generate 5-million unique fake SSNs.
set nocount on;
create table #SSNs (
ID int identity,
SSN char(9) primary key nonclustered);
while
(select count(*)
from #SSNs) 5000000;
If you can do an update into the main table using an ID match, you'll get it done pretty efficiently.
The above took 3 minutes, 39 seconds to generate 5-million rows on my desktop machine.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
May 14, 2009 at 1:44 pm
Thanks for your response. I tried your suggestion and got a ton of duplicates.
Hmmmm.
May 14, 2009 at 1:45 pm
Hi, thanks for your response. Unfortunetly, I don't have an ID. Sux I know.
May 14, 2009 at 1:52 pm
Michael Valentine Jones (5/14/2009)
This will update your SSN column with 9 digit SSN numbers.
update MyTable
set
ssn = right(convert(bigint,convert(varbinary(7),newid())),9)
Michael: because you are truncating the ID to 9 digits, I do not think that you are gauranteed that they will be unique. And if they are random (and not in some kind of subsequence), then they are almost always going to generate duplicates (I think, I haven't done the math yet).
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
May 14, 2009 at 1:57 pm
GSquared (5/14/2009)
If you base it on time, if you do the query too fast, you'll end up with duplicate values....
This will generate 5-million unique fake SSNs.
set nocount on;
create table #SSNs (
ID int identity,
SSN char(9) primary key nonclustered);
while
(select count(*)
from #SSNs) 5000000;
...
The above took 3 minutes, 39 seconds to generate 5-million rows on my desktop machine.
And you didn't get any duplicates? That's very suprising. Hmm, maybe my geusstimation is wrong...
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
May 14, 2009 at 2:05 pm
RBarryYoung (5/14/2009)
GSquared (5/14/2009)
If you base it on time, if you do the query too fast, you'll end up with duplicate values....
This will generate 5-million unique fake SSNs.
set nocount on;
create table #SSNs (
ID int identity,
SSN char(9) primary key nonclustered);
while
(select count(*)
from #SSNs) 5000000;
...
The above took 3 minutes, 39 seconds to generate 5-million rows on my desktop machine.
And you didn't get any duplicates? That's very suprising. Hmm, maybe my geusstimation is wrong...
Barry, this version can't have duplicates. It's a primary key. It's prevented by the distinct and except clauses.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
May 14, 2009 at 2:16 pm
I created a test table:
set nocount on;
go
create table dbo.MyTable (
SSN char(9),
ID uniqueidentifier not null default(newsequentialid()) primary key);
go
insert into dbo.MyTable (SSN)
select top 5000000 null
from dbo.Numbers N1
cross join dbo.Numbers N2;
Then, using the temp table from the prior script, updated the SSNs in it:
;with CTE as
(select SSN,
row_number() over (order by ID) as Row
from dbo.MyTable)
update CTE
set SSN = SSNs.SSN
from #SSNs SSNs
where Row = ID;
As mentioned, without an ID column, it requires one more step. But it's not a big one. The CTE handles that. (Though my test table has a column called "ID", it's not an identity column, it's a GUID.)
Whatever PK you have on the table, the CTE with Row_Number() will allow a 1-to-1 match with the temp SSNs table. On my desktop machine, the update took 1 minute, 20 seconds. So that's about 5-6 minutes total to generate them and put them into the primary table. That's on a Core2 Duo with 2 Gig of RAM. Depending on your environment, it might be faster or slower.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
May 14, 2009 at 2:18 pm
Ok, so I execute the following on the table:
update dbo.masterenroll
SET SSN = (
select distinct right('000' + cast(abs(checksum(newid()))%1000 as varchar(3)), 3) +
right('00' + cast(abs(checksum(newid()))%100 as varchar(2)), 2) +
right('0000' + cast(abs(checksum(newid()))%10000 as varchar(4)), 4)
)
Then when I check for duplicates using the following query:
select SSN, COUNT(SSN) As NumOccurrences
FROM [dbo].masterenroll
GROUP BY SSN
HAVING (Count(SSN) > 1)
What's a bummer is I don't have a PK.
It returns 13089 rows with 2 or more occurances. WTF man? What am I doing wrong?
May 14, 2009 at 2:25 pm
That won't work. You're selecting a distinct value in an inline sub-query, that means that one row is distinct when compared to nothing at all. That won't prevent duplicates at all.
The reason I generate them into a temp table first is because that way you can make sure you have exactly the number you need, and that they are all unique.
Once you've generated them all, in the temp table, then you use that to update the main table.
That gives you unique values.
Of course, depending on what these numbers are being used for, you could just create an incrementing value and pad it with leading zeroes. You'd end up with unique values like "000000001", "000000002", and so on, so they wouldn't look like SSNs, but they'd be the right size, and they would be unique. That might be good enough for things like code tests, where the computer doesn't care if it really looks like a valid SSN.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
May 14, 2009 at 2:38 pm
GSquared (5/14/2009)
RBarryYoung (5/14/2009)
...And you didn't get any duplicates? That's very suprising. Hmm, maybe my geusstimation is wrong...Barry, this version can't have duplicates. It's a primary key. It's prevented by the distinct and except clauses.
Ah, I see. For some reason it wasn't all showing up on my screen before.
Never mind then... :blush:
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
May 14, 2009 at 2:39 pm
Yeah, random data can be problimatic. One quick and dirty thing you coudl do is create a temp table with the PK on the SSN using the IGNORE_DUP_KEY And just do the inserts in batches until you have the number you want. Then insert into your actual table.
May 14, 2009 at 2:45 pm
GSquared (5/14/2009)
That won't work. You're selecting a distinct value in an inline sub-query, that means that one row is distinct when compared to nothing at all. That won't prevent duplicates at all.The reason I generate them into a temp table first is because that way you can make sure you have exactly the number you need, and that they are all unique.
Once you've generated them all, in the temp table, then you use that to update the main table.
That gives you unique values.
Of course, depending on what these numbers are being used for, you could just create an incrementing value and pad it with leading zeroes. You'd end up with unique values like "000000001", "000000002", and so on, so they wouldn't look like SSNs, but they'd be the right size, and they would be unique. That might be good enough for things like code tests, where the computer doesn't care if it really looks like a valid SSN.
You know, this may not be a bad idea. I don't think it has to "look" like a real SSN just simply be unique. I will try this!
Thanks to all who chimed in. Y'all rock!
May 14, 2009 at 3:07 pm
Actually, here's a method that will give you fairly "pseudorandom" SSNs, but always distinct:
Declare @P as Decimal(17)
Declare @s-2 as Decimal(17)
Select @P = 613452691, @s-2 = 1000000000
;WITH cteenroll as (
Select *
, ROW_NUMBER() Over(Order By YourClusteredIndex) as RecNo
From dbo.masterenroll
)
UPDATE cteEnroll
SET SSN = (@P * RecNo) % @s-2
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
Viewing 15 posts - 1 through 15 (of 15 total)
You must be logged in to reply to this topic. Login to reply