September 20, 2011 at 10:02 am
I need to be able to replace existing surnames and forenames in a data table with names selected at random from forename and surname tables I have created from list of surnames and forenames made available by the US Census Bureau. I have a large volume of data, so I would like to be able to create an update query that does it for me in one go if possible. This is so I can give test data to an offshore developer without contravening data protection legislation.
The scenario is as follows:
Main data table: Subscriber
ID - int, Identity(1,1)
Surname Varchar(15)
Forename Varchar(15)
plus many other fields that are not relevant
500,000+ records
Surname Table: Surname
ID int, Identity (1,1)
Surname Varchar(15)
89,000+ records
Forename Table: Forename
ID int Identity (1,1)
Forename Varchar(15)
2,500+ Records
I want to replace each Surname in the Subscriber table with a randomly selected record from the Surname table, and replace each forename with a randomly selected record from the Forename table, ideally without resorting to a cursor. e.g John Smith might become Francisco Gotti, Mary Williams becomes Albert Jones etc - I do not have to match name genders
All suggestions gratefully received.
September 20, 2011 at 10:27 am
i believe this will do what you want...obviously test this first!
--random first land last names,
;With MyRandomNamePairs AS
( SELECT
ROW_NUMBER() OVER(ORDER BY Surname.ID) AS N,
Forename.Forename, --column name same as tablename? yuck!
Surname.Surname --column name same as tablename? yuck!
FROM Forename
CROSS JOIN Surname )
,CTERANDOMIZEDDATA AS
(
SELECT Subscriber.ID,
MyRandomNamePairs.Forename,
MyRandomNamePairs,Surname,
ROWNUM = ROW_NUMBER() OVER (PARTITION BY Subscriber.ID ORDER BY NEWID())
FROM Subscriber --THE TABLE WITH THE KEY
CROSS JOIN MyRandomNamePairs --THE TABLE WITH THE RANDOM VALUE
)
UPDATE Subscriber --THE JOINING TABLE
SET Subscriber.Forename = CTERANDOMIZEDDATA.Forename, --THE RANDOM VALUE
Subscriber.Surname = CTERANDOMIZEDDATA.Surname --THE RANDOM VALUE
FROM CTERANDOMIZEDDATA
WHERE Subscriber.ID=CTERANDOMIZEDDATA.ID --FOR EACH KEY
AND ROWNUM = 1 --LIMITS TO ONE ARBITRARY VALUE DUE TO THE ORDER BY NEWID
Lowell
September 21, 2011 at 8:03 am
Lowell - thanks for your response
With the exception of a small typo (a comma in MyRandomNamePairs,Surname instead of a period), it does appear to work, but the performance is not good - I restricted it to updating the first 100 records in the subscription table by adding where id < 101 to the selection criteria in the update statement, and it was still executing after about 6 minutes. I suspect this might be down to the two cross joins.
I tried a different tack, selecting random surnames and forenames independently using a couple of functions and a view:
CREATE VIEW [dbo].[GetNewID]
AS
SELECT NewId() AS [NewID]
Create FUNCTION [dbo].[GetRandomForename]()
RETURNS varchar(15)
AS
BEGIN
RETURN (
SELECT TOP 1 Forename
FROM dbo.Forenames --The table and column now have different names!
ORDER BY (SELECT [NewId] FROM GetNewID)
)
END
CREATE FUNCTION [dbo].[GetRandomSurname]()
RETURNS nvarchar(15)
AS
BEGIN
RETURN (
SELECT TOP 1 Surname --The table and column now have different names!
FROM dbo.Surnames
ORDER BY (SELECT [NewId] FROM GetNewID)
)
END
I then ran the update query below - this updated 100 records in 21 secs, and 500 in 1m 56s
UPDATE dbo.subscription
SET dbo.subscription.Forename = dbo.GetRandomForename(),
dbo.subscription.Surname = dbo.GetRandomSurname()
where id < 501
September 21, 2011 at 8:26 am
a lot depends on your data; that cross join can have a cost.
if your data is anything like my copies of the US Census to 90% firstnames/lastnames, a cross join makes a massive 108 Million+ rows;
Census_dist_Male--1219 first names
Census_dist_Last --88799 last names
88799 * 1219 = 108,245,981
myself, i use a smaller subset; top 20 firstnames, and 20 last names, and i don't mind if the names get repeated every 400 records; in that way the performance is a bit better.
seperate updates of firstname, then lastname, without a cross join can perform better than trying to generate firstname + lastname randomly together.
Lowell
September 21, 2011 at 8:31 am
Lowell
I don't want too many repeats, so I'm using the full census tables.
Using the two seperate functions, it updated 2500 rows in 12 minutes - for 500,000 records I will probably leave it running overnight
Many thanks for your suggestions
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply