July 27, 2011 at 4:38 pm
Sorry, if I'm posting this in the wrong area, but I couldn't find a more appropriate place to put these questions.
I'm wondering if anyone is using Red Gate's Data Generator and I'd like their opinion on how they like it. I have also been going through the website on their support pages trying to figure out how to automate a package.
To create our test/dev environments I have SQL jobs that backup the prod db, then restore it to dev/test. So what I'm looking to do is call the SQL data Gen package from the SQL job after the db is restored. ...has anyone done this? Thanks in advance!
Anita
Thanks!
July 27, 2011 at 6:39 pm
Why use the generator after importing prod? The process usually stops there!
July 27, 2011 at 8:49 pm
I agree... what are you going to do with the data generator once you've restored a copy of prod? Are you simply trying to obfuscate data?
--Jeff Moden
Change is inevitable... Change for the better is not.
July 28, 2011 at 8:54 am
Yes, I'm trying to scramble the prod data in the test/dev environment. Is this the wrong way to go about this?
Thanks!
July 28, 2011 at 9:01 am
Well ya. Why copy the data if you don't want it?
Seems easier to just built it once then forget about it.
If you need to tune a specific perf problem, copy the prod db to QA and work on it there.
For everything else, the deployements should be scripted and controlled and it should move from dev to test to qa to prod... not the other way around.
July 28, 2011 at 9:08 am
Ok, so how do you propose I build test/dev? The developers need data. Are you suggesting I generate all fake data? Their argument is that the users need to be able to validate data in when they build them new reports or make changes to the applications and they can't validate against fake data. So I was going to do the refresh into test/dev from prod and then just scramble the critical data only like social security numbers, addresses, personal info etc...
I guess I'd like to hear how others are doing it. Maybe I am going about this all wrong.
Anita
Thanks!
July 28, 2011 at 9:19 am
You may consider taking a copy of the prod database, obfucscating some data if needed, and make it your dev standard. Then you take a copy of that (possibly obfuscated) database and restore for dev and QA as needed, which may be several times a day. Without frequent restores the database would drift far from the original state and may no longer make business sense.
You take a new copy whenever you touch schema or whenever the data in the prod changes significantly, e.g., a new type was added, or number of rows in an important table grew significantly. (This is important for perf testing.)
July 28, 2011 at 9:19 am
Sounds like a good plan.
You can use something like this as well :
SELECT CHECKSUM(NEWID()) % 10000 + 1 FROM sys.columns
This will generate random numbers. From that you can build dates too. You can overwrite the text with the same data or just hash it.
July 28, 2011 at 10:08 am
sindhigrl (7/28/2011)
Yes, I'm trying to scramble the prod data in the test/dev environment. Is this the wrong way to go about this?
Nope. Not wrong. This can be done pretty easily. Do you have the Adventure Works DB loaded? I ask because there's two ways to do this and the use of that DB would make life a little easier. I'm at work right now and can't spend the time on it just now. I'll take a whack at it tonight.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 28, 2011 at 10:32 am
Jeff, I can install Adventure works on my test server. Thanks for your help!
Thanks!
July 28, 2011 at 2:16 pm
sindhigrl (7/28/2011)
Jeff, I can install Adventure works on my test server. Thanks for your help!
That'll work great. Adventure works as a table in it with over 19,000 bogus addresses most of which are in a U.S. format. We can build many more off of those if they're needed. How many rows do you need to have obfuscated?
--Jeff Moden
Change is inevitable... Change for the better is not.
July 28, 2011 at 3:54 pm
I have always felt that it would be better to "profile" the production data to determine its characteristics. Then generate data that fits the character of the production data but has been populated with specifics that allow me to test the functionality of my queries and stored procedures.
A simple example would be a column of amounts paid could be populated with the value 1.00 so that when I test my function that sums the column into a total, the value returned would be easily recognized as the same as the number of rows.
This forces you to better understand the data and clearly exposes the functionality that is being coded hopefully to match specific 'USE Case'.
July 28, 2011 at 7:12 pm
sindhigrl (7/28/2011)
Ok, so how do you propose I build test/dev? The developers need data. Are you suggesting I generate all fake data? Their argument is that the users need to be able to validate data in when they build them new reports or make changes to the applications and they can't validate against fake data. So I was going to do the refresh into test/dev from prod and then just scramble the critical data only like [font="Arial Black"]social security numbers[/font], addresses, [font="Arial Black"]personal info [/font]etc...I guess I'd like to hear how others are doing it. Maybe I am going about this all wrong.
Anita
Gosh... I just read that a little better. I would certainly hope that SSN's and other personal information are fully and safely encrypted. Even having someone's birthday listed in a violation of all that is holy in the world of privacy because a common "security question" is "Please verify your birthday." PLEASE make sure that information in encrypted.
As for addresses, you can randomly shuffle street addresses between rows and then randomize the street numbers. Here's how to randomly shuffle the street addresses. You could do the same for all the address lines if absolutely necessary.
--=====================================================================================================================
-- Setup some "burnable" test data to demonstrate on.
-- Nothing in this section is a part of the solution. We're just building test data.
--=====================================================================================================================
--===== Work in a nice safe place that everyone has
USE TempDB
;
--===== Conditionally drop the test table to make reruns in SSMS easier.
IF OBJECT_ID('tempdb..#TestTable','U') IS NOT NULL
DROP TABLE #TestTable
;
--===== Create and populate the test table on-the-fly.
-- We use a well know source for this demonstration.
SELECT AddressID, AddressLine1
INTO #TestTable
FROM AdventureWorks.Person.Address
;
--===== Add the likely Primary Key to the test table.
-- Don't use named constraints on a temp table because of concurrency issues.
-- No problem on tests like this but will cause problems in Production.
ALTER TABLE #TestTable
ADD PRIMARY KEY CLUSTERED (AddressID)
;
--=====================================================================================================================
-- The section solves the problem by randomly changing all the addresses to some other address in the table.
--=====================================================================================================================
--===== Randomly "Scramble" the addresses to another ID
WITH
cteEnumerateStraight AS
(--==== This creates the original non-randomized "sort" of AddressLine1
SELECT RowNum = ROW_NUMBER() OVER (ORDER BY AddressID),
AddressID,
AddressLine1
FROM #TestTable
),
cteEnumerateRandom AS
(--==== This creates a random "sort" of AddressLine1
SELECT RowNum = ROW_NUMBER() OVER (ORDER BY NEWID()),
AddressLine1
FROM #TestTable
)--==== This changes the "straight" address to one of the "random" ones.
-- All addresses are affected because of the uniqueness of ROW_NUMBER()
-- and the fact both RowNum columns were built from the same table.
-- I say "all addresses are affected" but there is the random chance
-- that the random address will be the original. You could check for
-- such a thing by comparing the original table against the Dev table.
UPDATE tgt
SET AddressLine1 = rnd.AddressLine1
FROM cteEnumerateStraight tgt
INNER JOIN cteEnumerateRandom rnd
ON tgt.RowNum = rnd.RowNum
;
Again, I can't stress it enough... you shouldn't have things like SSN's and birth-dates in clear text in production no matter how good your security against the external world is. It's what someone on the inside could do with the info that could cost your company money and reputation as well as causing extremely grave damage to the financial and personal well being of the people whose information you have in the database. Don't wait... fix it NOW!
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 13 posts - 1 through 12 (of 12 total)
You must be logged in to reply to this topic. Login to reply