Have you ever been tasked with creating test data for an application, only to run into performance problems once the application moved to production?
Many of us manage databases or applications that contain regulated data that can’t leave a production environment. This means we need to “clean” the data before it can be used in QA or development work. One common way to de-identify the data is to simply update columns like firstname and lastname with a “firstname” + counter format; however, this results in data that is entirely unique and sequential: Firstname1, Firstname2, Firstname3, …
This isn’t good for getting like-for-like results against a production database, and it can lead to a question we’ve all heard in the workplace: “Why didn’t we catch this in QA?”
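To make the problem concrete, here is the kind of counter-based de-identification script I’m describing (the dbo.Customers table and its columns are illustrative, not from a real system):

```sql
-- Hypothetical customer table being "cleaned" for QA.
-- Every row gets a unique, sequential name -- nothing like real data.
UPDATE c
SET firstname = 'Firstname' + CAST(rn AS varchar(10)),
    lastname  = 'Lastname'  + CAST(rn AS varchar(10))
FROM (
    SELECT firstname, lastname,
           ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM dbo.Customers
) AS c;
```

Because every resulting value is distinct and sequential, indexes and statistics on these columns behave nothing like they do against production data, which is exactly where those QA surprises come from.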
Dawn of AI
We have so many new tools at our fingertips. Prompt-based AI like ChatGPT can suggest email formatting and content, research details on Azure and other topics, and even create stories or poems. I asked myself, “Can ChatGPT generate test data for SQL Server (and other databases)?”
The answer is yes, but it comes with limitations. Let’s discuss.
Generating Realistic Test Data with AI
At first, I questioned whether ChatGPT could create test data for a table schema at all, since it uses a general-purpose model that isn’t specific to SQL. I prompted: “Create test data for my table with the following schema: create table x (id int, firstname varchar(20), lastname varchar(20), lastpurchasedate datetime2(0))” and the result was surprisingly good!
Next, I wondered how fancy I could get and asked for some of the values in the date columns to be null. Again, I was quite pleased with the result.
Knowing that the AI can generate test data, I figured: why not have it create 5,000 rows that I could dump into my table and get straight to testing? Let’s push this thing and see what we get. The result was not quite what I wanted. In fact, it produced code that creates data with the exact issue described in the opening paragraphs of this post. Non-realistic, sequential test data is not what we’re after.
Why not prompt it to create “realistic data”?
OK, this code is good, but there’s a big problem with it: it doesn’t actually generate any test data. It assumes there are firstname and lastname tables to pull values from. But wait, we can use this code if we generate smaller batches of names with prompts, right? Let’s see how many names we can generate at one time. The answer is 50. I tried 5,000, then 100, then 50. Anything over 50 results in 50 names and a comment saying “Add more data here”, lol. We can work with this, though!
Prompt ChatGPT to “generate 50 realistic first names not based on a counter”.
We can use the strength of the AI to give us many diverse names by prompting for different countries or regions, such as “US, France, Germany” or “American, French, German, …” names. Just remember that the real limit is 50. You can even prompt the AI for 50 names that are not in a list of names you supply, giving you new names to work with. So if you want 100 Australian names, prompt for the first 50, then prompt for another 50 while asking it not to include any names from the original 50.
Now that we have a few hundred first and last names, we can use the code we got earlier to generate test data for our application dev and QA work.
Here’s the simple code I came up with. Keep in mind you can also use the generated first and last names to randomly assign data in your “de-identification” process. This code is just a very simple example to get you thinking about the possibilities.
-- Random data generation
SET NOCOUNT ON;

-- Create a temporary table for first names
CREATE TABLE #FirstNames (firstname varchar(20));

-- Insert German first names
INSERT INTO #FirstNames (firstname)
VALUES
('Maximilian'), ('Sophie'), ('Lukas'), ('Mia'), ('Finn'),
('Lena'), ('Leon'), ('Emma'), ('Luca'), ('Hannah'),
('Benjamin'), ('Marie'), ('David'), ('Anna'), ('Elias'),
('Laura'), ('Tim'), ('Lea'), ('Paul'), ('Johanna'),
('Julian'), ('Emilia'), ('Jan'), ('Sophia'), ('Noah'),
('Amelie'), ('Felix'), ('Mila'), ('Simon'), ('Clara'),
('Moritz'), ('Elena'), ('Jonas'), ('Victoria'), ('Oskar'),
('Mara'), ('Nico'), ('Isabella'), ('Erik'), ('Maria'),
('Julius'), ('Fiona'), ('Tom'), ('Maja'), ('Anton'),
('Sofia'), ('Vincent'), ('Emily'), ('Jakob');
-- Insert French first names into #FirstNames
INSERT INTO #FirstNames (firstname)
VALUES
('Léa'), ('Lucas'), ('Chloé'), ('Louis'), ('Emma'),
('Gabriel'), ('Manon'), ('Jules'), ('Inès'), ('Léo'),
('Camille'), ('Hugo'), ('Sarah'), ('Arthur'), ('Alice'),
('Raphaël'), ('Zoé'), ('Enzo'), ('Louise'), ('Nathan'),
('Eva'), ('Théo'), ('Clara'), ('Maël'), ('Charlotte'),
('Tom'), ('Lola'), ('Antoine'), ('Julia'), ('Mathis'),
('Lea'), ('Paul'), ('Mia'), ('Nathan'), ('Emma'),
('Alexandre'), ('Lina'), ('Matheo'), ('Lou'), ('Ethan'),
('Jade'), ('Evan'), ('Léna'), ('Noah'), ('Lilou'),
('Louis'), ('Elena'), ('Sacha'), ('Mila'), ('Leo');
-- Insert American first names into #FirstNames
INSERT INTO #FirstNames (firstname)
VALUES
('Emma'), ('Liam'), ('Olivia'), ('Noah'), ('Ava'),
('Isabella'), ('Sophia'), ('Jackson'), ('Lucas'), ('Mia'),
('Oliver'), ('Ethan'), ('Aiden'), ('Harper'), ('Evelyn'),
('Abigail'), ('Emily'), ('Amelia'), ('Benjamin'), ('Ella'),
('Alexander'), ('Henry'), ('Jacob'), ('Michael'), ('Daniel'),
('William'), ('James'), ('Elijah'), ('Avery'), ('Sofia'),
('Grace'), ('Scarlett'), ('Jack'), ('Chloe'), ('Lily'),
('Luna'), ('Zoe'), ('Mila'), ('Stella'), ('Aria'),
('Eleanor'), ('Hazel'), ('Violet'), ('Leo'), ('Aurora'),
('Levi'), ('Lucy'), ('Nora'), ('Ethan'), ('Ava');
-- Select the generated first names
SELECT * FROM #FirstNames;

-- Create a temporary table for last names
CREATE TABLE #LastNames (lastname varchar(20));

-- Insert German last names
INSERT INTO #LastNames (lastname)
VALUES
('Schmidt'), ('Müller'), ('Schneider'), ('Fischer'), ('Weber'),
('Meyer'), ('Wagner'), ('Becker'), ('Schulz'), ('Hoffmann'),
('Koch'), ('Bauer'), ('Richter'), ('Wolf'), ('Schröder'),
('Schwarz'), ('Hartmann'), ('Lange'), ('Schmitt'), ('Werner'),
('Schmitz'), ('Krause'), ('Maier'), ('Lehmann'), ('Herrmann'),
('Köhler'), ('Klein'), ('Walter'), ('Schuster'), ('Böhm'),
('Seidel'), ('Lorenz'), ('Schwarz'), ('Bergmann'), ('Engel'),
('Vogel'), ('Hahn'), ('Kramer'), ('Frank'), ('Sommer'),
('Kühn'), ('Barth'), ('Zimmermann'), ('Graf'), ('Peters'),
('Krüger'), ('Jäger'), ('Gross'), ('Voigt');
-- Insert French last names into #LastNames
INSERT INTO #LastNames (lastname)
VALUES
('Martin'), ('Bernard'), ('Dubois'), ('Thomas'), ('Robert'),
('Richard'), ('Petit'), ('Durand'), ('Leroy'), ('Moreau'),
('Simon'), ('Laurent'), ('Lefevre'), ('Michel'), ('Garcia'),
('David'), ('Bertrand'), ('Roux'), ('Vincent'), ('Fournier'),
('Morel'), ('Girard'), ('Andre'), ('Lefevre'), ('Mercier'),
('Dupont'), ('Lambert'), ('Boucher'), ('Colin'), ('Caron'),
('Renaud'), ('Picard'), ('Lemoine'), ('Lecomte'), ('Leger'),
('Dufour'), ('Perrin'), ('Marchand'), ('Leclerc'), ('Begue'),
('Gauthier'), ('Faure'), ('Rousseau'), ('Blanc'), ('Guerin'),
('Bouvier'), ('Riviere'), ('Chevalier'), ('Beaumont'), ('Lemaire');
-- Insert American last names into #LastNames
INSERT INTO #LastNames (lastname)
VALUES
('Smith'), ('Johnson'), ('Williams'), ('Jones'), ('Brown'),
('Davis'), ('Miller'), ('Wilson'), ('Moore'), ('Taylor'),
('Anderson'), ('Thomas'), ('Jackson'), ('White'), ('Harris'),
('Martin'), ('Thompson'), ('Garcia'), ('Martinez'), ('Robinson'),
('Clark'), ('Rodriguez'), ('Lewis'), ('Lee'), ('Walker'),
('Hall'), ('Allen'), ('Young'), ('Hernandez'), ('King'),
('Wright'), ('Lopez'), ('Hill'), ('Scott'), ('Green'),
('Adams'), ('Baker'), ('Gonzalez'), ('Nelson'), ('Carter'),
('Mitchell'), ('Perez'), ('Roberts'), ('Turner'), ('Phillips'),
('Campbell'), ('Parker'), ('Evans'), ('Edwards'), ('Collins'),
('Stewart');
-- Select the generated last names
SELECT * FROM #LastNames;
-- Target table for the generated rows
CREATE TABLE #x (id int identity(1,1), firstname varchar(25), lastname varchar(25), lastpurchasedate datetime2(3));

-- Declare variables
DECLARE @Counter INT = 1;

-- Loop to insert 5000 rows
WHILE @Counter <= 5000
BEGIN
    INSERT INTO #x (firstname, lastname, lastpurchasedate)
    VALUES
    (
        (SELECT TOP 1 firstname FROM #FirstNames ORDER BY NEWID()),
        (SELECT TOP 1 lastname FROM #LastNames ORDER BY NEWID()),
        -- Every fifth row gets a NULL date; otherwise a random date within the last 3 years
        CASE WHEN @Counter % 5 = 0 THEN NULL
             ELSE DATEADD(DAY, -ABS(CHECKSUM(NEWID())) % (365*3), GETDATE())
        END
    );

    -- Increment counter
    SET @Counter = @Counter + 1;
END;

SELECT * FROM #x;
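As a side note on the de-identification idea mentioned earlier, the same name tables can drive an UPDATE against a copy of a production table. This is a sketch, not the exact approach from above: dbo.Customers is a hypothetical table, and the WHERE clause inside each CROSS APPLY exists only to reference the outer row so SQL Server evaluates the random TOP 1 per row instead of caching a single result.

```sql
-- dbo.Customers is a hypothetical copy of the production table
UPDATE c
SET c.firstname = fn.firstname,
    c.lastname  = ln.lastname
FROM dbo.Customers AS c
CROSS APPLY (SELECT TOP 1 firstname
             FROM #FirstNames
             WHERE c.id = c.id          -- correlates to the outer row, forcing per-row evaluation
             ORDER BY NEWID()) AS fn
CROSS APPLY (SELECT TOP 1 lastname
             FROM #LastNames
             WHERE c.id = c.id
             ORDER BY NEWID()) AS ln;
```

The result keeps the production row counts and distributions of everything except the name columns, which now hold realistic, repeating values instead of Firstname1, Firstname2, and so on.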
In one of my next posts, I hope to use Azure’s OpenAI resource to take this concept further. Keep an eye out for that post.