As part of my work with Redgate, I wanted to do some testing on our subsetting and masking tools. Subsetting needs a big data set, and while Stack Overflow is big, it's kind of simple. I wanted something a little different. Since our engineers use Northwind to do a lot of demos, I decided to choose that, since it's familiar to them and has a few different tables. I remember Jonathan Kehayias blowing up AdventureWorks years ago, and I decided to do that with Northwind.
However, I wanted help. So I decided to ask an AI and I chose Claude from Anthropic. This article looks at how well this worked.
Working with Claude
My prompt: Can you write a script that will expand the Northwind sample database in SQL Server to 50GB?
The results came back with this text:
Certainly, way0utwest. I can help you create a script to expand the Northwind sample database in SQL Server to approximately 50GB. This will involve generating a large amount of sample data based on the existing tables. Here's a script that should accomplish this:
Then I got this script:
USE Northwind; GO -- Create a temporary table to store random words for generating names CREATE TABLE #RandomWords (Word NVARCHAR(50)); INSERT INTO #RandomWords VALUES ('Alpha'), ('Beta'), ('Gamma'), ('Delta'), ('Epsilon'), ('Zeta'), ('Eta'), ('Theta'), ('Iota'), ('Kappa'), ('Lambda'), ('Mu'), ('Nu'), ('Xi'), ('Omicron'), ('Pi'), ('Rho'), ('Sigma'), ('Tau'), ('Upsilon'), ('Phi'), ('Chi'), ('Psi'), ('Omega'); -- Function to generate a random string CREATE FUNCTION dbo.RandomString(@Length INT) RETURNS NVARCHAR(MAX) AS BEGIN DECLARE @CharPool NVARCHAR(255) = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'; DECLARE @Result NVARCHAR(MAX) = ''; DECLARE @i INT = 0; WHILE @i < @Length BEGIN SET @Result = @Result + SUBSTRING(@CharPool, CAST(RAND() * LEN(@CharPool) AS INT) + 1, 1); SET @i = @i + 1; END RETURN @Result; END GO -- Procedure to expand the database CREATE PROCEDURE ExpandNorthwind AS BEGIN SET NOCOUNT ON; DECLARE @TargetSize BIGINT = 50 * 1024 * 1024 * 1024; -- 50GB in bytes DECLARE @CurrentSize BIGINT; DECLARE @BatchSize INT = 10000; DECLARE @LoopCounter INT = 0; -- Get the current database size SELECT @CurrentSize = SUM(size) * 8 * 1024 FROM sys.database_files WHERE type_desc = 'ROWS'; WHILE @CurrentSize < @TargetSize BEGIN -- Expand Customers INSERT INTO Customers (CustomerID, CompanyName, ContactName, ContactTitle, Address, City, Region, PostalCode, Country, Phone, Fax) SELECT TOP (@BatchSize) dbo.RandomString(5), (SELECT TOP 1 Word FROM #RandomWords ORDER BY NEWID()) + ' ' + (SELECT TOP 1 Word FROM #RandomWords ORDER BY NEWID()) + ' Corp', (SELECT TOP 1 Word FROM #RandomWords ORDER BY NEWID()) + ' ' + (SELECT TOP 1 Word FROM #RandomWords ORDER BY NEWID()), 'Manager', dbo.RandomString(20), (SELECT TOP 1 City FROM Customers ORDER BY NEWID()), (SELECT TOP 1 Region FROM Customers WHERE Region IS NOT NULL ORDER BY NEWID()), dbo.RandomString(10), (SELECT TOP 1 Country FROM Customers ORDER BY NEWID()), '(123) 555-' + CAST(FLOOR(RAND() * 10000) AS VARCHAR(4)), '(123) 555-' + CAST(FLOOR(RAND() * 10000) AS VARCHAR(4)) FROM sys.objects; -- Expand Orders INSERT INTO Orders (CustomerID, EmployeeID, OrderDate, RequiredDate, ShippedDate, ShipVia, Freight, ShipName, ShipAddress, ShipCity, ShipRegion, ShipPostalCode, ShipCountry) SELECT TOP (@BatchSize) (SELECT TOP 1 CustomerID FROM Customers ORDER BY NEWID()), (SELECT TOP 1 EmployeeID FROM Employees ORDER BY NEWID()), DATEADD(DAY, -CAST(RAND() * 3650 AS INT), GETDATE()), DATEADD(DAY, CAST(RAND() * 30 AS INT), GETDATE()), DATEADD(DAY, CAST(RAND() * 15 AS INT), GETDATE()), CAST(RAND() * 3 + 1 AS INT), RAND() * 1000, (SELECT TOP 1 CompanyName FROM Customers ORDER BY NEWID()), dbo.RandomString(30), (SELECT TOP 1 City FROM Customers ORDER BY NEWID()), (SELECT TOP 1 Region FROM Customers WHERE Region IS NOT NULL ORDER BY NEWID()), dbo.RandomString(10), (SELECT TOP 1 Country FROM Customers ORDER BY NEWID()) FROM sys.objects; -- Expand Order Details INSERT INTO [Order Details] (OrderID, ProductID, UnitPrice, Quantity, Discount) SELECT TOP (@BatchSize) (SELECT TOP 1 OrderID FROM Orders ORDER BY NEWID()), (SELECT TOP 1 ProductID FROM Products ORDER BY NEWID()), RAND() * 1000, CAST(RAND() * 100 + 1 AS INT), CAST(RAND() * 0.5 AS DECIMAL(4,2)) FROM sys.objects; SET @LoopCounter = @LoopCounter + 1; -- Check size every 10 loops to reduce overhead IF @LoopCounter % 10 = 0 BEGIN SELECT @CurrentSize = SUM(size) * 8 * 1024 FROM sys.database_files WHERE type_desc = 'ROWS'; PRINT 'Current database size: ' + CAST(@CurrentSize / 1024 / 1024 / 1024.0 AS VARCHAR(10)) + ' GB'; END END -- Clean up DROP TABLE #RandomWords; DROP FUNCTION dbo.RandomString; END GO -- Execute the procedure EXEC ExpandNorthwind; -- Clean up DROP PROCEDURE ExpandNorthwind;
At the end of the script, I had this note:
I clicked Explain in the question and it entered the prompt for me. I got this:
Certainly, I'll explain the key components of this script:
- Setup:
- We create a temporary table #RandomWords to store words for generating names.
- We create a function RandomString to generate random strings for various fields.
- Main Procedure (ExpandNorthwind):
- We set a target size of 50GB and initialize variables.
- We enter a loop that continues until the database reaches the target size.
- Data Generation:
- In each iteration, we insert new records into three main tables: Customers, Orders, and Order Details.
- For Customers:
- We generate random CustomerIDs, company names, contact names, etc.
- We use the RandomWords table and RandomString function to create varied, realistic-looking data.
- For Orders:
- We link to random customers and employees.
- We generate random dates, freight costs, and shipping information.
- For Order Details:
- We link to random orders and products.
- We generate random quantities, prices, and discounts.
- Size Checking:
- Every 10 iterations, we check the current database size.
- We print the current size for monitoring progress.
- Cleanup:
- After reaching the target size, we drop the temporary objects.
- Execution:
- Finally, we execute the ExpandNorthwind procedure.
This script effectively "inflates" the Northwind database by adding a large volume of pseudo-random data based on the existing schema and data patterns. It maintains referential integrity by linking new records to existing or newly created related records.
The script is designed to run until the database reaches approximately 50GB, which could take a considerable amount of time depending on your hardware. It's also worth noting that this will put a significant load on your SQL Server, so it should be run in a non-production environment.
Not bad. Except it doesn't work.
First, there was no GO between the temp table and the CREATE FUNCTION, which is a no-no. Easy fix for me, but then if I run the code, I get this error:
Msg 443, Level 16, State 1, Procedure RandomString, Line 11 [Batch Start Line 12]
Invalid use of a side-effecting operator 'rand' within a function.
This is because a function can't have a RAND() function inside it. Lots of posts talk about this, but perhaps Claude was thinking more generic SQL.
Adjusting the Prompt
I then did this: change the code to work in SQL Server without using RAND() in a function
I got some interesting results:
Certainly, way0utwest. You're right to point out that using RAND() in a function can cause performance issues in SQL Server. I'll modify the script to avoid this problem. Here's the updated version:
Then I got more code. This time I had a different explanation at the bottom.
Key changes in this version:
- Removed the
RandomString
function entirely.- Instead of using
RAND()
, we now useNEWID()
andCHECKSUM(NEWID())
to generate random values. This approach is more efficient in SQL Server.- For generating random strings (like CustomerID, Address, etc.), we use
LEFT(CONVERT(VARCHAR(40), NEWID()), n)
wheren
is the desired length.- For generating random numbers, we use
ABS(CHECKSUM(NEWID())) % n
wheren
is the desired range.- For generating random decimals (like Freight and UnitPrice), we combine two random numbers.
- For generating random dates, we use
DATEADD
with a random number of days.This version should perform better in SQL Server while still providing a good distribution of random values. The script maintains the same overall structure and functionality as the previous version, just with a different method of generating random data.
Still a problem with the GO, but then this didn't run because of this line:
DECLARE @TargetSize BIGINT = 50 * 1024 * 1024 * 1024; -- 50GB in bytes
While the target is bigint, the scalar values all default to int, and we can an overflow. Try it.
However, if I change the code to this, it works.
DECLARE @factor BIGINT = 1024; SELECT @targetsize = 50 * @factor * @factor * @factor;
That seemed to work and the script started to run.
Note: I did edit the script to create or replace the procedure as I was testing this.
The Execution
This is slow. I stopped the script, and then ran this code:
CREATE TABLE dbo.Logger ( logid INT NOT NULL IDENTITY(1,1) CONSTRAINT LoggerPK PRIMARY KEY , Logdate DATETIME DEFAULT GETDATE() , LogMsg VARCHAR(1000) )
I then added a couple of INSERTS to the script to track what was happening. This was my code:
INSERT dbo.Logger (Logdate, LogMsg) VALUES (DEFAULT, ('CurrentSize:' + CAST (@CurrentSize AS VARCHAR(20)) + ', TargetSize:' + CAST (@TargetSize AS VARCHAR(20))));
While I ran the script, I then queried this table a few times to be sure I was getting more data added and the database was growing. It was.
I let this run for quite some time, nearly a day.
At this time, I had a 4GB database. That's bigger, but not great. When I stopped the script, I saw this: a lot of FK errors.
When I checked, I found that I had a lot of customers and a lot of orders, but still just a couple thousand order detail lines. I hadn't been checking this table as I assumed if I was getting orders, I was getting order details. My mistake.
That was it for this experiment. I decided not to try and correct things, as I could use this copy of Northwind to help me do some other work.
Conclusion
This was an interesting experiment for me, and it showed that while an AI can come up with a nice first draft, it isn't really intelligent. It doesn't quite understand what happens in databases, and it certainly isn't testing code.
Like a junior assistant.
This got me some basic code, which helped me, though I had to work with it. This AI isn't going to do more than get me started and save a little beginning time, but I still have to invest some effort to ensure the results meet my needs.
I'm still torn on how helpful an AI can be, but this was a fascinating few hours for me.