SQLServerCentral Article

Big Northwind with AI

,

As part of my work with Redgate, I wanted to do some testing on our subsetting and masking tools. Subsetting needs a big data set, and while Stack Overflow is big, it's kind of simple. I wanted something a little different. Since our engineers use Northwind to do a lot of demos, I decided to choose that, since it's familiar to them and has a few different tables. I remember Jonathan Kehayias blowing up AdventureWorks years ago, and I decided to do that with Northwind.

However, I wanted help. So I decided to ask an AI and I chose Claude from Anthropic. This article looks at how well this worked.

Working with Claude

My prompt: Can you write a script that will expand the Northwind sample database in SQL Server to 50GB?

The results came back with this text:

Certainly, way0utwest. I can help you create a script to expand the Northwind sample database in SQL Server to approximately 50GB. This will involve generating a large amount of sample data based on the existing tables. Here's a script that should accomplish this:

Then I got this script:

USE Northwind;
GO
-- Create a temporary table to store random words for generating names
CREATE TABLE #RandomWords (Word NVARCHAR(50));
INSERT INTO #RandomWords VALUES 
('Alpha'), ('Beta'), ('Gamma'), ('Delta'), ('Epsilon'),
('Zeta'), ('Eta'), ('Theta'), ('Iota'), ('Kappa'),
('Lambda'), ('Mu'), ('Nu'), ('Xi'), ('Omicron'),
('Pi'), ('Rho'), ('Sigma'), ('Tau'), ('Upsilon'),
('Phi'), ('Chi'), ('Psi'), ('Omega');
-- Function to generate a random string
CREATE FUNCTION dbo.RandomString(@Length INT)
RETURNS NVARCHAR(MAX)
AS
BEGIN
    DECLARE @CharPool NVARCHAR(255) = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    DECLARE @Result NVARCHAR(MAX) = '';
    DECLARE @i INT = 0;
    WHILE @i < @Length
    BEGIN
        SET @Result = @Result + SUBSTRING(@CharPool, CAST(RAND() * LEN(@CharPool) AS INT) + 1, 1);
        SET @i = @i + 1;
    END
    RETURN @Result;
END
GO
-- Procedure to expand the database
CREATE PROCEDURE ExpandNorthwind
AS
BEGIN
    SET NOCOUNT ON;
    
    DECLARE @TargetSize BIGINT = 50 * 1024 * 1024 * 1024; -- 50GB in bytes
    DECLARE @CurrentSize BIGINT;
    DECLARE @BatchSize INT = 10000;
    DECLARE @LoopCounter INT = 0;
    -- Get the current database size
    SELECT @CurrentSize = SUM(size) * 8 * 1024 
    FROM sys.database_files 
    WHERE type_desc = 'ROWS';
    WHILE @CurrentSize < @TargetSize
    BEGIN
        -- Expand Customers
        INSERT INTO Customers (CustomerID, CompanyName, ContactName, ContactTitle, Address, City, Region, PostalCode, Country, Phone, Fax)
        SELECT TOP (@BatchSize)
            dbo.RandomString(5),
            (SELECT TOP 1 Word FROM #RandomWords ORDER BY NEWID()) + ' ' + (SELECT TOP 1 Word FROM #RandomWords ORDER BY NEWID()) + ' Corp',
            (SELECT TOP 1 Word FROM #RandomWords ORDER BY NEWID()) + ' ' + (SELECT TOP 1 Word FROM #RandomWords ORDER BY NEWID()),
            'Manager',
            dbo.RandomString(20),
            (SELECT TOP 1 City FROM Customers ORDER BY NEWID()),
            (SELECT TOP 1 Region FROM Customers WHERE Region IS NOT NULL ORDER BY NEWID()),
            dbo.RandomString(10),
            (SELECT TOP 1 Country FROM Customers ORDER BY NEWID()),
            '(123) 555-' + CAST(FLOOR(RAND() * 10000) AS VARCHAR(4)),
            '(123) 555-' + CAST(FLOOR(RAND() * 10000) AS VARCHAR(4))
        FROM sys.objects;
        -- Expand Orders
        INSERT INTO Orders (CustomerID, EmployeeID, OrderDate, RequiredDate, ShippedDate, ShipVia, Freight, ShipName, ShipAddress, ShipCity, ShipRegion, ShipPostalCode, ShipCountry)
        SELECT TOP (@BatchSize)
            (SELECT TOP 1 CustomerID FROM Customers ORDER BY NEWID()),
            (SELECT TOP 1 EmployeeID FROM Employees ORDER BY NEWID()),
            DATEADD(DAY, -CAST(RAND() * 3650 AS INT), GETDATE()),
            DATEADD(DAY, CAST(RAND() * 30 AS INT), GETDATE()),
            DATEADD(DAY, CAST(RAND() * 15 AS INT), GETDATE()),
            CAST(RAND() * 3 + 1 AS INT),
            RAND() * 1000,
            (SELECT TOP 1 CompanyName FROM Customers ORDER BY NEWID()),
            dbo.RandomString(30),
            (SELECT TOP 1 City FROM Customers ORDER BY NEWID()),
            (SELECT TOP 1 Region FROM Customers WHERE Region IS NOT NULL ORDER BY NEWID()),
            dbo.RandomString(10),
            (SELECT TOP 1 Country FROM Customers ORDER BY NEWID())
        FROM sys.objects;
        -- Expand Order Details
        INSERT INTO [Order Details] (OrderID, ProductID, UnitPrice, Quantity, Discount)
        SELECT TOP (@BatchSize)
            (SELECT TOP 1 OrderID FROM Orders ORDER BY NEWID()),
            (SELECT TOP 1 ProductID FROM Products ORDER BY NEWID()),
            RAND() * 1000,
            CAST(RAND() * 100 + 1 AS INT),
            CAST(RAND() * 0.5 AS DECIMAL(4,2))
        FROM sys.objects;
        SET @LoopCounter = @LoopCounter + 1;
        -- Check size every 10 loops to reduce overhead
        IF @LoopCounter % 10 = 0
        BEGIN
            SELECT @CurrentSize = SUM(size) * 8 * 1024 
            FROM sys.database_files 
            WHERE type_desc = 'ROWS';
            
            PRINT 'Current database size: ' + CAST(@CurrentSize / 1024 / 1024 / 1024.0 AS VARCHAR(10)) + ' GB';
        END
    END
    -- Clean up
    DROP TABLE #RandomWords;
    DROP FUNCTION dbo.RandomString;
END
GO
-- Execute the procedure
EXEC ExpandNorthwind;
-- Clean up
DROP PROCEDURE ExpandNorthwind;

At the end of the script, I had this note:

Claude AI asking to explain the code

I clicked Explain in the question and it entered the prompt for me. I got this:

Certainly, I'll explain the key components of this script:

  1. Setup:
    • We create a temporary table #RandomWords to store words for generating names.
    • We create a function RandomString to generate random strings for various fields.
  2. Main Procedure (ExpandNorthwind):
    • We set a target size of 50GB and initialize variables.
    • We enter a loop that continues until the database reaches the target size.
  3. Data Generation:
    • In each iteration, we insert new records into three main tables: Customers, Orders, and Order Details.
    • For Customers:
      • We generate random CustomerIDs, company names, contact names, etc.
      • We use the RandomWords table and RandomString function to create varied, realistic-looking data.
    • For Orders:
      • We link to random customers and employees.
      • We generate random dates, freight costs, and shipping information.
    • For Order Details:
      • We link to random orders and products.
      • We generate random quantities, prices, and discounts.
  4. Size Checking:
    • Every 10 iterations, we check the current database size.
    • We print the current size for monitoring progress.
  5. Cleanup:
    • After reaching the target size, we drop the temporary objects.
  6. Execution:
    • Finally, we execute the ExpandNorthwind procedure.

This script effectively "inflates" the Northwind database by adding a large volume of pseudo-random data based on the existing schema and data patterns. It maintains referential integrity by linking new records to existing or newly created related records.

The script is designed to run until the database reaches approximately 50GB, which could take a considerable amount of time depending on your hardware. It's also worth noting that this will put a significant load on your SQL Server, so it should be run in a non-production environment.

Not bad. Except it doesn't work.

First, there was no GO between the temp table and the CREATE FUNCTION, which is a no-no. Easy fix for me, but then if I run the code, I get this error:

Msg 443, Level 16, State 1, Procedure RandomString, Line 11 [Batch Start Line 12]

Invalid use of a side-effecting operator 'rand' within a function.

This is because a function can't have a RAND() function inside it. Lots of posts talk about this, but perhaps Claude was thinking more generic SQL.

Adjusting the Prompt

I then did this: change the code to work in SQL Server without using RAND() in a function

I got some interesting results:

Certainly, way0utwest. You're right to point out that using RAND() in a function can cause performance issues in SQL Server. I'll modify the script to avoid this problem. Here's the updated version:

Then I got more code. This time I had a different explanation at the bottom.

Key changes in this version:

  1. Removed the RandomString function entirely.
  2. Instead of using RAND(), we now use NEWID() and CHECKSUM(NEWID()) to generate random values. This approach is more efficient in SQL Server.
  3. For generating random strings (like CustomerID, Address, etc.), we use LEFT(CONVERT(VARCHAR(40), NEWID()), n) where n is the desired length.
  4. For generating random numbers, we use ABS(CHECKSUM(NEWID())) % n where n is the desired range.
  5. For generating random decimals (like Freight and UnitPrice), we combine two random numbers.
  6. For generating random dates, we use DATEADD with a random number of days.

This version should perform better in SQL Server while still providing a good distribution of random values. The script maintains the same overall structure and functionality as the previous version, just with a different method of generating random data.

Still a problem with the GO, but then this didn't run because of this line:

 DECLARE @TargetSize BIGINT = 50 * 1024 * 1024 * 1024; -- 50GB in bytes

While the target is bigint, the scalar values all default to int, and we can an overflow. Try it.

However, if I change the code to this, it works.

DECLARE @factor BIGINT = 1024;
SELECT  @targetsize = 50 * @factor * @factor * @factor;

That seemed to work and the script started to run.

Note: I did edit the script to create or replace the procedure as I was testing this.

The Execution

This is slow. I stopped the script, and then ran this code:

CREATE TABLE dbo.Logger
( logid INT NOT NULL IDENTITY(1,1) CONSTRAINT LoggerPK PRIMARY KEY
, Logdate DATETIME DEFAULT GETDATE()
, LogMsg VARCHAR(1000)
)

I then added a couple of INSERTS to the script to track what was happening. This was my code:

           INSERT dbo.Logger
             (Logdate, LogMsg)
           VALUES
             (DEFAULT, ('CurrentSize:' + CAST (@CurrentSize AS VARCHAR(20)) + ', TargetSize:' + CAST (@TargetSize AS VARCHAR(20))));

While I ran the script, I then queried this table a few times to be sure I was getting more data added and the database was growing. It was.

I let this run for quite some time, nearly a day.

script execution time

At this time, I had a 4GB database.  That's bigger, but not great. When I stopped the script, I saw this: a lot of FK errors.

FK errors

When I checked, I found that I had a lot of customers and a lot of orders, but still just a couple thousand order detail lines. I hadn't been checking this table as I assumed if I was getting orders, I was getting order details. My mistake.

That was it for this experiment. I decided not to try and correct things, as I could use this copy of Northwind to help me do some other work.

Conclusion

This was an interesting experiment for me, and it showed that while an AI can come up with a nice first draft, it isn't really intelligent. It doesn't quite understand what happens in databases, and it certainly isn't testing code.

Like a junior assistant.

This got me some basic code, which helped me, though I had to work with it. This AI isn't going to do more than get me started and save a little beginning time, but I still have to invest some effort to ensure the results meet my needs.

I'm still torn on how helpful an AI can be, but this was a fascinating few hours for me.

 

 

Rate

5 (3)

You rated this post out of 5. Change rating

Share

Share

Rate

5 (3)

You rated this post out of 5. Change rating