Over the last couple of months I have on several occasions found myself in need of test data for demos, blogposts etc. You all know AdventureWorks, which will work for a lot of demoing stuff. But the amount of data is limited, and if you are demoing stuff about performance, AdventureWorks just isn’t that good.
I started looking around for tools, and thought I would have a look at TPC. My goal was not to be able to compare performance and transaction loads, but simply to generate data sets a bit bigger than AdventureWorks. So I found the TPC-H DBGen. In this blog post I will show you step by step how to use that tool to generate a database with a set of tables with data volumes large enough to be able to demonstrate performance tuning stuff.
1. Download DBGen
Download DBGen from http://www.tpc.org/tpch/spec/tpch_2_14_3.zip and extract it. In this guide I extracted the zip file to C:\tpch_2_14_3\
This is the project that we need to build to get the executable.
2. Build the solution
Now open C:\tpch_2_14_3\dbgen\tpch.sln in Visual Studio. Depending on your Visual Studio version, you might be faced with a conversion wizard. Just click Finish to execute the conversion. All you need to do, is to build the entire solution. I had some errors because of some locked files, so I had to manually delete all files from the C:\tpch_2_14_3\dbgen\Debug folder before I could compile the solution. The result is the file C:\tpch_2_14_3\dbgen\Debug\dbgen.exe.
3. Generate data using dbgen.exe
Now we need to execution dbgen.exe. If we execute the command with –h we get some help:
If we simply run dbgen.exe, it default to generating 1 GB of data, divided into 8 different tables (customers, nation, lineitem, orders, parts, partsupp, region, supplier). The –s parameter specifies a scale factor, so –s 10 gives us 10GB, and –s 100 generates 100GB of data. Let’s just try the default:
Whoops, we got an error! Why this is, I’m not sure, but I found the solution to be simple: Copy the file dbgen.exe one level up, so it is located in the C:\tpch_2_14_3\dbgen folder, and try again:
The –v gives verbose output. Now it generates files for each table. Depending on the speed of your system, this may take a few minutes.
The resulting files will be located in the same directory as dbgen.exe. The list of generated files is:
4. Create database and tables
The next we need to do, is create an empty database, and create the tables.
CREATE DATABASE DemoData GO
The schema we need, is available in the file C:\tpch_2_14_3\dbgen\dss.ddl and it looks like this:
-- Sccsid: @(#)dss.ddl 2.1.8.1 CREATE TABLE NATION ( N_NATIONKEY INTEGER NOT NULL, N_NAME CHAR(25) NOT NULL, N_REGIONKEY INTEGER NOT NULL, N_COMMENT VARCHAR(152)); CREATE TABLE REGION ( R_REGIONKEY INTEGER NOT NULL, R_NAME CHAR(25) NOT NULL, R_COMMENT VARCHAR(152)); CREATE TABLE PART ( P_PARTKEY INTEGER NOT NULL, P_NAME VARCHAR(55) NOT NULL, P_MFGR CHAR(25) NOT NULL, P_BRAND CHAR(10) NOT NULL, P_TYPE VARCHAR(25) NOT NULL, P_SIZE INTEGER NOT NULL, P_CONTAINER CHAR(10) NOT NULL, P_RETAILPRICE DECIMAL(15,2) NOT NULL, P_COMMENT VARCHAR(23) NOT NULL ); CREATE TABLE SUPPLIER ( S_SUPPKEY INTEGER NOT NULL, S_NAME CHAR(25) NOT NULL, S_ADDRESS VARCHAR(40) NOT NULL, S_NATIONKEY INTEGER NOT NULL, S_PHONE CHAR(15) NOT NULL, S_ACCTBAL DECIMAL(15,2) NOT NULL, S_COMMENT VARCHAR(101) NOT NULL); CREATE TABLE PARTSUPP ( PS_PARTKEY INTEGER NOT NULL, PS_SUPPKEY INTEGER NOT NULL, PS_AVAILQTY INTEGER NOT NULL, PS_SUPPLYCOST DECIMAL(15,2) NOT NULL, PS_COMMENT VARCHAR(199) NOT NULL ); CREATE TABLE CUSTOMER ( C_CUSTKEY INTEGER NOT NULL, C_NAME VARCHAR(25) NOT NULL, C_ADDRESS VARCHAR(40) NOT NULL, C_NATIONKEY INTEGER NOT NULL, C_PHONE CHAR(15) NOT NULL, C_ACCTBAL DECIMAL(15,2) NOT NULL, C_MKTSEGMENT CHAR(10) NOT NULL, C_COMMENT VARCHAR(117) NOT NULL); CREATE TABLE ORDERS ( O_ORDERKEY INTEGER NOT NULL, O_CUSTKEY INTEGER NOT NULL, O_ORDERSTATUS CHAR(1) NOT NULL, O_TOTALPRICE DECIMAL(15,2) NOT NULL, O_ORDERDATE DATE NOT NULL, O_ORDERPRIORITY CHAR(15) NOT NULL, O_CLERK CHAR(15) NOT NULL, O_SHIPPRIORITY INTEGER NOT NULL, O_COMMENT VARCHAR(79) NOT NULL); CREATE TABLE LINEITEM ( L_ORDERKEY INTEGER NOT NULL, L_PARTKEY INTEGER NOT NULL, L_SUPPKEY INTEGER NOT NULL, L_LINENUMBER INTEGER NOT NULL, L_QUANTITY DECIMAL(15,2) NOT NULL, L_EXTENDEDPRICE DECIMAL(15,2) NOT NULL, L_DISCOUNT DECIMAL(15,2) NOT NULL, L_TAX DECIMAL(15,2) NOT NULL, L_RETURNFLAG CHAR(1) NOT NULL, L_LINESTATUS CHAR(1) NOT NULL, L_SHIPDATE DATE NOT NULL, L_COMMITDATE DATE NOT NULL, L_RECEIPTDATE DATE NOT NULL, L_SHIPINSTRUCT CHAR(25) NOT NULL, L_SHIPMODE CHAR(10) NOT NULL, L_COMMENT VARCHAR(44) NOT NULL);
If you want referential integrity on your tables, you can find help in this file: C:\tpch_2_14_3\dbgen\dss.ri
5. Load data into tables
Now we simply need to execute the following commands to import the data:
USE DemoData GO BULK INSERT part FROM 'C:\tpch_2_14_3\dbgen\part.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|') BULK INSERT customer FROM 'C:\tpch_2_14_3\dbgen\customer.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|') BULK INSERT orders FROM 'C:\tpch_2_14_3\dbgen\orders.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|') BULK INSERT partsupp FROM 'C:\tpch_2_14_3\dbgen\partsupp.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|') BULK INSERT supplier FROM 'c:\tpch_2_14_3\dbgen\supplier.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|') BULK INSERT lineitem FROM 'C:\tpch_2_14_3\dbgen\lineitem.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|') BULK INSERT nation FROM 'C:\tpch_2_14_3\dbgen\nation.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|') BULK INSERT region FROM 'C:\tpch_2_14_3\dbgen\region.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
And that’s it
Using this guide we can easily create 1, 10 or hundreds of GB of data, giving us the base we need to truly demonstrate the effect of performance tuning. This demo database will be used in later blog posts about query analyzing and performance optimizing. At some point I will also look into using the TPC tools for testing the transactional capacity of different systems, to be able to see the effect of configuration changes, hardware upgrades and other things we might wanna change to improve a system.