Clustered Columnstore Indexes – part 30 (“Bulk Load API Magic Number”)

Continuation from the previous 29 parts, starting from http://www.nikoport.com/2013/07/05/clustered-columnstore-indexes-part-1-intro/

It has been a long time (almost a year) since Microsoft revealed the public preview version of SQL Server with Clustered Columnstore Indexes available, and ever since there were hundreds of presentation talking about the Bulk loading and the magic number around 100.000 rows which separates the final result as an open Delta-Store vs a compressed Segment.

I am guilty in this imprecise information as well – ever since the first presentation on Clustered Columnstore in July of 2013 I was telling everyone about “around 100.000 rows”.

Now it is the time to find out the exact number.

I have already shown this demos last weekend at SQLSaturday Edinburgh 2014, and so it is time to publish it on the web.

I will kick of by creating my test table:

CREATE TABLE [dbo].[MaxDataTable](
[c1] [bigint] NULL,
[c2] [numeric](36, 3) NULL,
[c3] [bit] NULL,
[c4] [smallint] NULL,
[c5] [decimal](18, 3) NULL,
[c6] [smallmoney] NULL,
[c7] [int] NULL,
[c8] [tinyint] NULL,
[c9] [money] NULL,
[c10] [real] NULL,
[c11] [real] NULL,
[c12] [date] NULL,
[c13] [datetimeoffset](7) NULL,
[c14] [datetime2](7) NULL,
[c15] [smalldatetime] NULL,
[c16] [datetime] NULL,
[c17] [time](7) NULL,
[c18] [char](100) NULL,
[c19] [varchar](100) NULL,
[c20] [nchar](100) NULL,
[c21] [nvarchar](100) NULL,
[c22] [binary](8) NULL,
[c23] [varbinary](8) NULL,
[c24] [uniqueidentifier] NULL
) ON [PRIMARY];
GO
-- Create Clustered Columnstore Index:
create clustered columnstore index CCI_MaxDataTable
on  [dbo].[MaxDataTable];

Now let us load a full Segment (1045678 rows) and update it with some random data and then invoke Tuple Mover (since it will keep on staying as an open Delta-store even when it reaches the maximum number of rows):

Set NoCount ON
Truncate table dbo.MaxDataTable;
declare @i as int;
declare @max as int;
select @max = isnull(max(C1),0) from dbo.MaxDataTable;
set @i = 1;
begin tran
while @i <= 1048576
begin
insert into dbo.MaxDataTable
default values
set @i = @i + 1;
end;
commit;
-- Lets Update all rows with some random data
with updTable as
(
select *
, row_number() over(partition by C1 order by C1) as rnk
from dbo.MaxDataTable
)
update updTable
set C1 = rnk,
C2 = ABS(CHECKSUM(NewId())) % 142359.0,
C3 = 1,
C4 = cast(ABS(CHECKSUM(NewId())) % 10000 as smallint),
C5 = ABS(CHECKSUM(NewId())) % 242359.0,
C6 = rnk / ABS(CHECKSUM(NewId())) % 242359.0,
C7 = ABS(CHECKSUM(NewId())) % 2000000,
C8 = ABS(CHECKSUM(NewId())) % 255,
C9 = rnk / ABS(CHECKSUM(NewId())) % 242359.0,
C10 = rnk / ABS(CHECKSUM(NewId())) % 242359.0,
C11 = rnk / ABS(CHECKSUM(NewId())) % 242359.0,
C12 = getDate(),
C14 = DATEADD(day, (ABS(CHECKSUM(NEWID())) % 65530), 0),
C15 = DATEADD(day, (ABS(CHECKSUM(NEWID())) % 65530), 0),
C16 = DATEADD(day, (ABS(CHECKSUM(NEWID())) % 65530), 0),
C17 = getDate(),
C18 = cast( ABS(CHECKSUM(NewId())) % 242359.0 as char(25)),
C19 = cast( ABS(CHECKSUM(NewId())) % 232659.0 as varchar(25)),
C20 = cast( ABS(CHECKSUM(NewId())) % 242359.0 as nchar(25)),
C21 = cast( ABS(CHECKSUM(NewId())) % 232659.0 as nvarchar(25));
-- Build our Segment
alter index CCI_MaxDataTable on dbo.MaxDataTable
Rebuild;

Now we can verify if what kind of structures we have in our Clustered Columnstore Table:

-- Check on the Row Groups status
SELECT rg.total_rows, 
state_description,
cast(100.0*(total_rows - ISNULL(deleted_rows,0))/iif(total_rows = 0, 1, total_rows) as Decimal(6,3)) AS PercentFull, 
i.object_id, object_name(i.object_id) AS TableName, 
i.name AS IndexName, i.index_id, i.type_desc, 
rg.*
FROM sys.indexes AS i
INNEr JOIN sys.column_store_row_groups AS rg
ON i.object_id = rg.object_id
AND i.index_id = rg.index_id 
WHERE object_name(i.object_id) in ( 'MaxDataTable')
ORDER BY object_name(i.object_id), i.name, row_group_id;

We have enough data to play, so let us export 100K & 105K Rows into a file and load them back into our table using BULK API to see what happens:

-- Export 100K Rows
EXEC xp_cmdshell 'bcp "SELECT top 100000 * FROM Columnstore_Play.dbo.MaxDataTable" queryout "C:\Install\MaxDataTable_100K.rpt" -T -c -t,';
-- Export 105K Rows
EXEC xp_cmdshell 'bcp "SELECT top 105000 * FROM Columnstore_Play.dbo.MaxDataTable" queryout "C:\Install\MaxDataTable_105K.rpt" -T -c -t,';
-- *****************************************************************
-- Import 100.000 Rows
BULK INSERT dbo.MaxDataTable
   FROM 'C:\Install\MaxDataTable_100K.rpt'
   WITH 
      (
         FIELDTERMINATOR =',',
         ROWTERMINATOR ='\n'
      );
-- Import 105.000 Rows
BULK INSERT dbo.MaxDataTable
   FROM 'C:\Install\MaxDataTable_105K.rpt'
   WITH 
      (
         FIELDTERMINATOR =',',
         ROWTERMINATOR ='\n'
      );

I ran again my query to analyse what is going on with our table on the Row Group level and as you can see the 100.000 Rows BULK Load API generated a Delta-Store, while 105.000 Rows Build Load API created a compressed Segment.

Let us keep on and create 102.500 Rows Segment and see how it goes:

EXEC xp_cmdshell 'bcp "SELECT top 102500 * FROM Columnstore_Play.dbo.MaxDataTable" queryout "C:\Install\MaxDataTable_102500.rpt" -T -c -t,';
-- *****************************************************************
-- Import 102.500 Rows
BULK INSERT dbo.MaxDataTable
   FROM 'C:\Install\MaxDataTable_102500.rpt'
   WITH 
      (
         FIELDTERMINATOR =',',
         ROWTERMINATOR ='\n'
      );

Here is our result:

This basic means that the Tipping Point of Bulk Load API is to be found between 100.000 & 102.500 Rows. Lets find the exact number!

Actually there is one interesting number that might be connected to this story – let us try something containing 1024, such as 102.400 :

EXEC xp_cmdshell 'bcp "SELECT top 102399 * FROM Columnstore_Play.dbo.MaxDataTable" queryout "C:\Install\MaxDataTable_102399.rpt" -T -c -t,';
-- 102.399 Rows
BULK INSERT dbo.MaxDataTable
   FROM 'C:\Install\MaxDataTable_102399.rpt'
   WITH 
      (
         FIELDTERMINATOR =',',
         ROWTERMINATOR ='\n'
      );
-- *****************************************************************
-- 102.400 Rows
EXEC xp_cmdshell 'bcp "SELECT top 102400 * FROM Columnstore_Play.dbo.MaxDataTable" queryout "C:\Install\MaxDataTable_102400.rpt" -T -c -t,';
-- Import 100.000 Rows
BULK INSERT dbo.MaxDataTable
   FROM 'C:\Install\MaxDataTable_102400.rpt'
   WITH 
      (
         FIELDTERMINATOR =',',
         ROWTERMINATOR ='\n'
      );

Bingo, we have our winner – 102400 Rows is the magic number when Bulk Load API is switching from the Delta-Store usage into loading data into a fully operation & compressed Segment. Our 102.3909 File was added to an open Delta-Store, while we have a full segment with 102.400 Rows.

Wait a second, but I was actually testing rather short table, and what about much wider table ? I will create a table with Clustered Columnstore with almost the maximum width of the columns allowed – 8K+:

drop table dbo.MaxDataTable;
create table dbo.MaxDataTable(
id int identity,
c1 char(8000));
-- Create Clustered Columnstore Index:
create clustered columnstore index CCI_MaxDataTable
on  [dbo].[MaxDataTable];
-- Let us load a segment
set nocount on
declare @i as int;
declare @max as int;
select @max = isnull(max(C1),0) from dbo.MaxDataTable;
set @i = 1;
begin tran
while @i <= 1048576
begin
insert into dbo.MaxDataTable
default values
set @i = @i + 1;
end;
commit;
-- Ensure we have a segment:
alter index CCI_MaxDataTable on dbo.MaxDataTable
REBUILD;
-- *****************************************************************
EXEC xp_cmdshell 'bcp "SELECT top 102399 * FROM Columnstore_Play.dbo.MaxDataTable" queryout "C:\Install\MaxDataTable_102399.rpt" -T -c -t,';
-- 102.399 Rows
BULK INSERT dbo.MaxDataTable
   FROM 'C:\Install\MaxDataTable_102399.rpt'
   WITH 
      (
         FIELDTERMINATOR =',',
         ROWTERMINATOR ='\n'
      );
-- 102.400 Rows
EXEC xp_cmdshell 'bcp "SELECT top 102400 * FROM Columnstore_Play.dbo.MaxDataTable" queryout "C:\Install\MaxDataTable_102400.rpt" -T -c -t,';
-- Import 100.000 Rows
BULK INSERT dbo.MaxDataTable
   FROM 'C:\Install\MaxDataTable_102400.rpt'
   WITH 
      (
         FIELDTERMINATOR =',',
         ROWTERMINATOR ='\n'
      );

— Check on the Row Groups status

SELECT rg.total_rows, 
state_description,
cast(100.0*(total_rows - ISNULL(deleted_rows,0))/iif(total_rows = 0, 1, total_rows) as Decimal(6,3)) AS PercentFull, 
i.object_id, object_name(i.object_id) AS TableName, 
i.name AS IndexName, i.index_id, i.type_desc, 
rg.*
FROM sys.indexes AS i
INNEr JOIN sys.column_store_row_groups AS rg
ON i.object_id = rg.object_id
AND i.index_id = rg.index_id 
WHERE object_name(i.object_id) in ( 'MaxDataTable')
ORDER BY object_name(i.object_id), i.name, row_group_id;

And here we go:

102400 is the magic number for the Clustered Columnstore Bulk Load API

to be continued …

Clustered Columnstore Indexes – part 30 (“Bulk Load API Magic Number”)

Rate

Share

Share

Rate

Clustered Columnstore Indexes – part 30 (“Bulk Load API Magic Number”)

Rate

Share

Share

Rate

Related content

Book Review: Big Red - Voyage of a Trident Submarine

Database Mirroring FAQ: Can a 2008 SQL instance be used as the witness for a 2005 database mirroring setup?

Inserting Markup into a String with SQL

Networking - Part 4

Speaking at Community Events - More Thoughts