Continuation from the previous 29 parts, starting from http://www.nikoport.com/2013/07/05/clustered-columnstore-indexes-part-1-intro/
It has been a long time (almost a year) since Microsoft revealed the public preview version of SQL Server with Clustered Columnstore Indexes available, and ever since there were hundreds of presentation talking about the Bulk loading and the magic number around 100.000 rows which separates the final result as an open Delta-Store vs a compressed Segment.
I am guilty in this imprecise information as well – ever since the first presentation on Clustered Columnstore in July of 2013 I was telling everyone about “around 100.000 rows”.
Now it is the time to find out the exact number.
I have already shown this demos last weekend at SQLSaturday Edinburgh 2014, and so it is time to publish it on the web.
I will kick of by creating my test table:
CREATE TABLE [dbo].[MaxDataTable]( [c1] [bigint] NULL, [c2] [numeric](36, 3) NULL, [c3] [bit] NULL, [c4] [smallint] NULL, [c5] [decimal](18, 3) NULL, [c6] [smallmoney] NULL, [c7] [int] NULL, [c8] [tinyint] NULL, [c9] [money] NULL, [c10] [real] NULL, [c11] [real] NULL, [c12] [date] NULL, [c13] [datetimeoffset](7) NULL, [c14] [datetime2](7) NULL, [c15] [smalldatetime] NULL, [c16] [datetime] NULL, [c17] [time](7) NULL, [c18] [char](100) NULL, [c19] [varchar](100) NULL, [c20] [nchar](100) NULL, [c21] [nvarchar](100) NULL, [c22] [binary](8) NULL, [c23] [varbinary](8) NULL, [c24] [uniqueidentifier] NULL ) ON [PRIMARY]; GO -- Create Clustered Columnstore Index: create clustered columnstore index CCI_MaxDataTable on [dbo].[MaxDataTable];
Now let us load a full Segment (1045678 rows) and update it with some random data and then invoke Tuple Mover (since it will keep on staying as an open Delta-store even when it reaches the maximum number of rows):
Set NoCount ON Truncate table dbo.MaxDataTable; declare @i as int; declare @max as int; select @max = isnull(max(C1),0) from dbo.MaxDataTable; set @i = 1; begin tran while @i <= 1048576 begin insert into dbo.MaxDataTable default values set @i = @i + 1; end; commit; -- Lets Update all rows with some random data with updTable as ( select * , row_number() over(partition by C1 order by C1) as rnk from dbo.MaxDataTable ) update updTable set C1 = rnk, C2 = ABS(CHECKSUM(NewId())) % 142359.0, C3 = 1, C4 = cast(ABS(CHECKSUM(NewId())) % 10000 as smallint), C5 = ABS(CHECKSUM(NewId())) % 242359.0, C6 = rnk / ABS(CHECKSUM(NewId())) % 242359.0, C7 = ABS(CHECKSUM(NewId())) % 2000000, C8 = ABS(CHECKSUM(NewId())) % 255, C9 = rnk / ABS(CHECKSUM(NewId())) % 242359.0, C10 = rnk / ABS(CHECKSUM(NewId())) % 242359.0, C11 = rnk / ABS(CHECKSUM(NewId())) % 242359.0, C12 = getDate(), C14 = DATEADD(day, (ABS(CHECKSUM(NEWID())) % 65530), 0), C15 = DATEADD(day, (ABS(CHECKSUM(NEWID())) % 65530), 0), C16 = DATEADD(day, (ABS(CHECKSUM(NEWID())) % 65530), 0), C17 = getDate(), C18 = cast( ABS(CHECKSUM(NewId())) % 242359.0 as char(25)), C19 = cast( ABS(CHECKSUM(NewId())) % 232659.0 as varchar(25)), C20 = cast( ABS(CHECKSUM(NewId())) % 242359.0 as nchar(25)), C21 = cast( ABS(CHECKSUM(NewId())) % 232659.0 as nvarchar(25)); -- Build our Segment alter index CCI_MaxDataTable on dbo.MaxDataTable Rebuild;
Now we can verify if what kind of structures we have in our Clustered Columnstore Table:
-- Check on the Row Groups status SELECT rg.total_rows, state_description, cast(100.0*(total_rows - ISNULL(deleted_rows,0))/iif(total_rows = 0, 1, total_rows) as Decimal(6,3)) AS PercentFull, i.object_id, object_name(i.object_id) AS TableName, i.name AS IndexName, i.index_id, i.type_desc, rg.* FROM sys.indexes AS i INNEr JOIN sys.column_store_row_groups AS rg ON i.object_id = rg.object_id AND i.index_id = rg.index_id WHERE object_name(i.object_id) in ( 'MaxDataTable') ORDER BY object_name(i.object_id), i.name, row_group_id;
We have enough data to play, so let us export 100K & 105K Rows into a file and load them back into our table using BULK API to see what happens:
-- Export 100K Rows EXEC xp_cmdshell 'bcp "SELECT top 100000 * FROM Columnstore_Play.dbo.MaxDataTable" queryout "C:\Install\MaxDataTable_100K.rpt" -T -c -t,'; -- Export 105K Rows EXEC xp_cmdshell 'bcp "SELECT top 105000 * FROM Columnstore_Play.dbo.MaxDataTable" queryout "C:\Install\MaxDataTable_105K.rpt" -T -c -t,'; -- ***************************************************************** -- Import 100.000 Rows BULK INSERT dbo.MaxDataTable FROM 'C:\Install\MaxDataTable_100K.rpt' WITH ( FIELDTERMINATOR =',', ROWTERMINATOR ='\n' ); -- Import 105.000 Rows BULK INSERT dbo.MaxDataTable FROM 'C:\Install\MaxDataTable_105K.rpt' WITH ( FIELDTERMINATOR =',', ROWTERMINATOR ='\n' );
I ran again my query to analyse what is going on with our table on the Row Group level and as you can see the 100.000 Rows BULK Load API generated a Delta-Store, while 105.000 Rows Build Load API created a compressed Segment.
Let us keep on and create 102.500 Rows Segment and see how it goes:
EXEC xp_cmdshell 'bcp "SELECT top 102500 * FROM Columnstore_Play.dbo.MaxDataTable" queryout "C:\Install\MaxDataTable_102500.rpt" -T -c -t,'; -- ***************************************************************** -- Import 102.500 Rows BULK INSERT dbo.MaxDataTable FROM 'C:\Install\MaxDataTable_102500.rpt' WITH ( FIELDTERMINATOR =',', ROWTERMINATOR ='\n' );
This basic means that the Tipping Point of Bulk Load API is to be found between 100.000 & 102.500 Rows. Lets find the exact number!
Actually there is one interesting number that might be connected to this story – let us try something containing 1024, such as 102.400 :
EXEC xp_cmdshell 'bcp "SELECT top 102399 * FROM Columnstore_Play.dbo.MaxDataTable" queryout "C:\Install\MaxDataTable_102399.rpt" -T -c -t,'; -- 102.399 Rows BULK INSERT dbo.MaxDataTable FROM 'C:\Install\MaxDataTable_102399.rpt' WITH ( FIELDTERMINATOR =',', ROWTERMINATOR ='\n' ); -- ***************************************************************** -- 102.400 Rows EXEC xp_cmdshell 'bcp "SELECT top 102400 * FROM Columnstore_Play.dbo.MaxDataTable" queryout "C:\Install\MaxDataTable_102400.rpt" -T -c -t,'; -- Import 100.000 Rows BULK INSERT dbo.MaxDataTable FROM 'C:\Install\MaxDataTable_102400.rpt' WITH ( FIELDTERMINATOR =',', ROWTERMINATOR ='\n' );
Bingo, we have our winner – 102400 Rows is the magic number when Bulk Load API is switching from the Delta-Store usage into loading data into a fully operation & compressed Segment. Our 102.3909 File was added to an open Delta-Store, while we have a full segment with 102.400 Rows.
Wait a second, but I was actually testing rather short table, and what about much wider table ? I will create a table with Clustered Columnstore with almost the maximum width of the columns allowed – 8K+:
drop table dbo.MaxDataTable; create table dbo.MaxDataTable( id int identity, c1 char(8000)); -- Create Clustered Columnstore Index: create clustered columnstore index CCI_MaxDataTable on [dbo].[MaxDataTable]; -- Let us load a segment set nocount on declare @i as int; declare @max as int; select @max = isnull(max(C1),0) from dbo.MaxDataTable; set @i = 1; begin tran while @i <= 1048576 begin insert into dbo.MaxDataTable default values set @i = @i + 1; end; commit; -- Ensure we have a segment: alter index CCI_MaxDataTable on dbo.MaxDataTable REBUILD; -- ***************************************************************** EXEC xp_cmdshell 'bcp "SELECT top 102399 * FROM Columnstore_Play.dbo.MaxDataTable" queryout "C:\Install\MaxDataTable_102399.rpt" -T -c -t,'; -- 102.399 Rows BULK INSERT dbo.MaxDataTable FROM 'C:\Install\MaxDataTable_102399.rpt' WITH ( FIELDTERMINATOR =',', ROWTERMINATOR ='\n' ); -- 102.400 Rows EXEC xp_cmdshell 'bcp "SELECT top 102400 * FROM Columnstore_Play.dbo.MaxDataTable" queryout "C:\Install\MaxDataTable_102400.rpt" -T -c -t,'; -- Import 100.000 Rows BULK INSERT dbo.MaxDataTable FROM 'C:\Install\MaxDataTable_102400.rpt' WITH ( FIELDTERMINATOR =',', ROWTERMINATOR ='\n' );
— Check on the Row Groups status
SELECT rg.total_rows, state_description, cast(100.0*(total_rows - ISNULL(deleted_rows,0))/iif(total_rows = 0, 1, total_rows) as Decimal(6,3)) AS PercentFull, i.object_id, object_name(i.object_id) AS TableName, i.name AS IndexName, i.index_id, i.type_desc, rg.* FROM sys.indexes AS i INNEr JOIN sys.column_store_row_groups AS rg ON i.object_id = rg.object_id AND i.index_id = rg.index_id WHERE object_name(i.object_id) in ( 'MaxDataTable') ORDER BY object_name(i.object_id), i.name, row_group_id;
And here we go:
102400 is the magic number for the Clustered Columnstore Bulk Load API
to be continued …