February 23, 2016 at 12:27 pm
I think in this case you weren't hijacking at all (I believe you said you arrived here via Google for a similar issue), and I've appreciated the additional testing and discussion. Thanks for adding additional value to the thread,
Rich
February 24, 2016 at 10:20 pm
Just a quick update:
I have a good solution that I'm still testing (note my earlier comment). It has taken me longer than expected. I'll try to post it in the morning.
-- Itzik Ben-Gan 2001
February 25, 2016 at 9:32 am
Ok, here we go. This may be a bit much to follow but the key take away is that I'm showing you an excellent alternative to NTILE using a tally table[/url] and CROSS APPLY[/url].
Code for the tally table
Note that this solution uses a permanent tally table (a CTE tally or "getnumbers" function will result in much poorer performance.)
IF OBJECT_ID('dbo.NTally') IS NOT NULL DROP FUNCTION dbo.NTally;
IF OBJECT_ID('dbo.tally') IS NOT NULL DROP TABLE dbo.tally;
CREATE TABLE dbo.tally (N int NOT NULL);
INSERT dbo.tally
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns a, sys.all_columns b;
ALTER TABLE dbo.tally ADD CONSTRAINT pk_tally PRIMARY KEY CLUSTERED(N) WITH FILLFACTOR=100;
ALTER TABLE dbo.tally ADD CONSTRAINT uq_tally UNIQUE NONCLUSTERED(N);
The NTally Inline Table Valued Function
Play around with the example code inside the comments to better understand how you would use dbo.NTally to replace NTile.
IF OBJECT_ID('dbo.NTally') IS NOT NULL DROP FUNCTION dbo.NTally;
GO
CREATE FUNCTION dbo.NTally(@tiles bigint, @rows bigint)
/****************************************************************************************
Purpose:
Returns a tally table with "tile groups" and can be used as an alternative to the T-SQL
NTILE function introduced in SQL Server 2005. See the usage examples below for more
details on how to use this function.
Compatibility:
SQL Server 2005+
Syntax:
SELECT rn, tile
FROM dbo.NTally(@tiles, @rows);
Parameters:
@tiles = bigint; requested number of tile groups (same as the parameter passed to NTILE)
@rows = bigint; the number of rows to be "tiled" (have group number assigned to it)
Return Types:
Inline Table Valued Function returns:
rn = bigint; a row number beginning with 1 and ending with @rows
tile = int; a "tile number" or group number the same
Developer Notes:
1. Requires a tally table named dbo.tally. Run the code below to create a correctly
indexed tally table (note the "Beginning" and "End" of tally code).
--===== Beginning of dbo.tally code
-- Drop if NTALLY function and tally table if they exist
IF OBJECT_ID('dbo.NTally') IS NOT NULL DROP FUNCTION dbo.NTally;
IF OBJECT_ID('dbo.tally') IS NOT NULL DROP TABLE dbo.tally;
-- Create the tally table
CREATE TABLE dbo.tally (N int NOT NULL);
-- Insert the numbers 1 through 1,000,0000 into dbo.tally
INSERT dbo.tally
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns a, sys.all_columns b;
-- Create required primary key, clustered index and unique nonclustered indexes
ALTER TABLE dbo.tally ADD CONSTRAINT pk_tally PRIMARY KEY CLUSTERED(N)
WITH FILLFACTOR=100;
-- Note that, when present, the optimizer will chose the unique nonclustered index
-- which gets the job done with fewer reads. This index is mandetory for dbo.NTally
-- to function optimally
ALTER TABLE dbo.tally ADD CONSTRAINT uq_tally UNIQUE NONCLUSTERED(N);
--===== End of tally table code
Note that, if you need to use a different schema and/or name for your tally table
make sure to change the tally table schema/name in the function to match. To learn
more about tally tables see http://www.sqlservercentral.com/articles/T-SQL/62867/
2. With R as the number of rows in your tally table the maximum number of rows this
function will create is (R*R) for each "tile" group per partition. R also represents
the maximum number of tile groups available. A one million row tally table will requires
roughly 20MB of uncompressed disk space and will support up to one million tile groups
with up to one trillion rows per tile group.
3. For best results make sure that a P.O.C. index is available for your query. For more
information about P.O.C. indexes see:
http://sqlmag.com/sql-server-2012/sql-server-2012-how-write-t-sql-window-functions-part-3
4. NTally is deterministic; for more about deterministic and nondeterministic functions
see https://msdn.microsoft.com/en-us/library/ms178091.aspx
Examples:
--===== 1. Demonstrating how the function mimics NTILE
-- To better understand NTally, run the code below by running with different values
-- assigned to @rows and @tiles.
DECLARE @rows bigint = 8, @tiles bigint = 3;
SELECT rn, tile, NTILE(@tiles) OVER (ORDER BY rn) as [NTILE]
FROM dbo.NTally(@tiles, @rows);
--===== 2. Using NTally as a faster alternative to NTILE (with no PARTITION BY clause)
-- Run the code below from <START> to <END>.
-- Note how you get the same result but how, the more rows you add, the more efficient
-- the NTALLY solution is, with respect to reads, when compared to NTILE:
-- e.g. NTILE against 100K rows = 200K+ reads, only 560+ reads for the the NTally method
-- <START>
-- Declare variables
DECLARE @rows bigint = 8, @tiles bigint = 5;
-- Setup sample data
DECLARE @SomeTable TABLE (SomeValue int primary key);
INSERT @SomeTable
SELECT TOP(@rows) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))*5
FROM sys.all_columns a, sys.all_columns b;
-- How to divide @some table into 3 tile groups using NTILE
SET STATISTICS IO ON;
PRINT 'NTILE version:';
SELECT SomeValue, NTILE(@tiles) OVER (ORDER BY SomeValue) AS TileGroup
FROM @SomeTable;
-- How to divide @SomeTable into 3 tile groups using NTally
PRINT CHAR(10)+'NTally version:';
WITH anchor AS
(
SELECT SomeValue, ROW_NUMBER() OVER (ORDER BY SomeValue) AS rn
FROM @SomeTable
)
SELECT SomeValue, nt.tile AS TileGroup
FROM anchor a
CROSS APPLY dbo.NTally(@tiles, (SELECT COUNT(*) FROM @SomeTable)) nt
WHERE a.rn = nt.rn;
SET STATISTICS IO OFF;
-- <END>
--===== 3. Using NTally an alternative to NTILE with a PARTITION BY clause
-- Create sample table with 10 rows and 3 partitions
IF OBJECT_ID('tempdb..#SomeTable') IS NOT NULL DROP TABLE #SomeTable;
CREATE TABLE #SomeTable
(
PartitionKey int NOT NULL,
SomeValue int NOT NULL,
CONSTRAINT pk_SomeTable PRIMARY KEY(PartitionKey,SomeValue)
);
INSERT #SomeTable
SELECT TOP (12)
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))/5+1,
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))*5
FROM sys.all_columns;
-- Using NTILE and PARTITION BY
SELECT
s.PartitionKey,
s.SomeValue,
NTILE(3) OVER (PARTITION BY s.PartitionKey ORDER BY s.SomeValue) AS TileGroup
FROM #SomeTable s;
-- Using the NTally function
WITH
anchor AS -- Use ROW_NUMBER for your partitioning and sorting
(
SELECT
rn = ROW_NUMBER() OVER (PARTITION BY PartitionKey ORDER BY SomeValue),
PartitionKey,
SomeValue
FROM #SomeTable v
),
parts AS -- collect the number of rows per partition
(
SELECT PartitionKey, mxrn = MAX(rn)
FROM anchor
GROUP BY PartitionKey
)
SELECT a.PartitionKey, a.SomeValue, nt.tile AS TileGroup
FROM parts p
CROSS APPLY dbo.NTally(3,mxrn) nt
CROSS APPLY anchor a
WHERE p.PartitionKey = a.PartitionKey AND a.rn = nt.rn;
DROP TABLE #SomeTable;
---------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20140501 - Initial Creation - Alan Burstein
Rev 01 - 20160224 - Final touches and optimization including comments - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
calculate_tiles AS -- Calculate the number of tiles per tile group
(
SELECT t.N, tile = (@rows/@tiles) + CASE WHEN t.N <= (@rows%@tiles) THEN 1 ELSE 0 END
FROM dbo.tally t
WHERE t.N <= @tiles
),
assemble_tiles AS
(
SELECT tile = topn.N
FROM calculate_tiles ct
CROSS APPLY
(
SELECT TOP(ct.tile) N = ct.N
FROM dbo.tally t1 CROSS JOIN dbo.tally t2
) topn
)
SELECT TOP 100 PERCENT
rn = ROW_NUMBER() OVER (ORDER BY a.tile), -- Your anchor row
a.tile
FROM assemble_tiles a
ORDER BY a.tile; -- Your spoon
GO
A Quick premier on how to use dbo.NTally to replace NTILE in your requirement. Looking at my solution from earlier in this thread let's start with how we'd use dbo.NTally to calculate Q1 instead of NTILE.
-- what you have now
SELECT ID, val1, Q1 = NTILE(4) OVER (ORDER BY val1), val2
FROM #t
WHERE val1 IS NOT NULL;
-- new and improved code using my NTally function
WITH
anchor AS
(
SELECT rn = ROW_NUMBER() OVER (ORDER BY val1), ID, val1, val2
FROM #t
WHERE val1 IS NOT NULL
)
SELECT ID, val1, Q1 = nt.tile, val2
FROM anchor a
CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM anchor)) nt
WHERE a.rn = nt.rn;
Yes, that's more code but, in a 1,000,000 row test the NTILE method generates nearly 3,000,0000 reads whereas the dbo.NTally method only generates fewer than 1,000 reads.
Next lets look at how we would improve the UNION ALL part of the query...
SELECT ID, val1, Q1 = NULL, val2
FROM #t
WHERE val1 IS NULL
UNION ALL
SELECT ID, val1, Q1 = NTILE(4) OVER (ORDER BY val1), val2
FROM #t
WHERE val1 IS NOT NULL;
-- new and improved code using my NTally function
WITH
anchor AS
(
SELECT rn = ROW_NUMBER() OVER (ORDER BY val1), ID, val1, val2
FROM #t
WHERE val1 IS NOT NULL
)
SELECT ID, val1, Q1 = NULL, val2
FROM #t
WHERE val1 IS NULL
UNION ALL
SELECT ID, val1, Q1 = nt.tile, val2
FROM anchor a
CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM anchor)) nt
WHERE a.rn = nt.rn;
Looking at my previous solution it would get too ugly and complex to replace NTile with my dbo.NTally function all in one query. Here's what it would look like to include a Q2:
-- what you have now
PRINT 'NTILE';
WITH
cteQ1 AS
(
SELECT ID, val1, Q1 = NULL, val2
FROM #t
WHERE val1 IS NULL
UNION ALL
SELECT ID, val1, Q1 = NTILE(4) OVER (ORDER BY val1), val2
FROM #t
WHERE val1 IS NOT NULL
),
cteQ2 AS
(
SELECT ID, val2, Q2 = NTILE(4) OVER (ORDER BY val2)
FROM cteQ1
WHERE val2 IS NOT NULL
)
SELECT
cteQ1.ID,
cteQ1.val1,
cteQ1.Q1,
cteQ2.val2,
cteQ2.Q2
FROM cteQ1
LEFT JOIN cteQ2 ON cteQ1.ID = cteQ2.ID;
-- How we'd do it using dbo.NTally
PRINT CHAR(10)+'NTally';
WITH
anchor AS
(
SELECT rn = ROW_NUMBER() OVER (ORDER BY val1), ID, val1, val2
FROM #t
WHERE val1 IS NOT NULL
),
cteQ1 AS
(
SELECT ID, val1, Q1 = NULL, val2
FROM #t
WHERE val1 IS NULL
UNION ALL
SELECT ID, val1, Q1 = nt.tile, val2
FROM anchor a
CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM anchor)) nt
WHERE a.rn = nt.rn
),
cteQ2anchor AS
(
SELECT rn = ROW_NUMBER() OVER (ORDER BY val2), ID, val2
FROM cteQ1
WHERE val2 IS NOT NULL
),
cteQ2anchorCount AS (SELECT Q2ac = COUNT(*) FROM cteQ2anchor),
cteQ2 AS
(
SELECT ID, val2, Q2 = nt.tile
FROM cteQ2anchor Q2a
CROSS APPLY cteQ2anchorCount Q2ac
CROSS APPLY dbo.NTally(4, Q2ac.Q2ac) nt
WHERE Q2a.rn = nt.rn
)
SELECT
cteQ1.ID,
cteQ1.val1,
cteQ1.Q1,
cteQ2.val2,
cteQ2.Q2
FROM cteQ1
LEFT JOIN cteQ2 ON cteQ1.ID = cteQ2.ID;
So, instead, here's how I'd do it for maximum performance. First, I'd create this index:
CREATE NONCLUSTERED INDEX nc_t_val1_temp2 ON #t(val1) INCLUDE (ID, val2, val3, val4);
then do this:
The Final Solution using dbo.NTally
IF OBJECT_ID('tempdb..#FR') IS NOT NULL DROP TABLE #FR;
CREATE TABLE #FR
(
ID int NOT NULL,
val1 int NULL,
Q1 int NULL,
val2 int NULL,
Q2 int NULL,
val3 int NULL,
Q3 int NULL,
val4 int NULL,
Q4 int NULL,
);
GO
WITH
anchor AS
(
SELECT rn = ROW_NUMBER() OVER (ORDER BY val1), ID, val1, val2, val3, val4
FROM #t
WHERE val1 IS NOT NULL
)
INSERT INTO #FR
SELECT ID, val1, Q1 = NULL, val2, Q2 = NULL, val3, Q3 = NULL, val4, Q4 = NULL
FROM #t
WHERE val1 IS NULL
UNION ALL
SELECT ID, val1, Q1 = nt.tile, val2, NULL, val3, NULL, val4, NULL
FROM anchor a
CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM anchor)) nt
WHERE a.rn = nt.rn;
CREATE CLUSTERED INDEX pk_FR ON #FR(ID);
CREATE NONCLUSTERED INDEX nc__V2_ID ON #FR(val2) INCLUDE (ID,Q2);
CREATE NONCLUSTERED INDEX nc__V3_ID ON #FR(val3) INCLUDE (ID,Q3);
CREATE NONCLUSTERED INDEX nc__V4_ID ON #FR(val4) INCLUDE (ID,Q4);
-- Q2
WITH cteQ2 AS
(
SELECT rn = ROW_NUMBER() OVER (ORDER BY Val2), ID, Val2, Q2
FROM #FR
WHERE val2 IS NOT NULL
)
UPDATE cteQ2
SET Q2 = nt.tile
FROM cteQ2
CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM cteQ2)) nt
WHERE cteQ2.rn = nt.rn;
-- Q3
WITH cteQ3 AS
(
SELECT rn = ROW_NUMBER() OVER (ORDER BY val3), ID, val3, Q3
FROM #FR
WHERE val3 IS NOT NULL
)
UPDATE cteQ3
SET Q3 = nt.tile
FROM cteQ3
CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM cteQ3)) nt
WHERE cteQ3.rn = nt.rn;
-- Q4
WITH cteQ4 AS
(
SELECT rn = ROW_NUMBER() OVER (ORDER BY val4), ID, val4, Q4
FROM #FR
WHERE val4 IS NOT NULL
)
UPDATE cteQ4
SET Q4 = nt.tile
FROM cteQ4
CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM cteQ4)) nt
WHERE cteQ4.rn = nt.rn;
-- The final results:
SELECT * FROM #FR;
This is obviously much more code but, for 92 million rows, we're talking about several 1,000 reads vs several billion reads. Hope that answers your question. :hehe:
-- Itzik Ben-Gan 2001
February 25, 2016 at 11:26 am
February 25, 2016 at 11:48 am
EdgeOfKnowWare (2/25/2016)
+1
Cracked me up!
-- Itzik Ben-Gan 2001
February 25, 2016 at 11:57 am
Seriously though, thanks. That's a lot to pour over. I will definitely test this out and see if it's something that we can use.
July 13, 2016 at 1:12 pm
Alan.B (2/25/2016)
The NTally Inline Table Valued Function
N-I-I-I-I-C-E! Great documentation, too!
--Jeff Moden
Change is inevitable... Change for the better is not.
July 19, 2016 at 11:33 am
Jeff Moden (7/13/2016)
Alan.B (2/25/2016)
The NTally Inline Table Valued FunctionN-I-I-I-I-C-E! Great documentation, too!
Thank you, thank you! I thought you might like that ;-).
-- Itzik Ben-Gan 2001
Viewing 8 posts - 16 through 22 (of 22 total)
You must be logged in to reply to this topic. Login to reply