September 22, 2017 at 7:35 am
Good afternoon all.
I am trying to figure out a way of getting the total row count of a table and dividing it into equally sized batches (the number of batches will vary) for an insert statement to process.
i.e. if a table has 100,000 rows and I specify 5 batches, I currently get the following from my CTE (below):
TheRange   RangeStart   RangeEnd
1          1            20000
2          20001        40000
3          40001        60000
4          60001        80000
5          80001        100000
so I can insert into a new table like this: INSERT INTO <SOMETABLE> SELECT ... FROM <SOURCETABLE> WHERE <PRIMARY KEY> BETWEEN @RowStart AND @RowEnd
... then loop to the next range.
This means I won't fill up the transaction log by trying to insert everything in one go. The actual row counts of the tables I need to do this for are in the tens of millions.
I was hoping to divide whatever the row count of the table is into <row_count>/100,000 batches of 100,000 rows each - so for a 1M-row table we get 10 batches, and so on.
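For context, the kind of loop I have in mind is roughly this (a sketch only - the #BatchRanges temp table, dbo.SourceTable, dbo.TargetTable, and the ID column are placeholder names, not my real objects):
-- Assumes #BatchRanges holds the TheRange / RangeStart / RangeEnd rows
-- produced by the batching query below.
DECLARE @RowStart bigint, @RowEnd bigint, @Batch int = 1, @MaxBatch int;

SELECT @MaxBatch = MAX(TheRange) FROM #BatchRanges;

WHILE @Batch <= @MaxBatch
BEGIN
    SELECT @RowStart = RangeStart, @RowEnd = RangeEnd
    FROM #BatchRanges
    WHERE TheRange = @Batch;

    INSERT INTO dbo.TargetTable (ID, Col1, Col2)   -- placeholder columns
    SELECT ID, Col1, Col2
    FROM dbo.SourceTable
    WHERE ID BETWEEN @RowStart AND @RowEnd;        -- one batch per iteration

    SET @Batch += 1;
END;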
I have purloined some code from the internet, apologies as I cannot remember where and so cannot attribute it, but it is as follows:
;WITH Pass0 AS (SELECT 1 AS c UNION ALL SELECT 1),          -- 2 rows
Pass1 AS (SELECT 1 AS c FROM Pass0 AS a, Pass0 AS b),       -- 4 rows
Pass2 AS (SELECT 1 AS c FROM Pass1 AS a, Pass1 AS b),       -- 16 rows
Pass3 AS (SELECT 1 AS c FROM Pass2 AS a, Pass2 AS b),       -- 256 rows
Pass4 AS (SELECT 1 AS c FROM Pass3 AS a, Pass3 AS b),       -- 65,536 rows
Pass5 AS (SELECT 1 AS c FROM Pass4 AS a, Pass4 AS b),       -- ~4.3 billion rows
vals (val) AS (SELECT ROW_NUMBER() OVER (ORDER BY c) AS number FROM Pass5)
SELECT [TheTile] AS [TheRange],
       MIN(val)  AS RangeStart,
       MAX(val)  AS [RangeEnd]
FROM (SELECT NTILE(5) OVER (ORDER BY val) AS [TheTile],
             val
      FROM vals
      WHERE val BETWEEN 1 AND 100000) a
GROUP BY [TheTile]
ORDER BY [TheTile];
where NTILE(#) is the number of batches, and in "WHERE val BETWEEN <START> AND <END>" the start is 1 and the end is the total row count of the table.
My problem is that this works well until the row count goes above about 1M. I have a table with 92M rows for which I want <start> and <end> values for the insert statement, but the query takes forever to run: for a 1M-row table I need 1,000,000/100,000 = 10 batches, but for the 92M table I need 344 batches and the whole thing grinds to a halt.
Does anyone have an idea of a way to achieve what I am trying to do, please? The code I posted does not scale well at all - there must be a better way.
Thank you all for your input.
Kindest
Duncan.
September 22, 2017 at 7:45 am
EDIT:
I thought it was quicker, but it's still really slow on 92M rows. I would welcome any ideas, please?
Thank you.
It's OK, I've sorted it using the tried and tested GetNums() function:
SELECT [TheTile] AS [TheRange],
       MIN(a.n)  AS RangeStart,
       MAX(a.n)  AS [RangeEnd]
FROM (SELECT NTILE(344) OVER (ORDER BY gn.n) AS [TheTile],
             gn.n
      FROM dbo.GetNums(1, 10000000) gn
      WHERE gn.n BETWEEN 1 AND 10000000) a
GROUP BY [TheTile]
ORDER BY [TheTile];
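For reference, the GetNums() I'm using here is the well-known cascading-CTE number generator (Itzik Ben-Gan's version); its definition isn't shown above, but it looks roughly like this:
-- Sketch of the usual GetNums() definition - assumed, since the actual
-- function body isn't posted in this thread.
CREATE FUNCTION dbo.GetNums (@low AS bigint, @high AS bigint) RETURNS TABLE
AS
RETURN
    WITH L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
         L1   AS (SELECT 1 AS c FROM L0 AS a CROSS JOIN L0 AS b),
         L2   AS (SELECT 1 AS c FROM L1 AS a CROSS JOIN L1 AS b),
         L3   AS (SELECT 1 AS c FROM L2 AS a CROSS JOIN L2 AS b),
         L4   AS (SELECT 1 AS c FROM L3 AS a CROSS JOIN L3 AS b),
         L5   AS (SELECT 1 AS c FROM L4 AS a CROSS JOIN L4 AS b),
         Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rownum
                  FROM L5)
    SELECT TOP (@high - @low + 1) @low + rownum - 1 AS n
    FROM Nums
    ORDER BY rownum;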
thanks all!
September 22, 2017 at 8:17 am
You're thinking in the wrong direction. Instead of generating the rows and grouping them, you need to generate the groups and calculate which rows each one should include. So, instead of creating 10,000,000 rows, you create just 5.
DECLARE @Rows    int = 10000000,
        @Batches int = 5;

WITH
E(n) AS(    -- 10 rows
    SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))E(n)
),
E2(n) AS(   -- 100 rows
    SELECT a.n FROM E a, E b
),
E4(n) AS(   -- 10,000 rows, plenty for any sensible number of batches
    SELECT a.n FROM E2 a, E2 b
),
cteTally(n) AS(
    SELECT TOP (@Batches) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) n
    FROM E4
)
SELECT n AS [TheRange],
       1 + CAST(CEILING(@Rows / (@Batches * 1.0)) AS bigint) * (n - 1) AS [RangeStart],
       CAST(CEILING(@Rows / (@Batches * 1.0)) AS bigint) * n           AS [RangeEnd]
FROM cteTally;
-- The * 1.0 forces non-integer division so CEILING can round the batch size
-- up; with plain integer division the last range can stop short of @Rows.
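For the defaults above (@Rows = 10,000,000 and @Batches = 5) that should return:
TheRange   RangeStart   RangeEnd
1          1            2000000
2          2000001      4000000
3          4000001      6000000
4          6000001      8000000
5          8000001      10000000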
September 25, 2017 at 12:48 am
Thank you Luis, I was indeed looking at the issue from the wrong direction. Your code is excellent and very fast; I will work through what it is doing and keep it as a learning example for the future. Much obliged! D.