November 3, 2017 at 11:16 pm
Hi,
I have to insert sales order data from a staging table to a base table where further processing is done on this data. The staging table has around half a million records for every insert. The problem is that the processing can only be done on a maximum of 10,000 records at a time. For this, I plan to divide the inserted data into batches of roughly 10,000 records each. An important point to be considered while allocating batch numbers is that Orders must be kept together. Some orders have only one line while others (majority) have multiple lines. So, each batch should have a maximum of 10,000 lines but should include all lines of any order.
For example, if in batch 11 an order starts on line number 9800 and has 201 records, then a new batch number (batch 12) must start with this order even though only 9800 lines are there for the previous batch (batch 11).
I would really appreciate any help or suggestions on how I can achieve this.
Sudarshan
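The requirement boils down to one greedy pass over the staging rows: accumulate whole orders until the next order would push the batch past the cap, then open a new batch. A minimal sketch of that rule (in Python, purely to illustrate the logic; the actual work would be done in T-SQL as in the replies below):

```python
from itertools import groupby

def assign_batches(order_lines, batch_size=10_000):
    """Assign a batch number to each order so that all lines of an
    order share one batch and no batch exceeds batch_size lines.
    order_lines: one order number per line, grouped by order."""
    batch_for_order = {}
    batch_num, lines_in_batch = 1, 0
    for order, lines in groupby(order_lines):
        n = sum(1 for _ in lines)
        # If this whole order would overflow the current batch, start a new
        # one (an order larger than batch_size still gets a batch of its own).
        if lines_in_batch and lines_in_batch + n > batch_size:
            batch_num += 1
            lines_in_batch = 0
        batch_for_order[order] = batch_num
        lines_in_batch += n
    return batch_for_order
```

With the example from the post, an order of 201 lines beginning at line 9800 of batch 11 would tip the count past 10,000, so the whole order moves intact into batch 12.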
November 4, 2017 at 8:44 am
pssudarshan - Friday, November 3, 2017 11:16 PM
Here is a quick example
😎
USE [TEEST]
GO
DECLARE @SET_SIZE BIGINT = 500;
DECLARE @BATCH_SIZE BIGINT = 100;
DECLARE @NUM_LINES BIGINT = 29;
DECLARE @FIRST_ORDER BIGINT = 7984;
-- Inline tally: the cross joins of T yield up to 1,000,000 rows; TOP caps it at @SET_SIZE
WITH T(N) AS (SELECT X.N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) X(N))
, NUMS(N) AS (SELECT TOP(@SET_SIZE) ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N
FROM T T1,T T2,T T3,T T4,T T5,T T6)
-- Sample orders of roughly @NUM_LINES lines each
,SAMPLE_DATA AS
(
SELECT
FLOOR(NM.N / @NUM_LINES) + @FIRST_ORDER AS ORDER_NUMBER
,NM.N AS ROWNUMBER
FROM NUMS NM
)
,SAMPLE_ORDERS AS
(
SELECT
SD.ORDER_NUMBER
,ROW_NUMBER() OVER
(
PARTITION BY SD.ORDER_NUMBER
ORDER BY SD.ROWNUMBER
) AS LINE_NUMBER
,SD.ROWNUMBER
FROM SAMPLE_DATA SD
)
-- First and last row number of each order
,SO_MIN_MAX AS
(
SELECT
SO.ROWNUMBER
,SO.ORDER_NUMBER
,SO.LINE_NUMBER
,MIN(SO.ROWNUMBER) OVER (PARTITION BY SO.ORDER_NUMBER) AS START_ROW
,MAX(SO.ROWNUMBER) OVER (PARTITION BY SO.ORDER_NUMBER) AS END_ROW
FROM SAMPLE_ORDERS SO
)
-- Flag each row: 1 when the next @BATCH_SIZE boundary above it reaches the order's END_ROW
,BASE_BATCH_MATRIX AS
(
SELECT
SMM.ROWNUMBER
,SMM.ORDER_NUMBER
,SMM.LINE_NUMBER
,SMM.START_ROW
,SMM.END_ROW
,CASE WHEN (FLOOR(SMM.ROWNUMBER / @BATCH_SIZE) + 1) * @BATCH_SIZE < SMM.END_ROW THEN 0 ELSE 1 END AS BATCH_MATRIX
FROM SO_MIN_MAX SMM
)
-- One row per order; marker = 1 when the order's rows straddle a batch-size boundary
,MARKED_SET AS
(
SELECT
BBM.ORDER_NUMBER
,BBM.START_ROW
,BBM.END_ROW
,MAX(BBM.BATCH_MATRIX) - MIN(BBM.BATCH_MATRIX) AS BATCH_MARKER
FROM BASE_BATCH_MATRIX BBM
GROUP BY BBM.ORDER_NUMBER
,BBM.START_ROW
,BBM.END_ROW
)
-- Running sum of the markers yields the batch number
SELECT
MS.ORDER_NUMBER
,MS.START_ROW
,MS.END_ROW
,SUM(MS.BATCH_MARKER) OVER
(
ORDER BY MS.START_ROW
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW
) + 1 AS BATCH_NUMBER
FROM MARKED_SET MS
;
Output
ORDER_NUMBER START_ROW END_ROW BATCH_NUMBER
-------------------- -------------------- -------------------- ------------
7984 1 28 1
7985 29 57 1
7986 58 86 1
7987 87 115 2
7988 116 144 2
7989 145 173 2
7990 174 202 3
7991 203 231 3
7992 232 260 3
7993 261 289 3
7994 290 318 4
7995 319 347 4
7996 348 376 4
7997 377 405 5
7998 406 434 5
7999 435 463 5
8000 464 492 5
8001 493 500 5
November 4, 2017 at 9:22 am
It's probably easier to just process them in batches rather than run through them once to add batch numbers and then run through them again to process them. Here is an approach that will add the batch numbers. I've tested it with a table of six records, using batch sizes of 2 and 3.
DECLARE @t TABLE (order_num tinyint, batch_num tinyint)
;
DECLARE @batch_num tinyint = 0,
@batch_size tinyint = 2 -- 3
;
INSERT @t(order_num)
VALUES(1),(2), (2), (3), (3), (4)
;
WHILE @@ROWCOUNT > 0
BEGIN
SET @batch_num = @batch_num + 1 /* increment @batch_num here to prevent infinite loop */
;
WITH batch AS
(
SELECT TOP(@batch_size) WITH TIES *, RANK() OVER(ORDER BY order_num) + COUNT(*) OVER(PARTITION BY order_num) - 1 AS cur_batch_size
FROM @t
WHERE batch_num IS NULL
ORDER BY order_num
)
UPDATE batch
SET batch_num = @batch_num
WHERE cur_batch_size <= @batch_size
END
SELECT *
FROM @t
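As a cross-check of what the loop does (a Python model of the same logic, not a replacement for the T-SQL): each pass claims the longest prefix of whole unassigned orders, in order-number sequence, whose combined line count fits within the batch size, and stamps it with the next batch number.

```python
from collections import Counter

def drews_batches(order_nums, batch_size):
    """Model of the T-SQL loop above. Per pass: take whole orders, in
    order-number sequence, while the cumulative line count (the T-SQL
    cur_batch_size) stays within batch_size. An order bigger than
    batch_size is never taken, so the UPDATE touches 0 rows and the
    loop ends with those rows left unassigned."""
    remaining = Counter(order_nums)      # order number -> line count
    batches = {}
    batch = 0
    while remaining:
        batch += 1
        taken = 0
        fitted = []
        for order in sorted(remaining):
            if taken + remaining[order] > batch_size:
                break                    # whole order would overflow this batch
            taken += remaining[order]
            fitted.append(order)
        if not fitted:                   # oversized first order: loop ends
            break
        for order in fitted:
            batches[order] = batch
            del remaining[order]
    return batches
```

Running it on the sample (1),(2),(2),(3),(3),(4) gives four batches at size 2 and two batches at size 3, matching what the T-SQL produces.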
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
November 4, 2017 at 1:11 pm
drew.allen - Saturday, November 4, 2017 9:22 AM
I don't think a RBAR method is the best option here; a while loop with a separate execution per batch is bound to add significant overhead compared to a set-based method. The code I posted uses a set-based method to produce a result set that can then be used to control the batching, although it only does a portion of the work your code does.
😎
November 5, 2017 at 12:35 am
Drew Allen & Eirikur Eiriksson,
Thanks a lot for your replies. I really appreciate your efforts.
I have been able to test the solution provided by Drew Allen, and it works precisely as desired. Thank you. The reason we need to assign batch numbers to the records and then process them by batch number, rather than simply processing in batches, is that a re-process procedure may need to run if the main procedure returns data validation errors that can be corrected. By that time, the original records in the staging table will have been cleared to make way for the next day's files, so we need all records in the process table with batch numbers against them.
I will also test the solution provided by Eirikur Eiriksson and get back with my updates on that.
Once again, thanks a lot to both of you. You are champions....
Sudarshan.
November 6, 2017 at 8:32 am
Eirikur Eiriksson - Saturday, November 4, 2017 1:10 PM
I don't think a RBAR method is the best option here, a while loop and a separate execution per batch is bound to add a significant overhead compared to a set based method.
This is not RBAR. It will process approximately 10000 records at a time, which is required by the original description. My approach was written with the idea of actually processing the records instead of just assigning a batch number, and a seemingly straightforward comparison of the two different approaches doesn't take that into account.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
November 6, 2017 at 9:33 am
drew.allen - Monday, November 6, 2017 8:32 AM
This is not RBAR. It will process approximately 10000 records at a time, which is required by the original description.
According to the original requirements, order lines from each order must be processed within a single batch, not across batches. Your solution doesn't cater for that, as there is no provision for grouping order lines into an order.
😎
An important point to be considered while allocating batch numbers is that Orders must be kept together.
November 6, 2017 at 9:59 am
Eirikur Eiriksson - Monday, November 6, 2017 9:33 AM
According to the original requirements, order lines from each order must be processed within a single batch, not across batches.
Actually mine does account for that. The RANK function orders by the order number and the COUNT counts the number of records within an order. These are combined to calculate the current batch size. This number will be the same for all records with the same order number, and those records will only be processed if that total is less than or equal to the desired batch size.
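That arithmetic can be verified by hand on the sample rows. A small Python computation of the same two window functions (an illustration only, assuming the rows are ordered by order number):

```python
def cur_batch_sizes(order_nums):
    """Per-row RANK() OVER (ORDER BY order_num)
       + COUNT(*) OVER (PARTITION BY order_num) - 1:
    the cumulative number of lines up to and including the row's whole
    order, identical for every line of the same order."""
    rows = sorted(order_nums)
    result = []
    for o in rows:
        rank = sum(1 for x in rows if x < o) + 1   # RANK(): tied rows share a rank
        count = rows.count(o)                      # lines in this row's order
        result.append(rank + count - 1)
    return result
```

For the sample (1),(2),(2),(3),(3),(4) this gives [1, 3, 3, 5, 5, 6]: with a batch size of 2 only order 1 qualifies on the first pass, exactly as the UPDATE filter requires.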
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
November 6, 2017 at 10:17 am
drew.allen - Monday, November 6, 2017 9:59 AM
Actually mine does account for that.
Sorry, I must be slow today
😎
November 6, 2017 at 12:39 pm
Eirikur Eiriksson - Saturday, November 4, 2017 8:44 AM
I went back and looked at this, and there is an error in the logic. It's not evident from the sample data used, but here is an example where a different set of sample data produces an incorrect result (all records are placed into the first batch). The only change I made was in setting up the sample data (the SAMPLE_DATA CTE).
DECLARE @SET_SIZE BIGINT = 500;
DECLARE @BATCH_SIZE BIGINT = 100;
DECLARE @NUM_LINES BIGINT = 29;
DECLARE @FIRST_ORDER BIGINT = 7984;
WITH T(N) AS (SELECT X.N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) X(N))
, NUMS(N) AS (SELECT TOP(@SET_SIZE) ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N
FROM T T1,T T2,T T3,T T4,T T5,T T6)
,SAMPLE_DATA AS
(
SELECT
@FIRST_ORDER + CASE
WHEN NM.n < 41 THEN 0
WHEN NM.n < 101 THEN 1
WHEN NM.n < 161 THEN 2
ELSE NM.n / 50 - 1
END AS ORDER_NUMBER
,NM.N AS ROWNUMBER
FROM NUMS NM
)
,SAMPLE_ORDERS AS
(
SELECT
SD.ORDER_NUMBER
,ROW_NUMBER() OVER
(
PARTITION BY SD.ORDER_NUMBER
ORDER BY SD.ROWNUMBER
) AS LINE_NUMBER
,SD.ROWNUMBER
FROM SAMPLE_DATA SD
)
,SO_MIN_MAX AS
(
SELECT
SO.ROWNUMBER
,SO.ORDER_NUMBER
,SO.LINE_NUMBER
,MIN(SO.ROWNUMBER) OVER (PARTITION BY SO.ORDER_NUMBER) AS START_ROW
,MAX(SO.ROWNUMBER) OVER (PARTITION BY SO.ORDER_NUMBER) AS END_ROW
FROM SAMPLE_ORDERS SO
)
,BASE_BATCH_MATRIX AS
(
SELECT
SMM.ROWNUMBER
,SMM.ORDER_NUMBER
,SMM.LINE_NUMBER
,SMM.START_ROW
,SMM.END_ROW
,CASE WHEN (FLOOR(SMM.ROWNUMBER / @BATCH_SIZE) + 1) * @BATCH_SIZE < SMM.END_ROW THEN 0 ELSE 1 END AS BATCH_MATRIX
FROM SO_MIN_MAX SMM
)
,MARKED_SET AS
(
SELECT
BBM.ORDER_NUMBER
,BBM.START_ROW
,BBM.END_ROW
,MAX(BBM.BATCH_MATRIX) - MIN(BBM.BATCH_MATRIX) AS BATCH_MARKER
FROM BASE_BATCH_MATRIX BBM
GROUP BY BBM.ORDER_NUMBER
,BBM.START_ROW
,BBM.END_ROW
)
SELECT
MS.ORDER_NUMBER
,MS.START_ROW
,MS.END_ROW
,SUM(MS.BATCH_MARKER) OVER
(
ORDER BY MS.START_ROW
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW
) + 1 AS BATCH_NUMBER
FROM MARKED_SET MS
;
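The failure can be reproduced outside SQL. Below is a Python model (an illustration of the same BATCH_MATRIX / BATCH_MARKER / running-SUM chain, taking each order's start and end row) showing that with boundary-aligned orders like the modified SAMPLE_DATA, no order ever produces both flag values, so every marker is 0 and every order lands in batch 1.

```python
def eirikur_batches(orders, batch_size):
    """orders: list of (order_number, start_row, end_row) in row order.
    BATCH_MATRIX is 0 when the next batch_size boundary above a row is
    still below the order's END_ROW, else 1; BATCH_MARKER (MAX - MIN)
    is 1 only when an order's rows produce both values; the running
    sum of markers plus one is the batch number."""
    batches, running = {}, 0
    for order, start, end in orders:
        flags = {(row // batch_size + 1) * batch_size >= end
                 for row in range(start, end + 1)}
        running += 1 if flags == {True, False} else 0
        batches[order] = running + 1
    return batches
```

With the original 29-line orders, the first boundary-straddling order (7987, rows 87-115) bumps the batch number as the posted output shows; with the modified data the marker never fires.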
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA