Dividing data equally in 4 sessions

Question

Dividing data equally in 4 sessions

S_Kumar_S

SSCrazy Eights

Points: 9037
More actions
October 26, 2020 at 6:56 pm

#3801440

Hi All
I have a table which contains the transactions done by cards. Below is the table structure and sample data:
create table CardTransactions(CardNo int, Transactionid int, Processed bit)
Insert into CardTransactions values(1,1001,0)
Insert into CardTransactions values(1,1002,0)
Insert into CardTransactions values(1,1003,0)
Insert into CardTransactions values(1,1004,0)
Insert into CardTransactions values(1,1005,0)
Insert into CardTransactions values(2,1006,0)
Insert into CardTransactions values(2,1007,0)
Insert into CardTransactions values(2,1008,0)
Insert into CardTransactions values(3,1009,0)
Insert into CardTransactions values(3,1010,0)
Insert into CardTransactions values(4,1011,0)
Insert into CardTransactions values(4,1012,0)
Insert into CardTransactions values(5,1013,0)
Insert into CardTransactions values(6,1014,0)
Insert into CardTransactions values(7,1015,0)
Now a procedure will be called from 4 sessions. Each session is expected to pick data in almost equal amount (not exact). So here we have 15 transactions, so each session should pick approx 4 transactions. Transaction of 1 card should stay only in one session. So here CardNo 1 has 5 transactions, so one of the sessions should get all 5 transactions, not 4. Second session might get CardNo 2 and 5 transactions(3+1=4). Third Session might get CardNo 3 and 4 (2+2=4). Fourth session will get CardNo 6 and 7(1+1=2) transactions.
Can any query or function in SQL Server does this kind of approximate distribution?
Thanks
Online Trainer For SQL DBA and Developer @RedBushTechnologies with 18 yrs exp.

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

pietlinden SSC Guru Points: 63501 More actions · Answer 1

NTILE(n) will divide the set into n nearly evenly sized chunks. Not sure how you would force full groups into a single "chunk" though

Mike01 SSChampion Points: 11395 More actions · Answer 2

I think you're going to need a master proc that has this logic or SSIS. Then the proc could call the procedure 4 times and pass in the TransactionIDs that the process should work with. You could kick them off asynchronously, then monitor for when they complete

For better, quicker answers, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Steve Collins SSCrazy Points: 2157 More actions · Answer 3

You could do it by offset paging if each of the "sessions" requests a different page. Is it possible for each request to query for a different page? Otherwise, it seems like you'd have to keep track of which rows were allocated and exhaust the supply

Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

Phil Parkin SSC Guru Points: 246959 More actions · Answer 4

Can you expand a little on what the 'sessions' are?

How should one 'session' know what the other sessions are doing (so that they don't have any overlapping or missed rows))?

How is the 'batch' of data which the four sessions are working with determined? Is this a process which will run repeatedly while transactions are being created?

What happens if that batch of data cannot be divided into four (because there are fewer than four card numbers in the batch)?

The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin

S_Kumar_S SSCrazy Eights Points: 9037 More actions · Answer 5

Phil Parkin wrote:

Can you expand a little on what the 'sessions' are?
How should one 'session' know what the other sessions are doing (so that they don't have any overlapping or missed rows))?
How is the 'batch' of data which the four sessions are working with determined? Is this a process which will run repeatedly while transactions are being created?
What happens if that batch of data cannot be divided into four (because there are fewer than four card numbers in the batch)?

Hi Phil

This proc is called from 4 different SSMS sessions i.e. through 4 spids. (actual implementation use SSIS to start 4 sessions, but for simplicity, we can assume we are running this proc manually by opening 4 SSMS sessions). Whatever cards are picked by a session, all their transactions are marked as Processed =1 so that other sessions are not picking the same cards again. (Atual implementation use tablockx etc to make sure no overlapping cards are picked).

Batch size can be determined like this: total no of transactions/4. So in this case, we get 15/4 ~4

The purpose of this entire exercise is to be able to run one single proc from multiple sessions to get performance gain as we have millions of records to processes. This gives us approx 40% performance gain when all other optimization tricks are already tried out.

Online Trainer For SQL DBA and Developer @RedBushTechnologies with 18 yrs exp.

Steve Collins SSCrazy Points: 2157 More actions · Answer 6

It appears you're looking to use horizontal partitioning.

Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

DesNorton SSC-Insane Points: 24384 More actions · Answer 7

The following code will not divide the values exactly. But since you have a large number of records, it uses the card number to split the records. It will guarantee that all the records from a given card number are kept in the same session.

Start by pre-diving the transactions in the table based on CardID

ALTER TABLE CardTransactions
ADD SessionID AS CardNo % 4 + 1 PERSISTED;

CREATE NONCLUSTERED INDEX ix_CardTransactions_Session
ON CardTransactions (Processed, SessionID)
INCLUDE (CardNo, Transactionid);

Then each process needs to know which session it is, and pass that into the proc to update the table and return the data

CREATE PROCEDURE dbo.pr_GetNextBatch
    @SessionID int        -- Paramter to identify the sesison
  , @BatchSize int = 1000 -- Paramter to limit batch sizes
AS
BEGIN
  UPDATE TOP ( @BatchSize )
  CardTransactions
    SET Processed = 1
  OUTPUT INSERTED.CardNo, INSERTED.Transactionid
  WHERE Processed = 0
    AND SessionID = @SessionID;
END;

How to post data/code on a forum to get the best help.

Phil Parkin SSC Guru Points: 246959 More actions · Answer 8

@desnorton, this sounds like a good solution, well done.

The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin

GaryV Hall of Fame Points: 3307 More actions · Answer 9

These schemes will fall apart if the data is too skewed. For example, if for some reason 80% of the card numbers were divisible by 4, one session would be doing 80% of the work.

A much more complex solution could be devised by getting a count of all the card numbers, and then assign them in card number order. The next card number goes to the session that has the least currently assigned to it. But cure of the overhead for doing that may be worse than the disease.

This reply was modified 4 years ago by GaryV.