Dividing data equally in 4 sessions

  • Hi All

    I have a table which contains the transactions done by cards. Below is the table structure and sample data:

    create table CardTransactions(CardNo int, Transactionid int, Processed bit)

    Insert into CardTransactions values(1,1001,0)

    Insert into CardTransactions values(1,1002,0)

    Insert into CardTransactions values(1,1003,0)

    Insert into CardTransactions values(1,1004,0)

    Insert into CardTransactions values(1,1005,0)

    Insert into CardTransactions values(2,1006,0)

    Insert into CardTransactions values(2,1007,0)

    Insert into CardTransactions values(2,1008,0)

    Insert into CardTransactions values(3,1009,0)

    Insert into CardTransactions values(3,1010,0)

    Insert into CardTransactions values(4,1011,0)

    Insert into CardTransactions values(4,1012,0)

    Insert into CardTransactions values(5,1013,0)

    Insert into CardTransactions values(6,1014,0)

    Insert into CardTransactions values(7,1015,0)

    Now a procedure will be called from 4 sessions. Each session is expected to pick data in almost equal amount (not exact). So here we have 15 transactions, so each session should pick approx 4 transactions. Transaction of 1 card should stay only in one session. So here CardNo 1 has 5 transactions, so one of the sessions should get all 5 transactions, not 4. Second session might get CardNo 2 and 5 transactions(3+1=4). Third Session might get CardNo 3 and 4 (2+2=4). Fourth session will get CardNo 6 and 7(1+1=2) transactions.

    Can any query or function in SQL Server does this kind of approximate distribution?

    Thanks

    Online Trainer For SQL DBA and Developer @RedBushTechnologies with 18 yrs exp.

  • NTILE(n) will divide the set into n nearly evenly sized chunks. Not sure how you would force full groups into a single "chunk" though

  • I think you're going to need a master proc that has this logic or SSIS.  Then the proc could call the procedure 4 times and pass in the TransactionIDs that the process should work with.  You could kick them off asynchronously, then monitor for when they complete

    For better, quicker answers, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • You could do it by offset paging if each of the "sessions" requests a different page.  Is it possible for each request to query for a different page?  Otherwise, it seems like you'd have to keep track of which rows were allocated and exhaust the supply

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • Can you expand a little on what the 'sessions' are?

    How should one 'session' know what the other sessions are doing (so that they don't have any overlapping or missed rows))?

    How is the 'batch' of data which the four sessions are working with determined? Is this a process which will run repeatedly while transactions are being created?

    What happens if that batch of data cannot be divided into four (because there are fewer than four card numbers in the batch)?

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Phil Parkin wrote:

    Can you expand a little on what the 'sessions' are?

    How should one 'session' know what the other sessions are doing (so that they don't have any overlapping or missed rows))?

    How is the 'batch' of data which the four sessions are working with determined? Is this a process which will run repeatedly while transactions are being created?

    What happens if that batch of data cannot be divided into four (because there are fewer than four card numbers in the batch)?

    Hi Phil

    This proc is called from 4 different SSMS sessions i.e. through 4 spids. (actual implementation use SSIS to start 4 sessions, but for simplicity, we can assume we are running this proc manually by opening 4 SSMS sessions). Whatever cards are picked by a session, all their transactions are marked as Processed =1 so that other sessions are not picking the same cards again. (Atual implementation use tablockx etc to make sure no overlapping cards are picked).

    Batch size can be determined like this: total no of transactions/4. So in this case, we get 15/4 ~4

    The purpose of this entire exercise is to be able to run one single proc from multiple sessions to get performance gain as we have millions of records to processes. This gives us approx 40% performance gain when all other optimization tricks are already tried out.

    Online Trainer For SQL DBA and Developer @RedBushTechnologies with 18 yrs exp.

  • It appears you're looking to use horizontal partitioning.

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • The following code will not divide the values exactly.  But since you have a large number of records, it uses the card number to split the records.  It will guarantee that all the records from a given card number are kept in the same session.

    Start by pre-diving the transactions in the table based on CardID

    ALTER TABLE CardTransactions
    ADD SessionID AS CardNo % 4 + 1 PERSISTED;

    CREATE NONCLUSTERED INDEX ix_CardTransactions_Session
    ON CardTransactions (Processed, SessionID)
    INCLUDE (CardNo, Transactionid);

    Then each process needs to know which session it is, and pass that into the proc to update the table and return the data

    CREATE PROCEDURE dbo.pr_GetNextBatch
    @SessionID int -- Paramter to identify the sesison
    , @BatchSize int = 1000 -- Paramter to limit batch sizes
    AS
    BEGIN
    UPDATE TOP ( @BatchSize )
    CardTransactions
    SET Processed = 1
    OUTPUT INSERTED.CardNo, INSERTED.Transactionid
    WHERE Processed = 0
    AND SessionID = @SessionID;
    END;

  • @desnorton, this sounds like a good solution, well done.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • These schemes will fall apart if the data is too skewed.  For example, if for some reason 80% of the card numbers were divisible by 4, one session would be doing 80% of the work.

    A much more complex solution could be devised by getting a count of all the card numbers, and then assign them in card number order.  The next card number goes to the session that has the least currently assigned to it.  But cure of the overhead for doing that may be worse than the disease.

    • This reply was modified 4 years ago by  GaryV.

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply