Dividing a list of filenames into roughly equal batch sizes in KB

  • First a little bit of background:

I have a list of 14 files (and growing) that need to be processed. The files are dropped to a shared folder on our network and I have to sweep them into our database daily. The files always have the same names.

    I wrote an SSIS package that divides the file list into four pathways (numbered 0-3) and processes each path with a For Each Loop, in parallel. The split is done using a quantile function:

    (RANK() over(PARTITION BY proj_name order by file_name) - 1) * 4 / COUNT(*) over(PARTITION BY proj_name) as quartile

Each FEL looks at a specific quartile value to process. So, the first FEL is WHERE quartile = 0, the second is 1, and so on.
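As a side note, the arithmetic of that quantile expression can be sanity-checked outside of T-SQL. This is just an illustrative Python sketch (the file names are placeholders, and `//` stands in for T-SQL's integer division):

```python
# Mimic the T-SQL expression (RANK() - 1) * 4 / COUNT(*) with integer division.
files = sorted("ABCDEFGHIJKLMN")        # 14 placeholder file names
n = len(files)
quartile = {name: (rank - 1) * 4 // n
            for rank, name in enumerate(files, start=1)}

# How many files land in each of the four paths (0-3)?
counts = [sum(1 for q in quartile.values() if q == p) for p in range(4)]
print(counts)   # -> [4, 3, 4, 3]
```

The exact group boundaries depend on the quantile expression used, but the point is the same either way: the counts come out roughly even while the sizes can be badly skewed.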

    This works fine, but it's not really optimized for file size. Some of the files, particularly those earlier in the alphabet, are much larger than the others. This results in path 0 taking at least twice as long, if not more.

    What I want to do is, instead of splitting the file names into roughly even counts, I want to add a column to my table to include average file size and use that to split the files into roughly even sizes in KB. Example:

File Name   File Size (KB)   Original Split   Desired Split
---------   --------------   --------------   -------------
A           4,000            0                0
B           2,000            0                1
C           2,000            0                1
D           2,000            0                2
E           1,000            1                2
F           1,000            1                2
G           1,000            1                3
H           500              1                3
I           500              2                3
J           500              2                3
K           500              2                3
L           500              3                3
M           250              3                3
N           250              3                3

    I think this will end up being much more efficient, but at the moment, I can't think of a way to achieve this kind of split in T-SQL. Any help would be appreciated. Thanks in advance!

    -Paul G.

  • The included example might be just a little too easy since all of the file sizes are nice, even numbers. There are a couple of additional requirements to this:

    1- There must be at least one file per path.

    2- All files must be processed, so none can be left out.

You will find that a lot more people are willing to volunteer their time and effort if you provide some sort of sample to work with in a nice, consumable format.

    I put this together for you to show you what I mean.

create table #Files
(
    FileName char(1),
    FileSize int,
    OriginalSplit tinyint,
    DesiredSplit tinyint
)

insert #Files
select 'A', 4000, 0, 0 union all
select 'B', 2000, 0, 1 union all
select 'C', 2000, 0, 1 union all
select 'D', 2000, 0, 2 union all
select 'E', 1000, 1, 2 union all
select 'F', 1000, 1, 2 union all
select 'G', 1000, 1, 3 union all
select 'H', 500, 1, 3 union all
select 'I', 500, 2, 3 union all
select 'J', 500, 2, 3 union all
select 'K', 500, 2, 3 union all
select 'L', 500, 3, 3 union all
select 'M', 250, 3, 3 union all
select 'N', 250, 3, 3

Now everybody who looks at this is easily able to start working on your problem instead of taking the 20 minutes I did to format it.

The answer to your problem lies in how accurate you need this to be. To be really accurate you will probably need to look into running totals. Jeff Moden has written a great article about that here: http://www.sqlservercentral.com/articles/T-SQL/68467/

The following is at least a little closer without adding all the extra work of running totals. It is, however, not mathematically accurate, but given what you are describing, I don't know if that level of accuracy is totally necessary.

declare @Split1 int, @Split2 int, @Split3 int

select @Split1 = (MAX(FileSize) - MIN(FileSize)) * .75,
       @Split2 = (MAX(FileSize) - MIN(FileSize)) * .5,
       @Split3 = (MAX(FileSize) - MIN(FileSize)) * .25
from #Files

select @Split1, @Split2, @Split3

select *
     , case when FileSize > @Split1 then 0
       else case when FileSize between @Split2 and @Split1 then 1
            else case when FileSize between @Split3 and @Split2 then 2
                 else 3
                 end
            end
       end as NewSplit
from #Files

This may appear closer to an acceptable middle ground partly because of the flatness of the sample data (as you suggested). It may, however, be close enough that your processes end up with reasonably similar sizes.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

I guess my question would be: which is more important, using all 4 FELs or balancing the load? If you have one file that is much larger than all of the rest, as in the following example...

FileName   FileSize   RunningTotal   FEL
A          20000      20000          0
B          2000       2000           1
C          2000       4000           1
D          2000       6000           1
E          1000       7000           1
F          1000       8000           1
G          1000       1000           2
H          500        1500           2
I          500        2000           2
J          500        2500           2
K          500        3000           2
L          500        3500           2
M          250        3750           2
N          250        4000           2

Then there's no need to even bother with the 4th FEL, because the 2nd and 3rd FELs will complete the rest of the job before the first FEL can finish that one file. There's actually no advantage to bringing the 4th FEL online for this type of mix.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

The example given was just that, an example. The package was designed partially as a proof of concept (since I couldn't find any examples of how to parallelize a FEL process, just people complaining that a FEL can't do it by itself), and partially with the intention of being a template usable across multiple projects. All a user has to do is change the project name assigned to a package variable.

    As for the actual file sizes, I have 6 files that weigh in at less than 500KB. I have 1 at just over 600KB, 2 at 1000KB, 2 at 4000KB, and 1 at 13000KB. As you said, the fourth FEL could probably be switched off in this case. The first FEL could choke on the big file, both during the file copy process and the bulk insert. The second and third could likely bang out the two 4 meg files and all the smaller ones before the first one finishes, still leaving the fourth idle.
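To visualize that, here is a rough Python simulation of a size-balanced split using a greedy "give the next-largest file to the least-loaded path" rule. The sizes below only approximate what I described: the 400KB I've used for each of the six small files is a made-up stand-in for "less than 500KB".

```python
# Approximate sizes in KB: 1 x 13000, 2 x 4000, 2 x 1000, 1 x 600,
# and six small files ASSUMED at 400 each (the real ones are "under 500").
sizes = [13000, 4000, 4000, 1000, 1000, 600] + [400] * 6

loads = [0, 0, 0, 0]                      # running total per FEL path
for size in sorted(sizes, reverse=True):  # hand out the largest file first
    path = min(range(4), key=lambda p: loads[p])  # least-loaded path so far
    loads[path] += size

print(loads)   # -> [13000, 4400, 4400, 4200]
```

Path 0 is still dominated by the 13MB file no matter what, but paths 1-3 come out nearly even instead of one of them sitting idle.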

    I'm almost entirely self-taught, so I appreciate the guidance. Thanks!

  • dembones79 (2/28/2012)


    The example given was just that, an example.

    No problem. I get that. I just wanted to know if an idle FEL or two would be OK if such a thing were to occur.

    I'm on my way to work. I'll try to get back to this tonight. Sean was correct. A "Quirky Update" may be just what the doctor ordered.

    --Jeff Moden



  • Yeah, that came off a bit ruder than I intended. I hadn't had my caffeine yet. I'm sorry.

    Thanks again for the help. 🙂

  • Yowch. I set up a spreadsheet to emulate the Quirky Update and I've had no joy getting something to work in a set based fashion for all scenarios. This reminds me of Hugo Kornelis' "Bin Stacking" problem (it's nearly identical, in fact) where he stated that he hasn't been able to find a set based method that works with all scenarios. Sure, you can come close but all it takes is one special scenario and BOOM! For example...

Ref   filename   filesize
1     A          2750
2     B          2750
3     C          2750
4     D          2750
5     E          2750
6     F          500
7     G          500
8     H          500
9     I          500
10    J          500
11    K          500
12    L          500
13    M          250
14    N          250

    I've got the funny feeling that Hugo is correct for these types of things. To do it accurately for every scenario, it's going to take a loop and an accumulator for each "BIN" (FEL in this case).

    I was going to suggest using Sean's code but it takes a major beating (as does the Quirky Update) on the scenario above. Here's the data from above with Sean's code.

DROP TABLE #Files
GO

create table #Files
(
    FileName char(1),
    FileSize int,
    OriginalSplit tinyint,
    DesiredSplit tinyint
)

insert #Files
select 'A', 2750, 0, 0 union all
select 'B', 2750, 0, 1 union all
select 'C', 2750, 0, 1 union all
select 'D', 2750, 0, 2 union all
select 'E', 2750, 1, 2 union all
select 'F', 500, 1, 2 union all
select 'G', 500, 1, 3 union all
select 'H', 500, 1, 3 union all
select 'I', 500, 2, 3 union all
select 'J', 500, 2, 3 union all
select 'K', 500, 2, 3 union all
select 'L', 500, 3, 3 union all
select 'M', 250, 3, 3 union all
select 'N', 250, 3, 3

select filename, filesize from #Files

declare @Split1 int, @Split2 int, @Split3 int

select @Split1 = (MAX(FileSize) - MIN(FileSize)) * .75,
       @Split2 = (MAX(FileSize) - MIN(FileSize)) * .5,
       @Split3 = (MAX(FileSize) - MIN(FileSize)) * .25
from #Files

select @Split1, @Split2, @Split3

select *
     , case when FileSize > @Split1 then 0
       else case when FileSize between @Split2 and @Split1 then 1
            else case when FileSize between @Split3 and @Split2 then 2
                 else 3
                 end
            end
       end as NewSplit
from #Files

That results in the following, which doesn't quite work out right...

FileName FileSize    OriginalSplit DesiredSplit NewSplit
-------- ----------- ------------- ------------ -----------
A        2750        0             0            0
B        2750        0             1            0
C        2750        0             1            0
D        2750        0             2            0
E        2750        1             2            0
F        500         1             2            3
G        500         1             3            3
H        500         1             3            3
I        500         2             3            3
J        500         2             3            3
K        500         2             3            3
L        500         3             3            3
M        250         3             3            3
N        250         3             3            3

    --Jeff Moden



  • Ok... since we had to use a loop for this, I tried to make it as efficient as possible.

    First, the really nasty test data I posted in my previous post just to keep it all together. Thank you Sean for putting together the first bit of test data generation code.

--===== Conditionally drop Temp Table(s) to make reruns in SSMS easier
IF OBJECT_ID('tempdb..#Files','U') IS NOT NULL
   DROP TABLE #Files
;
GO
CREATE TABLE #Files
(
    RowNum   INT IDENTITY(1,1),
    FileName CHAR(1) PRIMARY KEY NONCLUSTERED,
    FileSize BIGINT,
    FEL      TINYINT
)
;
INSERT #Files
       (FileName, FileSize)
SELECT 'A', 2750 UNION ALL
SELECT 'B', 2750 UNION ALL
SELECT 'C', 2750 UNION ALL
SELECT 'D', 2750 UNION ALL
SELECT 'E', 2750 UNION ALL
SELECT 'F', 500 UNION ALL
SELECT 'G', 500 UNION ALL
SELECT 'H', 500 UNION ALL
SELECT 'I', 500 UNION ALL
SELECT 'J', 500 UNION ALL
SELECT 'K', 500 UNION ALL
SELECT 'L', 500 UNION ALL
SELECT 'M', 250 UNION ALL
SELECT 'N', 250
;

    Now the (ugh!) solution. As always, the details of the (haaaaccckkk...) loop code (paaaattooooiiii!) are in the comments of the code. I really hated to use a loop and if someone can come up with a guaranteed accurate set-based way of doing this with an assignable number of "Bins" (FELs, in this case), you'd not only make my day but I think you'd make Hugo's year!

    I'd settle for a nice recursive CTE but remember that things with Triangular Joins aren't what I call "Set-Based" because they "touch" the same rows a whole lot more than once.

--========================================================================================
--      Presets (made all the presets compatible with all versions of SQL Server)
--========================================================================================
--===== Conditionally drop Temp Table(s) to make reruns in SSMS easier
IF OBJECT_ID('tempdb..#Accumulator','U') IS NOT NULL
   DROP TABLE #Accumulator
;
--===== Suppress the auto-display of rowcounts because we're going to use a loop.
     -- This will actually help the loop run faster.
SET NOCOUNT ON
;
--===== If anything in the loop goes wrong, this will force a "quit" and do a rollback.
     -- You'll see why we use a transaction in a bit.
SET XACT_ABORT ON
;
--===== Declare some obviously named variables
DECLARE @CurrentFile INT, --Current RowNum we're working on in the #Files table
        @Fel         INT, --This could be a parameter for a stored procedure
        @Files       INT, --Total count of rows in the #Files table
        @FileSize    INT, --File size from the current row in the #Files table
        @TgtFel      INT  --The FEL in the accumulator table with the least number of bytes
;
--===== Preset the variables
 SELECT @Fel         = 4,
        @Files       = MAX(RowNum),
        @CurrentFile = 1
   FROM #Files
;
--===== Create the accumulator table where we'll keep track
     -- of the total bytes assigned to each FEL.
 SELECT TOP (@Fel)
        FEL = IDENTITY(INT,0,1), Bytes = CAST(0 AS BIGINT)
   INTO #Accumulator
   FROM sys.all_columns ac1 --Use dbo.syscolumns for SQL Server 2000 or earlier
;
--========================================================================================
--      All set. Make the FEL assignments for each file in the #Files table.
--========================================================================================
BEGIN TRANSACTION --This will add a bit more speed to the loop because it will commit
                  --all of the rows we change all at once instead of individually.
WHILE @CurrentFile <= @Files
BEGIN
        --===== Get the file size from the current file row.
         SELECT @FileSize = FileSize
           FROM #Files
          WHERE RowNum = @CurrentFile
        ;
        --===== Find the FEL with the least number of bytes assigned
         SELECT TOP 1
                @TgtFel = FEL
           FROM #Accumulator
          ORDER BY Bytes ASC, FEL ASC
        ;
        --===== Add the bytes to the FEL we just found in the accumulator.
         UPDATE #Accumulator
            SET Bytes = Bytes + @FileSize
          WHERE FEL = @TgtFel
        ;
        --===== Assign the FEL we just found to the file row.
         UPDATE #Files
            SET FEL = @TgtFel
          WHERE RowNum = @CurrentFile
        ;
        --===== Get ready to read the next file row
         SELECT @CurrentFile = @CurrentFile + 1
        ;
END
;
COMMIT
;
--========================================================================================
--      Display the results
--========================================================================================
SELECT * FROM #Accumulator;
SELECT * FROM #Files;

    Here are the results from the run above.

FEL         Bytes
----------- --------------------
0           5500
1           4250
2           4000
3           4000

RowNum      FileName FileSize             FEL
----------- -------- -------------------- ----
1           A        2750                 0
2           B        2750                 1
3           C        2750                 2
4           D        2750                 3
5           E        2750                 0
6           F        500                  1
7           G        500                  2
8           H        500                  3
9           I        500                  1
10          J        500                  2
11          K        500                  3
12          L        500                  1
13          M        250                  2
14          N        250                  3

    --Jeff Moden



Here's the result using the original data that Sean used. It's a bit different from the originally posted desired result, but it has the same effect... all of the FELs are assigned the same number of bytes to work with.

FEL         Bytes
----------- --------------------
0           4000
1           4000
2           4000
3           4000

RowNum      FileName FileSize             FEL
----------- -------- -------------------- ----
1           A        4000                 0
2           B        2000                 1
3           C        2000                 2
4           D        2000                 3
5           E        1000                 1
6           F        1000                 2
7           G        1000                 3
8           H        500                  1
9           I        500                  2
10          J        500                  3
11          K        500                  1
12          L        500                  2
13          M        250                  3
14          N        250                  3

    --Jeff Moden



I too suspected this wasn't going to end up set-based because of the nature of the whole thing. I sort of alluded to that in my post when I said the code I posted was only reasonably close on the pretty flat example data. I had a feeling that on real data it wouldn't be any better than what the OP was doing originally.


  • Wow! This is awesome. It'll take me a bit to process this. 🙂 Thanks again for all of the help!

  • dembones79 (3/1/2012)


    Wow! This is awesome. It'll take me a bit to process this. 🙂 Thanks again for all of the help!

    In retrospect, it's actually pretty easy. The most important part is that the files have to be sorted by the FileSize in descending order and the "RowNum" column must reflect that sort order.

    After that, a simple table containing 1 entry for each FEL is created and the "Total Filesize" column is preset to zero.

The loop then reads the first row from the #Files table, finds the "top 1" FEL with the smallest total, adds the FileSize to that total, and assigns that "top 1" FEL to the row in #Files. The next iteration does the same thing. Since the row we just updated for the "total" is no longer 0, it won't be the "top 1" smallest total anymore. It keeps doing that... always adding the file size to the smallest "total" and using that FEL as the assignment for the row in #Files.
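For anyone who wants to play with the idea outside of SQL Server, here is a hedged Python sketch of that same "smallest total wins" loop. A heap keeps the least-loaded FEL on top, and ties break on the lower FEL number, mirroring the ORDER BY Bytes ASC, FEL ASC in the T-SQL (Python and heapq are illustrative here, not part of the actual package):

```python
import heapq

def assign_fels(files, num_fels=4):
    """Greedy assignment: largest file first, always to the FEL
    with the smallest running total of bytes (ties -> lower FEL)."""
    heap = [(0, fel) for fel in range(num_fels)]   # (total_bytes, fel)
    heapq.heapify(heap)
    assignments = {}
    for name, size in sorted(files, key=lambda f: -f[1]):
        total, fel = heapq.heappop(heap)           # least-loaded FEL
        assignments[name] = fel
        heapq.heappush(heap, (total + size, fel))
    return assignments, sorted(heap)

# The "really nasty" data from above: five 2750s, seven 500s, two 250s.
files = list(zip("ABCDEFGHIJKLMN", [2750] * 5 + [500] * 7 + [250] * 2))
assignments, totals = assign_fels(files)
print(totals)   # -> [(4000, 2), (4000, 3), (4250, 1), (5500, 0)]
```

Those totals match the #Accumulator output above: one FEL has to eat the fifth 2750, and the others even out at 4000-4250.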

    --Jeff Moden



Just for the record, this is a form of the 'Bin Packing' problem; it's a classic AI and optimisation problem, and it's NP-hard in the general case.

    http://en.wikipedia.org/wiki/Bin_packing

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
You know, I have a different approach, and I think it was mentioned above. Why do we care which process handles a particular file, as long as they all get handled, and as fast as possible? It just so happens I know of an article or two here on SSC that cover this:

    Part 1

http://www.sqlservercentral.com/articles/Integration+Services+(SSIS)/70346/

    Part 2

http://www.sqlservercentral.com/articles/Integration+Services+(SSIS)/70347/

    CEWII
