Dividing a list of filenames into roughly equal batch sizes in KB

  • First a little bit of background:

I have a list of 14 files (and growing) that need to be processed. The files are dropped to a shared folder on our network and I have to sweep them into our database daily. The files always have the same names.

    I wrote an SSIS package that divides the file list into four pathways (numbered 0-3) and processes each path with a For Each Loop, in parallel. The split is done using a quantile function:

    (RANK() over(PARTITION BY proj_name order by file_name) - 1) * 4 / COUNT(*) over(PARTITION BY proj_name) as quartile

Each FEL looks at a specific quartile value to process. So, the first FEL is WHERE quartile = 0, the second is 1, and so on.
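As a side note, the arithmetic of that quantile expression can be sanity-checked outside of T-SQL. This is just an illustrative Python sketch (the file names are placeholders, and `//` stands in for T-SQL's integer division):

```python
# Mimic the T-SQL expression (RANK() - 1) * 4 / COUNT(*) with integer division.
files = sorted("ABCDEFGHIJKLMN")        # 14 placeholder file names
n = len(files)
quartile = {name: (rank - 1) * 4 // n
            for rank, name in enumerate(files, start=1)}

# How many files land in each of the four paths (0-3)?
counts = [sum(1 for q in quartile.values() if q == p) for p in range(4)]
print(counts)   # -> [4, 3, 4, 3]
```

The exact group boundaries depend on the quantile expression used, but the point is the same either way: the counts come out roughly even while the sizes can be badly skewed.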

    This works fine, but it's not really optimized for file size. Some of the files, particularly those earlier in the alphabet, are much larger than the others. This results in path 0 taking at least twice as long, if not more.

    What I want to do is, instead of splitting the file names into roughly even counts, I want to add a column to my table to include average file size and use that to split the files into roughly even sizes in KB. Example:

File Name   File Size (KB)   Original Split   Desired Split
---------   --------------   --------------   -------------
A           4,000            0                0
B           2,000            0                1
C           2,000            0                1
D           2,000            0                2
E           1,000            1                2
F           1,000            1                2
G           1,000            1                3
H           500              1                3
I           500              2                3
J           500              2                3
K           500              2                3
L           500              3                3
M           250              3                3
N           250              3                3

    I think this will end up being much more efficient, but at the moment, I can't think of a way to achieve this kind of split in T-SQL. Any help would be appreciated. Thanks in advance!

    -Paul G.

  • The included example might be just a little too easy since all of the file sizes are nice, even numbers. There are a couple of additional requirements to this:

    1- There must be at least one file per path.

    2- All files must be processed, so none can be left out.

You will find that a lot more people are willing to volunteer their time and effort if you provide some sort of sample to work with in a nice, consumable format.

    I put this together for you to show you what I mean.

create table #Files
(
    FileName char(1),
    FileSize int,
    OriginalSplit tinyint,
    DesiredSplit tinyint
)

insert #Files
select 'A', 4000, 0, 0 union all
select 'B', 2000, 0, 1 union all
select 'C', 2000, 0, 1 union all
select 'D', 2000, 0, 2 union all
select 'E', 1000, 1, 2 union all
select 'F', 1000, 1, 2 union all
select 'G', 1000, 1, 3 union all
select 'H', 500, 1, 3 union all
select 'I', 500, 2, 3 union all
select 'J', 500, 2, 3 union all
select 'K', 500, 2, 3 union all
select 'L', 500, 3, 3 union all
select 'M', 250, 3, 3 union all
select 'N', 250, 3, 3

Now everybody who looks at this is easily able to start working on your problem instead of taking the 20 minutes I did to format it.

The answer to your problem lies in how accurate you need this to be. To be really accurate you will probably need to look into running totals. Jeff Moden has written a great article about that here: http://www.sqlservercentral.com/articles/T-SQL/68467/

The following is at least a little closer without adding all the extra work of running totals. It is, however, not mathematically accurate, but given what you are describing, I don't know if that level of accuracy is totally necessary.

declare @Split1 int, @Split2 int, @Split3 int

select @Split1 = (MAX(FileSize) - MIN(FileSize)) * .75,
       @Split2 = (MAX(FileSize) - MIN(FileSize)) * .5,
       @Split3 = (MAX(FileSize) - MIN(FileSize)) * .25
from #Files

select @Split1, @Split2, @Split3

select *
     , case when FileSize > @Split1 then 0
       else case when FileSize between @Split2 and @Split1 then 1
            else case when FileSize between @Split3 and @Split2 then 2
                 else 3
                 end
            end
       end as NewSplit
from #Files

This may appear closer to an acceptable middle ground partly because of the flatness of the sample data (as you suggested). It may, however, be close enough that your processes end up with reasonably similar sizes.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

I guess my question would be: which is more important, using all 4 FELs or balancing the load? If you have one file that is much larger than all of the rest, as in the following example...

FileName   FileSize   RunningTotal   FEL
A          20000      20000          0
B          2000       2000           1
C          2000       4000           1
D          2000       6000           1
E          1000       7000           1
F          1000       8000           1
G          1000       1000           2
H          500        1500           2
I          500        2000           2
J          500        2500           2
K          500        3000           2
L          500        3500           2
M          250        3750           2
N          250        4000           2

Then there's no need to even bother with the 4th FEL, because the 2nd and 3rd FELs will complete the rest of the job before the first FEL can finish that one file. There's actually no advantage to bringing the 4th FEL online for this type of mix.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

The example given was just that, an example. The package was designed partially as a proof of concept (since I couldn't find any examples of how to parallelize a FEL process, just people complaining that a FEL can't do it by itself), and partially with the intention of being a template usable across multiple projects. All a user has to do is change the project name assigned to a package variable.

    As for the actual file sizes, I have 6 files that weigh in at less than 500KB. I have 1 at just over 600KB, 2 at 1000KB, 2 at 4000KB, and 1 at 13000KB. As you said, the fourth FEL could probably be switched off in this case. The first FEL could choke on the big file, both during the file copy process and the bulk insert. The second and third could likely bang out the two 4 meg files and all the smaller ones before the first one finishes, still leaving the fourth idle.
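To visualize that, here is a rough Python simulation of a size-balanced split using a greedy "give the next-largest file to the least-loaded path" rule. The sizes below only approximate what I described: the 400KB I've used for each of the six small files is a made-up stand-in for "less than 500KB".

```python
# Approximate sizes in KB: 1 x 13000, 2 x 4000, 2 x 1000, 1 x 600,
# and six small files ASSUMED at 400 each (the real ones are "under 500").
sizes = [13000, 4000, 4000, 1000, 1000, 600] + [400] * 6

loads = [0, 0, 0, 0]                      # running total per FEL path
for size in sorted(sizes, reverse=True):  # hand out the largest file first
    path = min(range(4), key=lambda p: loads[p])  # least-loaded path so far
    loads[path] += size

print(loads)   # -> [13000, 4400, 4400, 4200]
```

Path 0 is still dominated by the 13MB file no matter what, but paths 1-3 come out nearly even instead of one of them sitting idle.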

    I'm almost entirely self-taught, so I appreciate the guidance. Thanks!

  • dembones79 (2/28/2012)


    The example given was just that, an example.

    No problem. I get that. I just wanted to know if an idle FEL or two would be OK if such a thing were to occur.

    I'm on my way to work. I'll try to get back to this tonight. Sean was correct. A "Quirky Update" may be just what the doctor ordered.

    --Jeff Moden



  • Yeah, that came off a bit ruder than I intended. I hadn't had my caffeine yet. I'm sorry.

    Thanks again for the help. 🙂

  • Yowch. I set up a spreadsheet to emulate the Quirky Update and I've had no joy getting something to work in a set based fashion for all scenarios. This reminds me of Hugo Kornelis' "Bin Stacking" problem (it's nearly identical, in fact) where he stated that he hasn't been able to find a set based method that works with all scenarios. Sure, you can come close but all it takes is one special scenario and BOOM! For example...

Ref   filename   filesize
1     A          2750
2     B          2750
3     C          2750
4     D          2750
5     E          2750
6     F          500
7     G          500
8     H          500
9     I          500
10    J          500
11    K          500
12    L          500
13    M          250
14    N          250

    I've got the funny feeling that Hugo is correct for these types of things. To do it accurately for every scenario, it's going to take a loop and an accumulator for each "BIN" (FEL in this case).

    I was going to suggest using Sean's code but it takes a major beating (as does the Quirky Update) on the scenario above. Here's the data from above with Sean's code.

DROP TABLE #Files
GO

create table #Files
(
    FileName char(1),
    FileSize int,
    OriginalSplit tinyint,
    DesiredSplit tinyint
)

insert #Files
select 'A', 2750, 0, 0 union all
select 'B', 2750, 0, 1 union all
select 'C', 2750, 0, 1 union all
select 'D', 2750, 0, 2 union all
select 'E', 2750, 1, 2 union all
select 'F', 500, 1, 2 union all
select 'G', 500, 1, 3 union all
select 'H', 500, 1, 3 union all
select 'I', 500, 2, 3 union all
select 'J', 500, 2, 3 union all
select 'K', 500, 2, 3 union all
select 'L', 500, 3, 3 union all
select 'M', 250, 3, 3 union all
select 'N', 250, 3, 3

select filename, filesize from #Files

declare @Split1 int, @Split2 int, @Split3 int

select @Split1 = (MAX(FileSize) - MIN(FileSize)) * .75,
       @Split2 = (MAX(FileSize) - MIN(FileSize)) * .5,
       @Split3 = (MAX(FileSize) - MIN(FileSize)) * .25
from #Files

select @Split1, @Split2, @Split3

select *
     , case when FileSize > @Split1 then 0
       else case when FileSize between @Split2 and @Split1 then 1
            else case when FileSize between @Split3 and @Split2 then 2
                 else 3
                 end
            end
       end as NewSplit
from #Files

That results in the following, which doesn't quite work out right...

FileName FileSize    OriginalSplit DesiredSplit NewSplit
-------- ----------- ------------- ------------ -----------
A        2750        0             0            0
B        2750        0             1            0
C        2750        0             1            0
D        2750        0             2            0
E        2750        1             2            0
F        500         1             2            3
G        500         1             3            3
H        500         1             3            3
I        500         2             3            3
J        500         2             3            3
K        500         2             3            3
L        500         3             3            3
M        250         3             3            3
N        250         3             3            3

    --Jeff Moden



  • Ok... since we had to use a loop for this, I tried to make it as efficient as possible.

    First, the really nasty test data I posted in my previous post just to keep it all together. Thank you Sean for putting together the first bit of test data generation code.

--===== Conditionally drop Temp Table(s) to make reruns in SSMS easier
IF OBJECT_ID('tempdb..#Files','U') IS NOT NULL
   DROP TABLE #Files
;
GO
CREATE TABLE #Files
(
    RowNum   INT IDENTITY(1,1),
    FileName CHAR(1) PRIMARY KEY NONCLUSTERED,
    FileSize BIGINT,
    FEL      TINYINT
)
;
INSERT #Files
       (FileName, FileSize)
SELECT 'A', 2750 UNION ALL
SELECT 'B', 2750 UNION ALL
SELECT 'C', 2750 UNION ALL
SELECT 'D', 2750 UNION ALL
SELECT 'E', 2750 UNION ALL
SELECT 'F', 500 UNION ALL
SELECT 'G', 500 UNION ALL
SELECT 'H', 500 UNION ALL
SELECT 'I', 500 UNION ALL
SELECT 'J', 500 UNION ALL
SELECT 'K', 500 UNION ALL
SELECT 'L', 500 UNION ALL
SELECT 'M', 250 UNION ALL
SELECT 'N', 250
;

    Now the (ugh!) solution. As always, the details of the (haaaaccckkk...) loop code (paaaattooooiiii!) are in the comments of the code. I really hated to use a loop and if someone can come up with a guaranteed accurate set-based way of doing this with an assignable number of "Bins" (FELs, in this case), you'd not only make my day but I think you'd make Hugo's year!

    I'd settle for a nice recursive CTE but remember that things with Triangular Joins aren't what I call "Set-Based" because they "touch" the same rows a whole lot more than once.

--========================================================================================
--      Presets (made all the presets compatible with all versions of SQL Server)
--========================================================================================
--===== Conditionally drop Temp Table(s) to make reruns in SSMS easier
IF OBJECT_ID('tempdb..#Accumulator','U') IS NOT NULL
   DROP TABLE #Accumulator
;
--===== Suppress the auto-display of rowcounts because we're going to use a loop.
     -- This will actually help the loop run faster.
SET NOCOUNT ON
;
--===== If anything in the loop goes wrong, this will force a "quit" and do a rollback.
     -- You'll see why we use a transaction in a bit.
SET XACT_ABORT ON
;
--===== Declare some obviously named variables
DECLARE @CurrentFile INT, --Current RowNum we're working on in the #Files table
        @Fel         INT, --This could be a parameter for a stored procedure
        @Files       INT, --Total count of rows in the #Files table
        @FileSize    INT, --File size from the current row in the #Files table
        @TgtFel      INT  --The FEL in the accumulator table with the least number of bytes
;
--===== Preset the variables
 SELECT @Fel         = 4,
        @Files       = MAX(RowNum),
        @CurrentFile = 1
   FROM #Files
;
--===== Create the accumulator table where we'll keep track
     -- of the total bytes assigned to each FEL.
 SELECT TOP (@Fel)
        FEL = IDENTITY(INT,0,1), Bytes = CAST(0 AS BIGINT)
   INTO #Accumulator
   FROM sys.all_columns ac1 --Use dbo.syscolumns for SQL Server 2000 or earlier
;
--========================================================================================
--      All set. Make the FEL assignments for each file in the #Files table.
--========================================================================================
BEGIN TRANSACTION --This will add a bit more speed to the loop because it will commit
                  --all of the rows we change all at once instead of individually.
WHILE @CurrentFile <= @Files
BEGIN
        --===== Get the file size from the current file row.
         SELECT @FileSize = FileSize
           FROM #Files
          WHERE RowNum = @CurrentFile
        ;
        --===== Find the FEL with the least number of bytes assigned
         SELECT TOP 1
                @TgtFel = FEL
           FROM #Accumulator
          ORDER BY Bytes ASC, FEL ASC
        ;
        --===== Add the bytes to the FEL we just found in the accumulator.
         UPDATE #Accumulator
            SET Bytes = Bytes + @FileSize
          WHERE FEL = @TgtFel
        ;
        --===== Assign the FEL we just found to the file row.
         UPDATE #Files
            SET FEL = @TgtFel
          WHERE RowNum = @CurrentFile
        ;
        --===== Get ready to read the next file row
         SELECT @CurrentFile = @CurrentFile + 1
        ;
END
;
COMMIT
;
--========================================================================================
--      Display the results
--========================================================================================
SELECT * FROM #Accumulator;
SELECT * FROM #Files;

    Here are the results from the run above.

FEL         Bytes
----------- --------------------
0           5500
1           4250
2           4000
3           4000

RowNum      FileName FileSize             FEL
----------- -------- -------------------- ----
1           A        2750                 0
2           B        2750                 1
3           C        2750                 2
4           D        2750                 3
5           E        2750                 0
6           F        500                  1
7           G        500                  2
8           H        500                  3
9           I        500                  1
10          J        500                  2
11          K        500                  3
12          L        500                  1
13          M        250                  2
14          N        250                  3

    --Jeff Moden



Here's the result using the original data that Sean used. It's a bit different from the originally posted desired result, but it has the same effect... all of the FELs are assigned the same number of bytes to work with.

FEL         Bytes
----------- --------------------
0           4000
1           4000
2           4000
3           4000

RowNum      FileName FileSize             FEL
----------- -------- -------------------- ----
1           A        4000                 0
2           B        2000                 1
3           C        2000                 2
4           D        2000                 3
5           E        1000                 1
6           F        1000                 2
7           G        1000                 3
8           H        500                  1
9           I        500                  2
10          J        500                  3
11          K        500                  1
12          L        500                  2
13          M        250                  3
14          N        250                  3

    --Jeff Moden



I too suspected this wasn't going to end up set-based because of the nature of the whole thing. I sort of alluded to that in my post when I said the code I posted was only reasonably close on the pretty flat example data. I had a feeling that on real data it wouldn't be any better than what the OP was doing originally.


  • Wow! This is awesome. It'll take me a bit to process this. 🙂 Thanks again for all of the help!

  • dembones79 (3/1/2012)


    Wow! This is awesome. It'll take me a bit to process this. 🙂 Thanks again for all of the help!

    In retrospect, it's actually pretty easy. The most important part is that the files have to be sorted by the FileSize in descending order and the "RowNum" column must reflect that sort order.

    After that, a simple table containing 1 entry for each FEL is created and the "Total Filesize" column is preset to zero.

The loop then reads the first row from the #Files table, finds the "top 1" FEL with the smallest total, adds the FileSize to that total, and assigns that "top 1" FEL to the row in #Files. The next iteration does the same thing. Since the row we just updated for the "total" is no longer 0, it won't be the "top 1" smallest total anymore. It keeps doing that... always adding the file size to the smallest "total" and using that FEL as the assignment for the row in #Files.
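For anyone who wants to play with the idea outside of SQL Server, here is a hedged Python sketch of that same "smallest total wins" loop. A heap keeps the least-loaded FEL on top, and ties break on the lower FEL number, mirroring the ORDER BY Bytes ASC, FEL ASC in the T-SQL (Python and heapq are illustrative here, not part of the actual package):

```python
import heapq

def assign_fels(files, num_fels=4):
    """Greedy assignment: largest file first, always to the FEL
    with the smallest running total of bytes (ties -> lower FEL)."""
    heap = [(0, fel) for fel in range(num_fels)]   # (total_bytes, fel)
    heapq.heapify(heap)
    assignments = {}
    for name, size in sorted(files, key=lambda f: -f[1]):
        total, fel = heapq.heappop(heap)           # least-loaded FEL
        assignments[name] = fel
        heapq.heappush(heap, (total + size, fel))
    return assignments, sorted(heap)

# The "really nasty" data from above: five 2750s, seven 500s, two 250s.
files = list(zip("ABCDEFGHIJKLMN", [2750] * 5 + [500] * 7 + [250] * 2))
assignments, totals = assign_fels(files)
print(totals)   # -> [(4000, 2), (4000, 3), (4250, 1), (5500, 0)]
```

Those totals match the #Accumulator output above: one FEL has to eat the fifth 2750, and the others even out at 4000-4250.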

    --Jeff Moden



Just for the record, this is a form of the 'Bin Packing' problem; it's a classic AI and optimisation problem, and it's NP-hard in the general case.

    http://en.wikipedia.org/wiki/Bin_packing

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
You know, I have a different approach, and I think it was mentioned above. Why do we care which process handles a particular file, as long as they all get handled, and as fast as possible? It just so happens I know of an article or two here on SSC that cover this:

    Part 1

http://www.sqlservercentral.com/articles/Integration+Services+(SSIS)/70346/

    Part 2

http://www.sqlservercentral.com/articles/Integration+Services+(SSIS)/70347/

    CEWII
