February 27, 2012 at 12:53 pm
First a little bit of background:
I have a list of 14 files (and growing) that need to be processed. The files are dropped to a shared folder on our network and I have to sweep them into our database daily. The files always have the same name.
I wrote an SSIS package that divides the file list into four pathways (numbered 0-3) and processes each path with a For Each Loop, in parallel. The split is done using a quantile function:
(RANK() over(PARTITION BY proj_name order by file_name) - 1) * 4 / COUNT(*) over(PARTITION BY proj_name) as quartile
Each FEL looks at a specific quartile value to process. So, the first FEL is WHERE quartile = 0, the second is WHERE quartile = 1, and so on.
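For context, here's a rough sketch of the kind of query each FEL's data source runs. The table name dbo.FileList is just a stand-in for my actual file-tracking table, and the project name normally comes from a package variable:
-- Rough sketch only: dbo.FileList is a stand-in name for the file-tracking table.
DECLARE @ProjectName varchar(50)
SET @ProjectName = 'MyProject' -- normally supplied by the package variable

SELECT q.file_name
FROM (
        SELECT file_name,
               (RANK() OVER(PARTITION BY proj_name ORDER BY file_name) - 1) * 4
                 / COUNT(*) OVER(PARTITION BY proj_name) AS quartile
        FROM dbo.FileList
        WHERE proj_name = @ProjectName
     ) q
WHERE q.quartile = 0 -- 0 for the first FEL, 1 for the second, and so on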
This works fine, but it's not really optimized for file size. Some of the files, particularly those earlier in the alphabet, are much larger than the others. This results in path 0 taking at least twice as long, if not more.
What I want to do, instead of splitting the file names into roughly even counts, is add a column to my table for average file size and use that to split the files into roughly even total sizes in KB. Example:
File Name   File Size (KB)   Original Split   Desired Split
A                    4,000                0               0
B                    2,000                0               1
C                    2,000                0               1
D                    2,000                0               2
E                    1,000                1               2
F                    1,000                1               2
G                    1,000                1               3
H                      500                1               3
I                      500                2               3
J                      500                2               3
K                      500                2               3
L                      500                3               3
M                      250                3               3
N                      250                3               3
I think this will end up being much more efficient, but at the moment, I can't think of a way to achieve this kind of split in T-SQL. Any help would be appreciated. Thanks in advance!
-Paul G.
February 27, 2012 at 1:01 pm
The included example might be just a little too easy since all of the file sizes are nice, even numbers. There are a couple of additional requirements to this:
1- There must be at least one file per path.
2- All files must be processed, so none can be left out.
February 27, 2012 at 2:41 pm
You will find that a lot more people are willing to volunteer their time and effort if you provide some sort of sample to work with in a nice, consumable format.
I put this together for you to show you what I mean.
create table #Files
(
FileName char(1),
FileSize int,
OriginalSplit tinyint,
DesiredSplit tinyint
)
insert #Files
select 'A', 4000, 0, 0 union all
select 'B', 2000, 0, 1 union all
select 'C', 2000, 0, 1 union all
select 'D', 2000, 0, 2 union all
select 'E', 1000, 1, 2 union all
select 'F', 1000, 1, 2 union all
select 'G', 1000, 1, 3 union all
select 'H', 500, 1, 3 union all
select 'I', 500, 2, 3 union all
select 'J', 500, 2, 3 union all
select 'K', 500, 2, 3 union all
select 'L', 500, 3, 3 union all
select 'M', 250, 3, 3 union all
select 'N', 250, 3, 3
Now everybody who looks at this can easily start to work on your problem instead of taking the 20 minutes I did to format this.
The answer to your problem lies in how accurate you need this to be. To be really accurate you will probably need to look into running totals. Jeff Moden has written a great article about that here: http://www.sqlservercentral.com/articles/T-SQL/68467/
The following is at least a little closer without adding all the extra work of running totals. It isn't mathematically exact, but given what you are describing I don't know if that is totally necessary.
declare @Split1 int, @Split2 int, @Split3 int

select @Split1 = (MAX(FileSize) - MIN(FileSize)) * .75,
       @Split2 = (MAX(FileSize) - MIN(FileSize)) * .5,
       @Split3 = (MAX(FileSize) - MIN(FileSize)) * .25
from #Files

select @Split1, @Split2, @Split3

select *,
       case when FileSize > @Split1 then 0
            else case when FileSize between @Split2 and @Split1 then 1
                      else case when FileSize between @Split3 and @Split2 then 2
                                else 3
                           end
                 end
       end as NewSplit
from #Files
This may only appear to be close to an acceptable middle ground because of the flatness of the sample data (as you suggested). It may, however, be close enough that your processes end up with reasonably similar sizes.
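For comparison, here's a rough sketch of the running-total idea mentioned above. It assumes SQL Server 2012 or later for the windowed running SUM, and it doesn't guarantee at least one file per bucket, but on your sample data it happens to land on your desired split:
;with RunningTotals as
(
    select FileName,
           FileSize,
           -- cumulative size, biggest files first
           RunningSize = sum(FileSize) over(order by FileSize desc, FileName
                                            rows unbounded preceding),
           TotalSize   = sum(FileSize) over()
    from #Files
)
select FileName,
       FileSize,
       -- integer math: which quarter of the grand total this file's cumulative size falls in
       NewSplit = (RunningSize - 1) * 4 / TotalSize
from RunningTotals
order by FileName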
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
February 27, 2012 at 7:50 pm
I guess my question would be, which is more important? To use all 4 FELs or to balance the load? If you have one file that is much larger than all of the rest, as in the following example...
FileName   FileSize   RunningTotal   FEL
A             20000          20000     0
B              2000           2000     1
C              2000           4000     1
D              2000           6000     1
E              1000           7000     1
F              1000           8000     1
G              1000           1000     2
H               500           1500     2
I               500           2000     2
J               500           2500     2
K               500           3000     2
L               500           3500     2
M               250           3750     2
N               250           4000     2
Then there's no need to even bother with the 4th FEL because the 2nd and 3rd FELs (8,000 KB and 4,000 KB of work) will finish the rest of the job before the first FEL can complete its one 20,000 KB file. There's actually no advantage to bringing the 4th FEL online for this type of mix.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 28, 2012 at 7:25 am
The example given was just that, an example. The package was designed partially as a proof of concept (since I couldn't find any examples of how to parallelize a FEL process, just people complaining that a FEL can't do it within itself), and partially with the intention of being a template usable across multiple projects. All a user has to do is change the project name assigned to a package variable.
As for the actual file sizes, I have 6 files that weigh in at less than 500KB. I have 1 at just over 600KB, 2 at 1000KB, 2 at 4000KB, and 1 at 13000KB. As you said, the fourth FEL could probably be switched off in this case. The first FEL could choke on the big file, both during the file copy process and the bulk insert. The second and third could likely bang out the two 4 meg files and all the smaller ones before the first one finishes, still leaving the fourth idle.
I'm almost entirely self-taught, so I appreciate the guidance. Thanks!
February 28, 2012 at 7:32 am
dembones79 (2/28/2012)
The example given was just that, an example.
No problem. I get that. I just wanted to know if an idle FEL or two would be OK if such a thing were to occur.
I'm on my way to work. I'll try to get back to this tonight. Sean was correct. A "Quirky Update" may be just what the doctor ordered.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 28, 2012 at 10:13 am
Yeah, that came off a bit ruder than I intended. I hadn't had my caffeine yet. I'm sorry.
Thanks again for the help. 🙂
February 28, 2012 at 6:10 pm
Yowch. I set up a spreadsheet to emulate the Quirky Update and I've had no joy getting something to work in a set-based fashion for all scenarios. This reminds me of Hugo Kornelis' "Bin Stacking" problem (it's nearly identical, in fact) where he stated that he hasn't been able to find a set-based method that works with all scenarios. Sure, you can come close, but all it takes is one special scenario and BOOM! For example...
Ref   FileName   FileSize
1     A              2750
2     B              2750
3     C              2750
4     D              2750
5     E              2750
6     F               500
7     G               500
8     H               500
9     I               500
10    J               500
11    K               500
12    L               500
13    M               250
14    N               250
I've got the funny feeling that Hugo is correct for these types of things. To do it accurately for every scenario, it's going to take a loop and an accumulator for each "BIN" (FEL in this case).
I was going to suggest using Sean's code but it takes a major beating (as does the Quirky Update) on the scenario above. Here's the data from above with Sean's code.
DROP TABLE #Files
GO
create table #Files
(
FileName char(1),
FileSize int,
OriginalSplit tinyint,
DesiredSplit tinyint
)
insert #Files
select 'A', 2750, 0, 0 union all
select 'B', 2750, 0, 1 union all
select 'C', 2750, 0, 1 union all
select 'D', 2750, 0, 2 union all
select 'E', 2750, 1, 2 union all
select 'F', 500, 1, 2 union all
select 'G', 500, 1, 3 union all
select 'H', 500, 1, 3 union all
select 'I', 500, 2, 3 union all
select 'J', 500, 2, 3 union all
select 'K', 500, 2, 3 union all
select 'L', 500, 3, 3 union all
select 'M', 250, 3, 3 union all
select 'N', 250, 3, 3
select filename,filesize from #files
declare @Split1 int, @Split2 int, @Split3 int

select @Split1 = (MAX(FileSize) - MIN(FileSize)) * .75,
       @Split2 = (MAX(FileSize) - MIN(FileSize)) * .5,
       @Split3 = (MAX(FileSize) - MIN(FileSize)) * .25
from #Files

select @Split1, @Split2, @Split3

select *,
       case when FileSize > @Split1 then 0
            else case when FileSize between @Split2 and @Split1 then 1
                      else case when FileSize between @Split3 and @Split2 then 2
                                else 3
                           end
                 end
       end as NewSplit
from #Files
That results in the following which doesn't quite work out right...
FileName FileSize OriginalSplit DesiredSplit NewSplit
-------- ----------- ------------- ------------ -----------
A 2750 0 0 0
B 2750 0 1 0
C 2750 0 1 0
D 2750 0 2 0
E 2750 1 2 0
F 500 1 2 3
G 500 1 3 3
H 500 1 3 3
I 500 2 3 3
J 500 2 3 3
K 500 2 3 3
L 500 3 3 3
M 250 3 3 3
N 250 3 3 3
--Jeff Moden
Change is inevitable... Change for the better is not.
February 28, 2012 at 7:15 pm
Ok... since we had to use a loop for this, I tried to make it as efficient as possible.
First, the really nasty test data I posted in my previous post just to keep it all together. Thank you Sean for putting together the first bit of test data generation code.
--===== Conditionally drop Temp Table(s) to make reruns in SSMS easier
IF OBJECT_ID('tempdb..#Files','U') IS NOT NULL
DROP TABLE #Files
;
GO
CREATE TABLE #Files
(
RowNum INT IDENTITY(1,1),
FileName CHAR(1) PRIMARY KEY NONCLUSTERED,
FileSize BIGINT,
FEL TINYINT
)
;
INSERT #Files
(FileName, FileSize)
SELECT 'A', 2750 UNION ALL
SELECT 'B', 2750 UNION ALL
SELECT 'C', 2750 UNION ALL
SELECT 'D', 2750 UNION ALL
SELECT 'E', 2750 UNION ALL
SELECT 'F', 500 UNION ALL
SELECT 'G', 500 UNION ALL
SELECT 'H', 500 UNION ALL
SELECT 'I', 500 UNION ALL
SELECT 'J', 500 UNION ALL
SELECT 'K', 500 UNION ALL
SELECT 'L', 500 UNION ALL
SELECT 'M', 250 UNION ALL
SELECT 'N', 250
;
Now the (ugh!) solution. As always, the details of the (haaaaccckkk...) loop code (paaaattooooiiii!) are in the comments of the code. I really hated to use a loop and if someone can come up with a guaranteed accurate set-based way of doing this with an assignable number of "Bins" (FELs, in this case), you'd not only make my day but I think you'd make Hugo's year!
I'd settle for a nice recursive CTE but remember that things with Triangular Joins aren't what I call "Set-Based" because they "touch" the same rows a whole lot more than once.
--========================================================================================
-- Presets (made all the presets compatible with all versions of SQL Server)
--========================================================================================
--===== Conditionally drop Temp Table(s) to make reruns in SSMS easier
IF OBJECT_ID('tempdb..#Accumulator','U') IS NOT NULL
DROP TABLE #Accumulator
;
--===== Suppress the auto-display of rowcounts because we're going to use a loop.
-- This will actually help the loop run faster.
SET NOCOUNT ON
;
--===== If anything in the loop goes wrong, this will force a "quit" and do a rollback.
-- You'll see why we use a transaction in a bit.
SET XACT_ABORT ON
;
--===== Declare some obviously named variables
DECLARE @CurrentFile INT, --Current RowNum we're working on in the #Files table
@Fel INT, --This could be a parameter for a stored procedure
@Files INT, --Total count of rows in the #Files table
@FileSize INT, --File size from the current row in the #Files table
@TgtFel INT --The FEL in the accumulator table with the least # of bytes.
;
--===== Preset the variables
SELECT @Fel = 4,
@Files = MAX(RowNum),
@CurrentFile = 1
FROM #Files
;
--===== Create the accumulator table where we'll keep track
-- of the total bytes assigned to each FEL.
SELECT TOP (@Fel)
FEL = IDENTITY(INT,0,1), Bytes = CAST(0 AS BIGINT)
INTO #Accumulator
FROM sys.all_columns ac1 --Use dbo.syscolumns for SQL Server 2000 or earlier
;
--========================================================================================
-- All set. Make the FEL assignments for each file in the #Files table.
--========================================================================================
BEGIN TRANSACTION --This will add a bit more speed to the loop because it will commit
--all of the rows we change all at once instead of individually.
WHILE @CurrentFile <= @Files
BEGIN
--===== Get the file size from the current file row.
SELECT @FileSize = FileSize
FROM #Files
WHERE RowNum = @CurrentFile
;
--===== Find the FEL with the least number of bytes assigned
SELECT TOP 1
@TgtFel = FEL
FROM #Accumulator
ORDER BY Bytes ASC, FEL ASC
;
--===== Add the bytes to the FEL we just found in the accumulator.
UPDATE #Accumulator
SET Bytes = Bytes + @Filesize
WHERE FEL = @TgtFel
;
--===== Assign the FEL we just found to the file row.
UPDATE #Files
SET FEL = @TgtFel
WHERE RowNum = @CurrentFile
;
--===== Get ready to read the next file row
SELECT @CurrentFile = @CurrentFile + 1
;
END
;
COMMIT
;
--========================================================================================
-- Display the results
--========================================================================================
SELECT * FROM #Accumulator;
SELECT * FROM #Files;
Here are the results from the run above.
FEL Bytes
-------------------- --------------------
0 5500
1 4250
2 4000
3 4000
RowNum FileName FileSize FEL
----------- -------- -------------------- ----
1 A 2750 0
2 B 2750 1
3 C 2750 2
4 D 2750 3
5 E 2750 0
6 F 500 1
7 G 500 2
8 H 500 3
9 I 500 1
10 J 500 2
11 K 500 3
12 L 500 1
13 M 250 2
14 N 250 3
--Jeff Moden
Change is inevitable... Change for the better is not.
February 28, 2012 at 7:34 pm
Here's the result using the original data that Sean used. It's a bit different from the originally posted desired result but it has the same effect... all of the FELs are assigned the same number of bytes to work with.
FEL Bytes
----------- --------------------
0 4000
1 4000
2 4000
3 4000
RowNum FileName FileSize FEL
----------- -------- -------------------- ----
1 A 4000 0
2 B 2000 1
3 C 2000 2
4 D 2000 3
5 E 1000 1
6 F 1000 2
7 G 1000 3
8 H 500 1
9 I 500 2
10 J 500 3
11 K 500 1
12 L 500 2
13 M 250 3
14 N 250 3
--Jeff Moden
Change is inevitable... Change for the better is not.
February 29, 2012 at 7:21 am
I too suspected this wasn't going to end up set-based because of the nature of the whole thing. I sort of alluded to that in my post: the code I wrote was reasonably close on the pretty flat example data, but I had a feeling that on real data it wouldn't be any better than what the OP was doing originally.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
March 1, 2012 at 7:03 am
Wow! This is awesome. It'll take me a bit to process this. 🙂 Thanks again for all of the help!
March 1, 2012 at 2:32 pm
dembones79 (3/1/2012)
Wow! This is awesome. It'll take me a bit to process this. 🙂 Thanks again for all of the help!
In retrospect, it's actually pretty easy. The most important part is that the files have to be sorted by the FileSize in descending order and the "RowNum" column must reflect that sort order.
After that, a simple table containing 1 entry for each FEL is created and the "Total Filesize" column is preset to zero.
The loop then reads the first row from the #Files table, finds the "top 1" FEL with the smallest total, adds the FileSize to that "total", and assigns that "top 1" FEL to the row in #Files. The next iteration does the same thing. Since the row we just updated for the "total" is no longer 0, it won't be the "top 1" with the smallest total anymore. It keeps doing that... always adding the FileSize to the smallest "total" and using that FEL as the assignment for the row in #Files.
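If it helps, here's a rough sketch of how #Files might be loaded from a real file-tracking table so that the IDENTITY column RowNum reflects that descending-size order. The table and column names (dbo.FileList, proj_name, file_name, avg_size_kb) are only stand-ins, and the FileName column in #Files would need to be widened to hold real file names:
--===== Rough sketch only: dbo.FileList and its columns are stand-in names.
     -- The ORDER BY on an INSERT...SELECT guarantees that the IDENTITY values
     -- in RowNum are assigned in descending FileSize order, which the loop depends on.
 INSERT #Files
        (FileName, FileSize)
 SELECT file_name, avg_size_kb
   FROM dbo.FileList
  WHERE proj_name = 'MyProject'
  ORDER BY avg_size_kb DESC, file_name
;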
--Jeff Moden
Change is inevitable... Change for the better is not.
March 1, 2012 at 2:38 pm
Just for the record, this is a form of the 'Bin Packing' problem; it's a classic AI and optimisation problem, and it's NP-hard in the general case.
http://en.wikipedia.org/wiki/Bin_packing
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
March 1, 2012 at 2:45 pm
You know, I have a different take, and I think it was mentioned above: why do we care which process handles a particular file, as long as they all get handled, and as fast as possible? It just so happens I know of an article or two here on SSC that cover this:
Part 1
http://www.sqlservercentral.com/articles/Integration+Services+(SSIS)/70346/
Part 2
http://www.sqlservercentral.com/articles/Integration+Services+(SSIS)/70347/
CEWII