September 21, 2009 at 3:22 pm
Hello everyone,
I have a best-practice question about how to go about this. I need to query a SQL database that contains millions of records and process all of them. Instead of just querying the database and processing the records in one pass, I was thinking of querying the database and splitting the records into, say, 7 equal parts and processing each part at the same time. With 7 pieces running at once, the whole job should be faster.
I just made up 7; the number of pieces would need to be dynamic, so I may want to change it to 8 or 9 or whatever. However many pieces the data gets split into, all of those pieces would get processed asynchronously in SSIS. Splitting the data shouldn't be a problem, but I'm curious how I'd process each piece dynamically.
I'm not sure if the Script Component (as a source) can create outputs on the fly, but that was the idea that came to mind: query the data, split it into X equal parts, and create the number of outputs needed. I'm not sure whether that's a good way to do it, though.
Any best practices on doing this (high level)?
Thanks,
Strick
September 21, 2009 at 3:36 pm
I would say that you could do that, but the right number is largely experimental; there are too many variables to account for. Also, is there a good way to break the data down into more manageable chunks via your query? If not, you'll have to find one. SSIS does this very easily by default. If you have 7 sources going to 7 destinations in a single data flow (the easiest way to visualize this), then 5 of them will likely run at the same time because of the data flow property "EngineThreads", which defaults to 5 (I think that is the minimum value). I have done something similar, but it was with 21 tables of varying sizes.
I'm not sure I would call this asynchronous, but it is effectively multi-threaded.
CEWII
September 21, 2009 at 4:36 pm
It may be possible, but the only way I see it working would be for you to dynamically create data flow tasks based on your desired number of 'threads'. You wouldn't want to do the multi-threading within one data flow; I think you'd end up with memory errors and find yourself micro-managing the memory buffers within the single data flow.
Ideally, you'll want a separate data flow for each thread. This can be done dynamically, but I've not gone down this route, so I can't speak to how difficult it would be.
September 22, 2009 at 2:42 am
Or possibly (thinking a level up) a parent package that concurrently runs multiple occurrences of the same child package, with appropriate parameters.
I haven't tried that - and it doesn't sound easy! - but it might work.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
September 22, 2009 at 7:46 am
Phil Parkin (9/22/2009)
Or possibly (thinking a level up) a parent package that concurrently runs multiple occurrences of the same child package, with appropriate parameters. I haven't tried that - and it doesn't sound easy! - but it might work.
Having tried it... it would work, but if you can accomplish this in one package without resorting to that approach, I would strongly recommend it. The multi-package solution is more difficult to deploy and test, but it would work.
You can have 2 entirely separate data flows in a package executing simultaneously. I have included a picture to show it.
CEWII
September 22, 2009 at 8:17 am
John Rowan (9/21/2009)
It may be possible, but the only way I see it working would be for you to dynamically create data flow tasks based on your desired number of 'threads'. You wouldn't want to do the multi-threading within one data flow. I think you'll end up with memory errors and trying to micro-manage the memory buffers within the single data flow. Ideally, you'll want separate data flows for each thread. This can be done dynamically, but I've not gone down this route so I can't speak to how difficult it would be.
John,
I don't think it needs to be so complex. If you have 5 separate pumps in the data flow, it will try to run all five; if you have 7, you would need to change "EngineThreads" to 7. SSIS will do the threading on its own without any other magic. I have included a picture of a set of 5 pumps in a single data flow to illustrate.
Your biggest issue would be memory: the more pumps going at once, the more memory the process will use.
CEWII
September 22, 2009 at 8:43 am
Hi Elliott W,
I think your concept is close to what I'm looking to do. But a key piece of the project is that I won't necessarily know the number of data flow objects needed until run time. So say, for example, I have an SSIS user variable in which the user indicates how many pieces to break the data into, and the user selects 8. A row count is returned from a stored procedure, and that count would be divided by 8 to determine the size of each chunk. So if the count is 2 million, then 2 million divided by 8 is 250K.
So 8 identical data flows would get created. Using the keys (1 to 250000, 250001 to 500000, etc.), each data flow would query its own 250K records and process them. I'm not sure if SSIS can create the data flows on the fly, but I think ideally this would accomplish the asynchronous processing.
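As an editor's illustration of that arithmetic (the function name and the assumption of a contiguous integer key starting at 1 are mine, not from the thread), the range computation might look like:

```python
def chunk_ranges(total_rows, pieces):
    """Split keys 1..total_rows into `pieces` contiguous (start, end) ranges.

    The last range absorbs the remainder when the row count
    doesn't divide evenly.
    """
    size = total_rows // pieces
    ranges = []
    for i in range(pieces):
        start = i * size + 1
        end = total_rows if i == pieces - 1 else (i + 1) * size
        ranges.append((start, end))
    return ranges

# 2 million rows split 8 ways: 250K rows per data flow
print(chunk_ranges(2_000_000, 8)[:2])  # [(1, 250000), (250001, 500000)]
```

Each data flow's source query would then filter on its own (start, end) pair.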
Thanks,
Strick
September 22, 2009 at 8:46 am
Elliott W (9/22/2009)
Phil Parkin (9/22/2009)
Or possibly (thinking a level up) a parent package that concurrently runs multiple occurrences of the same child package, with appropriate parameters. I haven't tried that - and it doesn't sound easy! - but it might work.
Having tried it... it would work, but if you can accomplish this in one package without resorting to that approach, I would strongly recommend it. The multi-package solution is more difficult to deploy and test, but it would work.
You can have 2 entirely separate data flows in a package executing simultaneously. I have included a picture to show it.
CEWII
My (abstract) idea was for two packages only and addressed the "dynamic" requirement mentioned by the OP:
1) A child package which takes appropriate input parameters and executes just one thread.
2) A parent process which "decides" how many threads will be created, then creates the required number of "Execute Package" tasks to run the child package in parallel, with the requisite parameters.
Is it possible? No idea. Sounds like more effort than it's worth to me, but I thought I'd put it out there as an idea.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
September 22, 2009 at 8:48 am
Also, to add: this process would be running on a pretty beefy server. If I remember some of the specs correctly, it'll have at least 250GB of RAM and 4 high-end processors. I'll have to check the specs to be sure, though.
Strick
September 22, 2009 at 8:59 am
Elliott W (9/22/2009)
John Rowan (9/21/2009)
It may be possible, but the only way I see it working would be for you to dynamically create data flow tasks based on your desired number of 'threads'. You wouldn't want to do the multi-threading within one data flow. I think you'll end up with memory errors and trying to micro-manage the memory buffers within the single data flow. Ideally, you'll want separate data flows for each thread. This can be done dynamically, but I've not gone down this route so I can't speak to how difficult it would be.
John,
I don't think it needs to be so complex. If you have 5 separate pumps in the data flow, it will try to run all five; if you have 7, you would need to change "EngineThreads" to 7. SSIS will do the threading on its own without any other magic. I have included a picture of a set of 5 pumps in a single data flow to illustrate.
Your biggest issue would be memory: the more pumps going at once, the more memory the process will use.
CEWII
I'm with you, Elliott. Yes, you can thread multiple pumps inside a single data flow. Memory becomes the limiting factor. I typically use 1 pump per data flow because I've seen too many memory problems otherwise. It will work if configured properly.
Part of the requirement was that the number of threads to use would be dynamic, so there's a bit of a challenge there. From the OP:
I just made up 7, but the number of pieces would need to be dynamic. So I may want to change it to 8 or 9 or whatever. However many pieces it gets split into, all of those pieces would get processed asynchronously in SSIS. Splitting the data shouldn't be a problem, but I'm curious as to how I'd process each piece dynamically.
So in your example, you've set the data flow up with 5 pumps. Let's say the package is configured for 3. You can't just dynamically set the EngineThreads property. Each of the source components needs to be 'smart' enough to grab a certain subset of the data. How would you suggest this gets configured into the example?
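One way to make each pre-built source 'smart' (an editor's sketch, not an approach from the thread; the modulo predicate and the names are illustrative) is to drive every source query with a filter like WHERE Id % @ThreadCount = @PumpIndex, so pumps whose index is at or above the configured thread count match nothing. Modeled in Python:

```python
def rows_for_pump(keys, pump_index, thread_count):
    """Return the keys a pump would pull when its source query
    filters with: Id % thread_count == pump_index.

    Pumps whose index is >= thread_count match nothing, so the
    unused pumps simply process zero rows.
    """
    if pump_index >= thread_count:
        return []
    return [k for k in keys if k % thread_count == pump_index]

keys = list(range(100))                               # 100 rows
parts = [rows_for_pump(keys, i, 3) for i in range(5)]  # 5 pumps, 3 active
print([len(p) for p in parts])                         # [34, 33, 33, 0, 0]
```

Every row lands in exactly one pump's subset, and the split adapts to whatever thread count is configured without touching EngineThreads.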
September 22, 2009 at 9:41 am
stricknyn (9/22/2009)
Hi Elliott W, I think your concept is close to what I'm looking to do. But a key piece of the project is that I won't necessarily know the number of data flow objects needed until run time. So say, for example, I have an SSIS user variable in which the user indicates how many pieces to break the data into, and the user selects 8. A row count is returned from a stored procedure, and that count would be divided by 8 to determine the size of each chunk. So if the count is 2 million, then 2 million divided by 8 is 250K.
So 8 identical data flows would get created. Using the keys (1 to 250000, 250001 to 500000, etc.), each data flow would query its own 250K records and process them. I'm not sure if SSIS can create the data flows on the fly, but I think ideally this would accomplish the asynchronous processing.
Thanks,
Strick
Yikes, well, that is a bit more complex than I was thinking. However, if you set a maximum for the number, you could add, say, 10 data flows (if 10 were your max) and use expressions on the precedence constraints so that only 1-8 would run if they chose 8. That is probably the simplest approach.
The next simplest is probably a single child package that a master package instantiates a number of times. The master package would need to be built dynamically and then run, which you could probably do without too much trouble. I still like the max idea, but that's me.
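A toy model of that gating (an editor's sketch in Python, mimicking the effect of the precedence-constraint expressions rather than using actual SSIS expression syntax):

```python
MAX_FLOWS = 10  # data flows built into the package up front

def flows_to_run(execution_thread_count):
    """Flow i runs only when its precedence expression
    (i < ExecutionThreadCount) evaluates true; the rest sit idle."""
    return [i < execution_thread_count for i in range(MAX_FLOWS)]

print(flows_to_run(8))  # first 8 flows run, the last 2 are skipped
```

The package shape never changes; only the variable decides how many of the pre-built flows actually execute.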
CEWII
September 22, 2009 at 9:47 am
John Rowan (9/22/2009)
I'm with you, Elliott. Yes, you can thread multiple pumps inside a single data flow. Memory becomes the limiting factor. I typically use 1 pump per data flow because I've seen too many memory problems otherwise. It will work if configured properly. Part of the requirement was that the number of threads to use would be dynamic, so there's a bit of a challenge there. From the OP:
I just made up 7, but the number of pieces would need to be dynamic. So I may want to change it to 8 or 9 or whatever. However many pieces it gets split into, all of those pieces would get processed asynchronously in SSIS. Splitting the data shouldn't be a problem, but I'm curious as to how I'd process each piece dynamically.
So in your example, you've set the data flow up with 5 pumps. Let's say the package is configured for 3. You can't just dynamically set the EngineThreads property. Each of the source components needs to be 'smart' enough to grab a certain subset of the data. How would you suggest this gets configured into the example?
You could change "EngineThreads" at runtime to increase the number. Given his restated requirements, I think we are talking about multiple independent data flows, not pumps INSIDE a single data flow. Also, you are correct about memory; that is often a problem. If he sets an upper limit on the number of data flows, like 10 (or maybe 15), he could build a single package and set precedence using expressions to decide how many to run, and the others just wouldn't. I might build a package to POC this out. If I do, I'll post it to this thread.
CEWII
September 22, 2009 at 10:02 am
Even with multiple data flows, the challenge is still to have each data flow work on a distinct subset of the data. If you have 100 rows that need processing and you've configured the package to use 4 threads, you'd want each data flow to be able to identify and work off 25 rows. If the package is later changed to use 10 threads, each data flow should grab and process 10 of the rows.
So you need a step to segment the data and assign it to a data flow. This is the challenge. Rather than dynamically creating data flows, you'll most likely want to create 10-15 of them up front and use that as the upper limit of your thread configuration option. From there, you need a way to divide the total working set into X chunks and assign those chunks to the data flows.
I'm not a big fan of using staging tables in SSIS, but they may work well here. You could create a staging table for each data flow and use an Execute SQL task to populate the staging tables, so that each data flow just works off what exists inside its own staging table. That way, all of the 'intelligence' is inside the Execute SQL task. If that task does not assign any rows to a staging table, that particular data flow runs and processes zero rows.
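An editor's sketch of that segmentation step, modeled in Python (in T-SQL you might get a similar effect with NTILE, but that is an assumption, not something stated in the post):

```python
def assign_to_staging(row_keys, bucket_count):
    """Deal rows into bucket_count staging 'tables' as contiguous,
    near-equal chunks. Early buckets take the remainder; a bucket
    can come out empty, and its data flow then processes nothing.
    """
    base, extra = divmod(len(row_keys), bucket_count)
    buckets, start = [], 0
    for i in range(bucket_count):
        size = base + (1 if i < extra else 0)
        buckets.append(row_keys[start:start + size])
        start += size
    return buckets

# 100 rows, 4 active data flows: 25 rows per staging table
print([len(b) for b in assign_to_staging(list(range(1, 101)), 4)])
```

The data flows themselves stay dumb; only this assignment step knows how many chunks were requested.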
Just a thought......
September 22, 2009 at 10:39 am
OK, here is a really quick and dirty POC. How many data flows run depends on the variable "ExecutionThreadCount", which can be up to 5. I even added some extra variables so that the start and end IDs for each flow could be passed. You could fill those in at the placeholder, which is really important because it handles the precedence. Give it a look.
Take the TXT off the end, DTSX isn't normally an allowed type..
CEWII
September 22, 2009 at 11:59 am
Elliott W (9/22/2009)
John Rowan (9/22/2009)
I'm with you, Elliott. Yes, you can thread multiple pumps inside a single data flow. Memory becomes the limiting factor. I typically use 1 pump per data flow because I've seen too many memory problems otherwise. It will work if configured properly. Part of the requirement was that the number of threads to use would be dynamic, so there's a bit of a challenge there. From the OP:
I just made up 7, but the number of pieces would need to be dynamic. So I may want to change it to 8 or 9 or whatever. However many pieces it gets split into, all of those pieces would get processed asynchronously in SSIS. Splitting the data shouldn't be a problem, but I'm curious as to how I'd process each piece dynamically.
So in your example, you've set the data flow up with 5 pumps. Let's say the package is configured for 3. You can't just dynamically set the EngineThreads property. Each of the source components needs to be 'smart' enough to grab a certain subset of the data. How would you suggest this gets configured into the example?
You could change "EngineThreads" at runtime to increase the number. Given his restated requirements, I think we are talking about multiple independent data flows, not pumps INSIDE a single data flow. Also, you are correct about memory; that is often a problem. If he sets an upper limit on the number of data flows, like 10 (or maybe 15), he could build a single package and set precedence using expressions to decide how many to run, and the others just wouldn't. I might build a package to POC this out. If I do, I'll post it to this thread.
CEWII
Yeah, I may just set a max number of "chunks" to be processed at, say, 10 and tell the users to deal with it... lol. The "users" would be IT people anyway, so the number of users will be really small, if any. What I'll probably end up doing is building 10 data flows and, as one of the other posts in this thread suggested, sending each chunk to its own data flow. If fewer than 10 are chosen, the ones without data just won't process any records.
I'm working on a development box, so what I imagine will happen when it actually gets deployed is that we'd find a "sweet spot" while benchmarking on the production server; this would be the number of "chunks" it processes fastest. Then I'd just set it at that and leave it alone.
Strick