May 1, 2009 at 4:10 am
Hi all,
I've picked up a task that involves processing a complex numeric calculation for a large number of records (c. 1bn rows), involving several stages of calculation. Apologies for being vague, but it's a confidential project.
I'm happy enough with writing stored procedures / UDFs to do the work. But, I'm not sure if my planned task management approach is a good idea or not.
I'm planning to split the job into batches to avoid overloading the server. I'd like to process each batch by creating a SQL Agent job per batch, then executing each in turn.
Can I set the jobs to run sequentially, or will I need to create a master job to run them in turn?
I've got a dedicated server running SQL 2K5, so don't need to worry about other processes.
I'd be interested in what people think of this as a strategy, or advice for alternative approaches.
May 1, 2009 at 6:26 am
Hi
Splitting the work into batches is a good approach! I would handle it in one step with a WHILE loop (this is not RBAR 😉 ). Take your first batch (e.g. 100,000 rows), do your calculations, then take the next batch... I don't think several steps bring an advantage, just maintenance overhead.
Another approach could be several jobs (not steps). This also brings maintenance overhead, but may scale better; it depends on the specific business case.
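For illustration, a minimal sketch of the pattern I mean (table, column and key names are just placeholders):

```sql
-- Minimal sketch of batched processing in a WHILE loop.
-- dbo.SourceData, dbo.Results and the calculation are placeholders.
DECLARE @BatchSize INT;
DECLARE @MinId BIGINT;
DECLARE @MaxId BIGINT;

SET @BatchSize = 100000;
SELECT @MinId = MIN(Id), @MaxId = MAX(Id) FROM dbo.SourceData;

WHILE @MinId <= @MaxId
BEGIN
    BEGIN TRANSACTION;

    INSERT INTO dbo.Results (Id, CalcValue)
    SELECT Id, SomeColumn * 1.1          -- stand-in for the real calculation
    FROM   dbo.SourceData
    WHERE  Id >= @MinId
      AND  Id <  @MinId + @BatchSize;

    COMMIT TRANSACTION;

    SET @MinId = @MinId + @BatchSize;
END;
```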
Greets
Flo
May 1, 2009 at 6:40 am
Hi Flo, thanks for the reply.
I initially thought about a simple WHILE loop structure as you've suggested, but I think that using the SQL Agent gives me a couple of advantages:
I can pause execution at any time to see what has been happening.
A failed batch is just a failed job, which won't crash my entire process. I then have a very transparent way of understanding which batch has failed, without having to implement a lot of error logging and handling.
I'll still be using a while loop to create the individual jobs (one job per batch, not one step per batch) and add them to the queue.
Can I set the jobs up to execute sequentially, do you know?
And no RBAR I promise 😉
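Roughly, the job-creation loop I have in mind looks like this (the procedure name, database name and job names are just placeholders):

```sql
-- Sketch: create one Agent job per batch inside a WHILE loop.
-- dbo.usp_ProcessBatch, the database and the job names are placeholders.
DECLARE @Batch INT, @Batches INT;
DECLARE @JobName SYSNAME;
DECLARE @Cmd NVARCHAR(MAX);

SET @Batch = 1;
SET @Batches = 100;

WHILE @Batch <= @Batches
BEGIN
    SET @JobName = N'Calc batch ' + CAST(@Batch AS NVARCHAR(10));
    SET @Cmd = N'EXEC dbo.usp_ProcessBatch @BatchNo = ' + CAST(@Batch AS NVARCHAR(10)) + N';';

    EXEC msdb.dbo.sp_add_job       @job_name = @JobName;
    EXEC msdb.dbo.sp_add_jobstep   @job_name = @JobName,
                                   @step_name = N'Run batch',
                                   @subsystem = N'TSQL',
                                   @command = @Cmd,
                                   @database_name = N'MyCalcDb';
    EXEC msdb.dbo.sp_add_jobserver @job_name = @JobName, @server_name = N'(local)';

    SET @Batch = @Batch + 1;
END;

-- Each job can then be started with msdb.dbo.sp_start_job,
-- or chained so that the final step of one job starts the next.
```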
Thanks
May 12, 2009 at 3:26 am
Yes, you can create the jobs sequentially and run them as your requirements dictate. If your logic fires off multiple inserts/updates in bulk batches, then I suggest you go with SSIS packages. This will give you better control over your overall process.
I also suggest you log activity to one single table, so you can see the status of each batch step during execution.
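Something along these lines, for example (table and column names are only suggestions):

```sql
-- Sketch of a simple status table for batch tracking; names are only suggestions.
CREATE TABLE dbo.BatchLog
(
    BatchNo      INT            NOT NULL,
    StepName     SYSNAME        NOT NULL,
    Status       VARCHAR(20)    NOT NULL,   -- 'Started' / 'Succeeded' / 'Failed'
    RowsAffected BIGINT         NULL,
    LoggedAt     DATETIME       NOT NULL DEFAULT (GETDATE()),
    ErrorMessage NVARCHAR(4000) NULL
);
GO

-- Inside each batch step:
DECLARE @BatchNo INT, @Rows BIGINT;
SET @BatchNo = 1;

INSERT INTO dbo.BatchLog (BatchNo, StepName, Status)
VALUES (@BatchNo, N'Calculation stage 1', 'Started');

-- ... run the calculation here, then capture the row count immediately:
-- SET @Rows = @@ROWCOUNT;

INSERT INTO dbo.BatchLog (BatchNo, StepName, Status, RowsAffected)
VALUES (@BatchNo, N'Calculation stage 1', 'Succeeded', @Rows);
```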
May 13, 2009 at 3:31 am
You mention several stages of calculation.
a) Is this several stages, all of which apply to a single record?
b) Or is it several stages, where each stage applies to a subset of records, and all records in the subset have to be updated before the next stage can be run?
If the answer is a), then you will get best performance by writing the calculation as a CLR routine. This could be at least 1 order of magnitude faster than writing it in T-SQL. This is because managed code is far more performant at doing calculations (and string manipulation) than SQL Server.
If the answer is b), a series of CLR routines may still be the most performant solution.
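For reference, the T-SQL side of wiring up a CLR routine is small; a sketch (the assembly path, class and method names below are purely illustrative):

```sql
-- Sketch only: enabling CLR and binding a hypothetical .NET method as a scalar UDF.
-- The assembly path, class and method names are placeholders, not a real library.
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO

CREATE ASSEMBLY CalcRoutines
FROM 'C:\Deploy\CalcRoutines.dll'
WITH PERMISSION_SET = SAFE;
GO

CREATE FUNCTION dbo.clr_Calculate (@Input1 FLOAT, @Input2 FLOAT)
RETURNS FLOAT
AS EXTERNAL NAME CalcRoutines.[CalcRoutines.Calculations].Calculate;
GO

-- The function can then be applied set-based:
-- SELECT Id, dbo.clr_Calculate(ColA, ColB) FROM dbo.SourceData;
```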
With 1bn records you have already made a considerable investment in IT infrastructure. If you are not sure whether T-SQL or CLR would be best, the cost of making a prototype of each solution would be very small compared to the run costs of taking the wrong approach.
When you get to implementation, you need to look at the impact of doing a 1bn row update.
If you have applied the calculations once, will you get a different result the next time the calculation is run?
If the answer is no, then look to see if you can filter out the rows that do not need to be updated with a WHERE clause, as this will significantly reduce run time by limiting the records read from disk and/or passed to the application. Even where a WHERE clause is possible, make sure your application only issues an UPDATE for the rows that have actually changed. This will significantly reduce your log file usage. You then need to assess how many rows will be changed on an average run to decide if this can be done within the scope of a single transaction or should be broken into multiple transactions.
If you have to update all 1bn rows every time, then you really need to look at doing this in multiple transactions. This will have a significant impact on your peak log file use, and will limit the impact of a rollback if a transaction fails.
You should look at the structure of your data to decide how you form the transactions. For example, all records where a key field starts with the letter 'A' could be in one transaction, 'B' in another, etc. You will have to work out what is best for your application.
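A minimal sketch of that pattern, with placeholder table and column names:

```sql
-- Sketch: one transaction per key range, updating only rows that would actually change.
-- Table, column and key-range values are placeholders.
BEGIN TRANSACTION;

UPDATE b
SET    b.CalcResult = n.NewValue
FROM   dbo.BigTable  AS b
JOIN   dbo.NewValues AS n ON n.Id = b.Id
WHERE  b.KeyField LIKE 'A%'                                   -- this transaction covers the 'A' range
  AND  (b.CalcResult <> n.NewValue OR b.CalcResult IS NULL);  -- skip rows that would not change

COMMIT TRANSACTION;

-- Repeat with 'B%', 'C%' and so on, each in its own transaction.
```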
Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.
When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara
May 13, 2009 at 3:50 am
Hi Ed,
Interesting. I'm in situation b.
I'd thought about CLR as a possible solution, but being honest, my VB/C# isn't up to the job. So, I'm prepared to take a performance hit to work in a language I know (fairly) well.
My calculation involves 2 stages: an on-load pre-aggregation to create summarised meta-data, then a calculation based on input variables provided via a simple GUI.
All calculations are numeric, but there is an element of row-specific logic involved in deciding which calculation to apply to a particular row (just to be clear, I'm not processing row-by-row).
For the calculation, sometimes I'll need to process all of the data, sometimes just a subset. The processes flow into each other and there is a dependency chain.
My transaction wrapping is going to be fairly simple I think, as there are no updates - all outputs are SELECT...INTO, and these will then be pulled together to produce a summary output.
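To give a feel for the shape of it (all names below are placeholders, not the real tables):

```sql
-- Rough shape of each stage: results land in a new table via SELECT...INTO,
-- then feed the next stage. All table and column names are placeholders.
SELECT s.GroupKey,
       SUM(s.Amount) AS TotalAmount,
       COUNT_BIG(*)  AS RowCnt
INTO   dbo.Stage1_Summary
FROM   dbo.SourceData AS s
GROUP BY s.GroupKey;

DECLARE @InputFactor FLOAT;
SET @InputFactor = 1.05;    -- stand-in for a parameter supplied via the GUI

SELECT a.GroupKey,
       a.TotalAmount * @InputFactor AS AdjustedAmount
INTO   dbo.Stage2_Result
FROM   dbo.Stage1_Summary AS a;
```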
If I use SQL Agent jobs to manage the flow, does this have implications for performance, e.g. additional logging overhead or suchlike?
Is my original supposition that using SQL Agent gives me better control valid? Or do you think I'm barking up the wrong tree and creating overhead I don't need?
Thanks, Iain
May 13, 2009 at 4:05 am
IMHO you should do a planning piece and present this to management before starting work. If your table sizes were trivial this would not be necessary, but with 1bn rows the implications of choosing the wrong option are greatly increased run times, and potentially increased log space and/or tempdb requirements.
It should not be up to you as a developer to decide to commit the organisation to supporting the costs of design decisions, that is the job of management.
If you do the planning work, tell management what you are doing and why. If they have any degree of competence this will improve their respect for you. It is worth considering the implications of doing the work in T-SQL and CLR, and how it will get executed. It could be run as a T-SQL call in an Agent job, or wrapped in an SSIS routine. It may even be possible to do all the calculations in SSIS.
Management can then decide if they want to proceed with a method that uses the skills already in-house, with the potential for longer run times and higher resource requirements, or to find (or train) the skills needed to do the job optimally.
Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.
When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara
May 13, 2009 at 4:22 am
Thanks Ed, am underway with a planning memo, detailing all of the assumptions, limitations and issues I can come up with.
Have been burnt before on that one... 🙂
I've also been managing expectations around run times. Unusually, there isn't an expectation that this will run quickly (but I'd like to make it as fast as possible of course). It's also going to run on a dedicated server, so no other processes to get in the way (or in the way of).
Storage isn't an issue, even at the data volumes we're talking about. But even so, keeping log writes to a minimum would seem sensible.
Is there much difference between SSIS and T-SQL procedures when it comes to logging? Will I get better performance from one or the other, or do they do the same thing?
Are there any other strategies I can employ to reduce the number of log writes made (or the hit I take from them)?
Thanks, Iain
May 13, 2009 at 4:29 am
If you want to reduce the logging and make Delete/Update/Insert statements faster, then I think making smaller batches is a good idea.
You can take a chunk of the desired records and then divide it into smaller batches. Execute those batches in a WHILE loop until the end condition is met.
Committing smaller batch transactions gives you better performance than one long batch.
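For example, a sketch of a batched delete (table name, filter and batch size are placeholders):

```sql
-- Sketch of committing in small batches with DELETE TOP (n); names are placeholders.
DECLARE @BatchSize INT, @Rows INT;
SET @BatchSize = 10000;
SET @Rows = 1;

WHILE @Rows > 0
BEGIN
    BEGIN TRANSACTION;

    DELETE TOP (@BatchSize)
    FROM dbo.WorkTable
    WHERE Processed = 1;

    SET @Rows = @@ROWCOUNT;

    COMMIT TRANSACTION;
END;
```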
May 13, 2009 at 4:34 am
Hi ajitgadge,
Batch sizing is another of my vexing questions at the minute. Is there a way other than trial and error to decide the optimum batch size?
Thanks, Iain
May 13, 2009 at 5:02 am
Deciding batch size is often partly based on facts and partly on personal preference. The facts give the lower limit on transaction size, preference often governs the upper limit.
Each time you do a COMMIT you spend time doing something other than the UPDATE, INSERT or DELETE. I have not measured how long a commit takes compared to a typical I/U/D, but let's guess at the cost of 10 I/U/D operations.
A bit of maths shows that when you do 1 commit per 10,000 I/U/D statements, the percentage of time spent in commit gets very low (at a cost of 10 operations per commit, that is roughly 0.1% overhead), so a batch size of 10,000 is probably the minimum you should have.
Working out an upper limit is where preference comes in. If you can work out how many DB changes occur to your table in a minute, you can start to work out how much log space this will use. You can then decide how much log space you want to devote to a transaction. There are 2 factors for this - log space use and rollback time.
Remember that SQL will reserve at least double the log space needed for the changes, so it has space for a rollback. So, if you are happy using 10GB log space, work out how many updates will fit in 5 GB log space and you have a potential upper limit for transaction size.
The next aspect is rollback time. Rollback takes at least the same amount of time as the original change, and can take up to double the time. If you know that (say) 10,000 changes to your table will be processed in 1 minute and you are happy with a rollback time of up to 10 minutes, then you can run for about 7 minutes and still expect to meet your rollback target if the transaction fails. This means you can have a potential upper size of about 70,000 changes per transaction.
Finally, take the lower of the log-size limit and the rollback limit and you have your upper limit for transaction size. If this is a big number, round it to the nearest 1,000 or 10,000 as appropriate.
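To put illustrative numbers on that (all figures below are guesses, not measurements):

```sql
-- Worked example of the sizing arithmetic above; all figures are illustrative guesses.
DECLARE @ChangesPerMinute   BIGINT;
DECLARE @LogBytesPerChange  BIGINT;
DECLARE @LogSpaceBudget     BIGINT;
DECLARE @RollbackTargetMins INT;
DECLARE @LogLimit           BIGINT;
DECLARE @RollbackLimit      BIGINT;

SET @ChangesPerMinute   = 10000;                                    -- measured throughput
SET @LogBytesPerChange  = 1000;                                     -- rough log cost per change
SET @LogSpaceBudget     = CAST(10 AS BIGINT) * 1024 * 1024 * 1024;  -- 10 GB of log you are happy to use
SET @RollbackTargetMins = 10;

-- Log limit: only half the budget is usable, the rest is reserved for rollback.
SET @LogLimit = (@LogSpaceBudget / 2) / @LogBytesPerChange;

-- Rollback limit: run for ~70% of the rollback target (7 of 10 minutes).
SET @RollbackLimit = @ChangesPerMinute * @RollbackTargetMins * 7 / 10;

-- Transaction size = the lower of the two limits, rounded as appropriate.
SELECT UpperLimit = CASE WHEN @LogLimit < @RollbackLimit
                         THEN @LogLimit ELSE @RollbackLimit END;
```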
Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.
When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara
May 13, 2009 at 5:28 am
Hi ,
That all depends upon your transactions and the write speed of the transaction log. T-log writes are the most important factor when you are processing bulk records.
What type of transactions are you doing? Delete/Update/Insert?