Does SSIS determine the DefaultBufferMaxRows the right way?

  • During one of my projects transferring data from a Progress DB to MS SQL Server, I encountered the following situation:

    SSIS seems to calculate the average record size used to determine the ‘MaxRows per Buffer’ parameter based on the logical design of the source, not on the actual (physical) contents of the table.

    Especially when connecting to a Progress DB using the .NET ODBC provider, this seems to have a negative impact on SSIS data transfer performance. But it might also be the case when using other database systems as a data source.

    The issue is that the .NET ODBC provider recognizes some fields in the Progress DB as DT_TEXT, while I know, by looking at the values, that the fields actually hold strings that would fit in DT_STR(60).

    I would like to skip the discussion of the Progress DB and database design, and instead focus on the parameters ‘DefaultBufferMaxRows’ and ‘DefaultBufferSize’.

    In my experience, ‘DefaultBufferMaxRows’ and ‘DefaultBufferSize’ are perfect for tuning the transfer of data. However, as soon as SSIS thinks that no more rows fit into the defined ‘DefaultBufferSize’, ‘DefaultBufferMaxRows’ is overruled, while in fact there is no reason for this (except that SSIS thinks that, because of the logical design, fewer rows fit in the defined ‘DefaultBufferSize’).

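    To illustrate with made-up numbers (the byte counts below are purely hypothetical): SSIS appears to use the smaller of ‘DefaultBufferMaxRows’ and ‘DefaultBufferSize’ divided by the estimated row size, something like:

        DefaultBufferSize    = 10,485,760 bytes (10 MB)
        DefaultBufferMaxRows = 10,000

        Estimated row size of 500 bytes   -> 10,485,760 / 500   = ~20,971 rows, so the 10,000-row cap applies
        Estimated row size of 2,000 bytes -> 10,485,760 / 2,000 = ~5,242 rows, so ‘DefaultBufferMaxRows’ is overruled

    So an inflated estimate (DT_TEXT instead of DT_STR(60)) directly shrinks the number of rows per buffer.
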
    I understand that SSIS cannot predict the average physical record size from an unknown source. But it should at least trust the package designer, or offer an option to overrule SSIS and use the values given by the package designer.

    Can anybody tell me whether they have had the same or different experiences, and add to this?

    Kind regards,

    Dave

  • Hi Dave,

    You make some good points there...

    SSIS calculates how many rows to fit in the buffer based on the estimated row size. And the estimate is based on the metadata that you create in your package design. Where possible you should explicitly set the columns in your data flow to the smallest size they can use. In the case you mention, changing those DT_TEXT columns to DT_STR(60) will mean SSIS calculates a smaller row size, and therefore more rows will fit in the buffer.

    The general rule of thumb for tweaking the buffer is to "stuff" in as many rows as will fit! If your DefaultBufferSize is 10MB but 10k records only add up to 2MB, then you could increase DefaultBufferMaxRows to 50k records (2MB / 10k rows is roughly 200 bytes per row, and 10MB / 200 bytes is roughly 52k rows). If 10k records add up to 20MB, then either SSIS will reduce the number of rows, or you need to increase DefaultBufferSize to a size that can handle the 10k rows.

    Incidentally, SSIS does have an internal limit on the maximum buffer size (I think it's 100MB).

    Lastly, to help with your tuning you can turn logging on and capture the BufferSizeTuning events.

    Cheers

    Kindest Regards,

    Frank Bazan

  • Hi Frank,

    You mention the following:

    "SSIS calculates how many rows to fit in the buffer based on the estimated row size. And the estimate is based on the metadata that you create in your package design. "

    Do you know which metadata is used (source metadata, destination metadata, or a combination of both)?

    In the first case I might be able to use a CAST in the source query. In the second and third cases I'll have to do some more investigating...

    Thanks in advance,

    Kind regards,

    Dave

  • FYI

    As far as I have tested, it looks like only the source metadata is relevant.

    Kind regards,

    Dave

  • Hi Dave,

    It will be based on the metadata of the whole pipeline. So it doesn't matter what the external columns look like (they are only used by SSIS for validation at design time); it's the output columns on the component that determine how many rows can fit into the buffer. But note also that if you then bring more data into your pipeline from another source, i.e. adding more columns by deriving them or by looking up data in a reference table, this will contribute to the estimated row size too. If the row size becomes larger downstream, then SSIS throttles back the number of rows extracted from the source.

    PS: I don't know if your Progress DB provider allows it, but a lot of the data source providers in SSIS allow you to edit the output column properties directly using the Advanced Editor. If you can't do this, then casting in the SELECT statement is a good way to go too.
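
    For example, something along these lines (the table and column names here are invented for illustration; many Progress databases expose tables through the 'pub' schema, but check your own):

        -- Cast the wide text column down to the 60 characters it actually uses,
        -- so the provider reports a fixed-width string to SSIS.
        SELECT cust_num,
               CAST(comments AS VARCHAR(60)) AS comments
        FROM pub.customer

    Once the source component picks up the narrower column, SSIS should estimate a smaller row size and fit more rows per buffer.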

    Cheers

    Kindest Regards,

    Frank Bazan
