Slowly Changing dimension inside a ForEach Loop container

  • Hi there folks,

    Sorry for the lengthy topic but this is my first post - as you can tell!

    I am having memory issues with my SSIS package.

    I am running SQL Server 2005 Standard Edition on a Windows Server 2003 SP2 Enterprise Edition machine (Intel Xeon 5160 @ 3.00GHz, 4GB RAM) - yes, I am aware how obsolete and old the machine is!

    Anyway, I have over 4 million records I need to load from a table that has been capturing weekly data from a HR system into a data warehouse.

    The SSIS I have created goes through a ForEach Loop where it picks up one employee at a time (there are around 96,000 employees). There is another ForEach Loop that picks up the extract date. Inside the second ForEach Loop there is a Data Flow Task that has an OLE DB Source. I have used a SQL Command and I perform all the look ups in the SQL command. At this stage I am not concerned about null values from the lookup as I have done all this prior to loading the data. Therefore, I know that the SQL command with the lookups will return valid values. After the OLE DB source I have an SCD. I need to keep history for most of the columns in the table that I am loading. I am using 2 business keys in my SCD. The SCD works fine and does what it has to do.

    The problem arises after some time. Normally at around 1 to 1.5 hrs the SSIS package fails with the following error (values differ from execution to execution):

    Microsoft (R) SQL Server Execute Package Utility

    Version 9.00.5000.00 for 32-bit

    Copyright (C) Microsoft Corp 1984-2005. All rights reserved.

    Started: 3:19:00 PM

    Error: 2014-08-15 16:24:50.91

    Code: 0xC0047012

    Source: EmployeeFact with SCD DTS.Pipeline

    Description: A buffer failed while allocating 9109056 bytes.

    End Error

    Error: 2014-08-15 16:24:50.91

    Code: 0xC0047011

    Source: EmployeeFact with SCD DTS.Pipeline

    Description: The system reports 86 percent memory load. There are 4293980160 bytes of physical memory with 583577600 bytes free. There are 2147352576 bytes of virtual memory with 240123904 bytes free. The paging file has 9486200832 bytes with 4991451136 bytes free.

    End Error

    Error: 2014-08-15 16:24:50.91

    Code: 0xC0047056

    Source: EmployeeFact with SCD

    Description: The Data Flow task failed to create a buffer to call PrimeOutput for output "OLE DB Source" (1) on component "OLE DB Source Output" (11). This error usually occurs due to an out-of-memory condition.

    End Error

    Error: 2014-08-15 16:24:50.91

    Code: 0xC0047021

    Source: EmployeeFact with SCD

    Description: SSIS Error Code DTS_E_THREADFAILED. Thread "SourceThread0" has exited with error code 0x8007000E. There may be error messages posted before this with more information on why the thread has exited.

    End Error

    Error: 2014-08-15 16:24:50.91

    Code: 0xC0047039

    Source: EmployeeFact with SCD

    Description: SSIS Error Code DTS_E_THREADCANCELLED. Thread "WorkThread0" received a shutdown signal and is terminating. The user requested a shutdown, or an error in another thread is causing the pipeline to shutdown. There may be error messages posted before this with more information on why the thread was cancelled.

    End Error

    Error: 2014-08-15 16:24:50.91

    Code: 0xC0047039

    Source: EmployeeFact with SCD

    Description: SSIS Error Code DTS_E_THREADCANCELLED. Thread "WorkThread1" received a shutdown signal and is terminating. The user requested a shutdown, or an error in another thread is causing the pipeline to shutdown. There may be error messages posted before this with more information on why the thread was cancelled.

    End Error

    Error: 2014-08-15 16:24:50.91

    Code: 0xC0047021

    Source: EmployeeFact with SCD

    Description: SSIS Error Code DTS_E_THREADFAILED. Thread "WorkThread0" has exited with error code 0xC0047039. There may be error messages posted before this with more information on why the thread has exited.

    End Error

    Error: 2014-08-15 16:24:50.91

    Code: 0xC0047021

    Source: EmployeeFact with SCD

    Description: SSIS Error Code DTS_E_THREADFAILED. Thread "WorkThread1" has exited with error code 0xC0047039. There may be error messages posted before this with more information on why the thread has exited.

    End Error

    DTExec: The package execution returned DTSER_FAILURE (1).

    Started: 3:19:00 PM

    Finished: 4:24:52 PM

    Elapsed: 3952 seconds

    I monitor the server at all times when I am running this package and the CPU usage is always below 50%. I don't really have good counters to measure the memory usage but when I do a Window Task Manager I see the PF usage is hovering around 4 and 4.1 GB.

    I also have played with the DefaultBufferMaxRows and DefaultBufferSize but I found that if I kept at the default values (10000 for DefaultBufferMaxRows and 10485760 for DefaultBufferSize) the package runs for longer than if I increase any of those values.

    I am going nuts trying to find out what could be the reason for this memory failure? Is it the way the package is configured or is it the server (hardware)?

    Please any input would be much appreciated!

    Thanks

    Romina

  • 96,000 times round a Foreach loop sounds like a performance-killer to me.

    It would be useful background to understand why you have decided to do things this way rather than using a set-based approach.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Hi Phil,

    Thank you for your reply.

    I am not sure I get you when you say a set based approach.

    I was just really trying to minimize the number of records entering the SCD with a ForEach Loop. If I just go ahead and don't use the ForEach loop my SSIS package will take hours to complete because of the SCD.

    Thanks

    Romina

  • The SCD transformation can perform badly when compared against other approaches.

    What type of SCD are you using?

    When I refer to a set-based approach, I am referring to performing actions on groups ('sets') of rows concurrently, rather than using any sort of looping. This is usually the standard and optimal way of working against data which is held in a SQL Server database.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • I have an article on how to implement SCD in SSIS using different methods and I compare them against each other.

    Four Methods for Implementing a Slowly Changing Dimension in Integration Services – Part II[/url]

    Giving that you work with SSIS 2005, I would go with the last method: custom implementation (with custom I mean build it yourself using standard components in SSIS, instead of using 3rd party components).

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Phil,

    I am using the SCD wizard from SSIS 2005. Mostly Type 2 changes with very few Type 1.

    What I am trying to achieve is retrieve all 4 million records from the old system to the new and the only way to group them is through the EmployeeID and the ExtractDate (a Datetime field that reflects when the extract was made from the old system to the new).

    The data extract is done via a stored procedure run every week and literally "dumped" into a huge table that now has over 4 million records. Records for each employee for each date the extract was made. There are around 96,000 employees.

    Initially the records will be mostly new but there are a lot of duplicated records that need to be cleaned out before loading into the new datawarehouse system. For example, for a particular employee there might be 141 records in this "huge dumping table" but when I perform the SCD, only 14 records make it to the DW after getting rid of duplicates and updating historical records.

    Working with an employee at a time is fine, but with 96,000 employees the SCD simply takes way too long, and this is what I was trying to avoid with the ForEach loop.

    I guess I will have to try Koen's approach and build a custom SCD.

    Just one last question Phill, how much is the hardware (the server configuration) affecting performance? Or is it purely my SSIS that needs to be configured better?

    Thanks again. I appreciate your help with this!

  • Koen, thank you for this. Very helpful!

    In the custom implementation example, is that a Merge or a Merge Join?

    If it's simply a Merge, can I use the Union All instead?

    In SSIS 2005 you need to sort the source before Merging it, which might cause performance issues. What do you think?

    Thanks

    Romina

  • romina (8/18/2014)


    Koen, thank you for this. Very helpful!

    In the custom implementation example, is that a Merge or a Merge Join?

    Thanks

    Romina

    Neither.

    Which number in the screenshot of the data flow are we talking about?

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • 4 and 9

  • romina (8/18/2014)


    4 and 9

    Ah yes, those are UNION ALL components (designated by the (ALL) naming conventions).

    I used "merge" in their names because they combine two streams of data together.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Thanks Koen and Phil,

    I haven't got too far with this. Still struggling to load the data. I am not sure what else to do. It's getting to my nerves.

    Any other ideas?

    Basically I am trying to read a table with over 4 million records (96,000 employee's data that has been extracted weekly over the last 56 weeks - a lot of duplicate data for the same employee!), apply an SCD (which will reduce the number of records dramatically by getting rid of duplicates) and load it into a Data warehouse.

    Your input will be highly appreciated ...

    Thanks

    Romina

  • I understood you are loading the records from another system. Why not stage the records locally (on your local server) and perform your needed operations there while moving from the staging area to your destination (applying a SCD transform)? Are you sure you need nested looping?

    ----------------------------------------------------

  • romina (8/19/2014)


    Thanks Koen and Phil,

    I haven't got too far with this. Still struggling to load the data. I am not sure what else to do. It's getting to my nerves.

    Any other ideas?

    Basically I am trying to read a table with over 4 million records (96,000 employee's data that has been extracted weekly over the last 56 weeks - a lot of duplicate data for the same employee!), apply an SCD (which will reduce the number of records dramatically by getting rid of duplicates) and load it into a Data warehouse.

    Your input will be highly appreciated ...

    Thanks

    Romina

    Do you have any control on the process that inserts the data into the staging table?

    Anyway, 4 million rows is not that much for SQL Server. A few decent written TSQL statement can take care of your problem in a few minutes.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Koen Verbeeck (8/20/2014)


    Do you have any control on the process that inserts the data into the staging table?

    Anyway, 4 million rows is not that much for SQL Server. A few decent written TSQL statement can take care of your problem in a few minutes.

    Koen, I have full control of the process. I am the one designing the SSIS package to load the data into the staging table.

    The problem I have is that I need to implement a type 2 SCD. It's a business requirement to keep a history of employee records.

    The other problem I am running into is the hardware configuration. I am not normally one who blames the hardware - very easy to do - but I think this time the limitations on the hardware are making it a bit difficult for me to run this process in a quick and efficient way.

    Just the latest I have done is create a very simple DFT where I have the OLE DB source (a SQL command) and then SCD. It's been running for 12 hours now to load those 4 million records. This is unacceptable to me. The next stage of my project is to load another table with 10 million records ... I don't want to even imagine how long that will take if I don't do it more efficiently.

  • MMartin1 (8/19/2014)


    I understood you are loading the records from another system. Why not stage the records locally (on your local server) and perform your needed operations there while moving from the staging area to your destination (applying a SCD transform)? Are you sure you need nested looping?

    Thank you MMartin1 for your reply.

    I have already got my staging table locally with the 4 million records.

    The nested looping was just a way to not pass on so many records into the SCD. So with the nested looping I was taking one employee at a time and then processing all the records for all the weeks the data has been extracted for that employee. I have got 60 weeks worth of weekly employee data in a staging table at the moment.

    Can't think of any other way to do a set-based approach on this other than with the loop. I can do one week at a time, but each week will have between 75,000 to 80,000 records. I run into the same problem of long processing time.

Viewing 15 posts - 1 through 15 (of 27 total)

You must be logged in to reply to this topic. Login to reply