ETL Processes and SOX

  • I've managed to largely avoid SOX ... up until now.
    I'd like to solicit opinions from those who know about this stuff.
    Consider the following scenario:
    1) 3rd party provides Excel spreadsheet containing data
    2) Our ETL process loads data from spreadsheet
    3) Our ETL process applies some transformations, calculates some results and produces output, for reporting purposes.

    Among other things, I have been told that we need to come up with the following SOX control:
    Find a way of validating that the contents of the source Excel file exactly match what has been loaded to the SQL database

    It seems obvious to me that we cannot use SSIS to perform this test, as it would merely repeat the same process used to load the data and would therefore always produce a match.
    So that brings me to my questions, which relate to part (2) of the above scenario:
    a) Is this definitely a SOX requirement?
    b) How are others doing this? Manually? Using some other tool which does not rely on the risible ACE driver?

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • I am not sure what is exactly required for SOX complaince, but we are apparently SOX complaint and don't really do a lot with respect to our ETL process.

    Our auditors (from a SQL side of things) only care about our financial data.  So any database that hold some form or flavor of the financial system's data (for example, if we copy product names from the financial system to the reporting system, the reporting system gets flagged for auditing)., they don't seem to care too much.

    As for verifying the ETL process, basically as long as it doesn't fail, we are happy.  We don't do any excel -> SQL stuff that the auditors look at, but we do table->table stuff.  With this, if the auditors do ask, we just provide a row count.  If the row counts are similar, then we are good.  Plus we track all end user complaints and we have a change management system in place to keep track of any changes to any process (ETL or otherwise).  We use DLM Dashboard to keep track of schema changes.

    When I read through our SOX compliance stuff, I don't remember seeing anything that said you needed to validate that the ETL process completed successfully.  It probably would be pretty easy to build up a powershell script that would pull all of the data in the Excel file then pull all the data in SQL and show you differences though if they really needed that.  Might not hurt to talk to the auditors and get them to tell you what you need to do and what other companies are doing.
    I know when they asked us to provide a list of all database changes we asked them about tools that other people used and they had no idea.  So we asked what type of changes they were interested in (data, structure, both?) and they said "we don't really know".  So we showed them the DLM dashboard tool and they said that should work (and it is free).
    NOTE - I do not work for redgate who makes DLM Dashboard, but it is a neat tool for free.  Although I broke it somehow and support's suggestion was to blow everything away and start fresh as it is busted good.  So if you do use that tool, make sure you check it regularly and back it up frequently so you can recover if things go bad OR be prepared to lose all the data and start fresh.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • a) Depends on what is in the Excel file but I've seen similar requirements.
    b) When required, we just used an auditing policy for the file/folder.
    I don't know exactly what they had configured for the auditing but it appeased the auditors.

    Sue

  • Sue_H - Monday, May 8, 2017 10:57 AM

    a) Depends on what is in the Excel file but I've seen similar requirements.
    b) When required, we just used an auditing policy for the file/folder.
    I don't know exactly what they had configured for the auditing but it appeased the auditors.

    Sue

    Thanks, Sue.
    Can you expand a little on what you mean by an 'auditing policy', please? Does this include a source/target reconciliation of some sort?

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • GPO auditing on all the files, folders involved. That was used to show that there were no changes. The files were moved from where they were loaded into the domain to another file/folder. Then this was kept for historical reasons. All of the loads needed a date/time stamp and the source file name and application loading it. We were essentially proving that there were no changes to the original file and that the source file could be accessed at any time for validation.
    I would guess someone somewhere along the lines wrote some policy for sporadic spot checks or it fired an alert for different conditions,something like that. A lot of times, the auditors seem more concerned about the policies in place to address things.
    But all of that is what kept them happy. Their whole thing was about making sure the correct data was loaded and no one accessed or changed the file.

    Sue

  • Phil Parkin - Monday, May 8, 2017 8:23 AM

    ...
    Among other things, I have been told that we need to come up with the following SOX control:
    Find a way of validating that the contents of the source Excel file exactly match what has been loaded to the SQL database
    ...

    Depending on what type of transformation is taking place in this Extract Transform Load process, you may not want what's loaded in the database to exactly match what is in the Excel file.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell - Monday, May 8, 2017 12:48 PM

    Phil Parkin - Monday, May 8, 2017 8:23 AM

    ...
    Among other things, I have been told that we need to come up with the following SOX control:
    Find a way of validating that the contents of the source Excel file exactly match what has been loaded to the SQL database
    ...

    Depending on what type of transformation is taking place in this Extract Transform Load process, you may not want what's loaded in the database to exactly match what is in the Excel file.

    Hi Eric, in this instance, we do. This is the Extract part of the whole process and we took the decision to keep third-party data here in as untouched a condition as is practical, so that we can always trace forward from what was provided to what was reported.
    The downstream processes perform all the necessary transformations and calculations.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Sue_H - Monday, May 8, 2017 12:25 PM

    GPO auditing on all the files, folders involved. That was used to show that there were no changes. The files were moved from where they were loaded into the domain to another file/folder. Then this was kept for historical reasons. All of the loads needed a date/time stamp and the source file name and application loading it. We were essentially proving that there were no changes to the original file and that the source file could be accessed at any time for validation.
    I would guess someone somewhere along the lines wrote some policy for sporadic spot checks or it fired an alert for different conditions,something like that. A lot of times, the auditors seem more concerned about the policies in place to address things.
    But all of that is what kept them happy. Their whole thing was about making sure the correct data was loaded and no one accessed or changed the file.

    Sue

    So it was all about protecting the integrity of the source file itself, rather than its loading and onward processing. Interesting.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Phil Parkin - Monday, May 8, 2017 2:07 PM

    So it was all about protecting the integrity of the source file itself, rather than its loading and onward processing. Interesting.

    I've seen so many where the requirements were ridiculous or something that can't reasonably be done. Most of the time we clarified those with the auditors - sometimes it was them not really understanding a process or how something worked and sometimes it was whoever relayed the information. For the most part, they are reasonable about those. And we did have an odd one like yours - can't remember the wording. But the import did transformations and there is no way to show its the same as the data we received. After talking with them, they were mostly concerned about the data received being changed outside of any documented processes or by an unauthorized person or accessed by someone who shouldn't have access. We did have to document everything around those processes -providing proof of source control, change management, testing, auditing, etc. 

    Sue

  • Phil Parkin - Monday, May 8, 2017 2:06 PM

    Eric M Russell - Monday, May 8, 2017 12:48 PM

    Phil Parkin - Monday, May 8, 2017 8:23 AM

    ...
    Among other things, I have been told that we need to come up with the following SOX control:
    Find a way of validating that the contents of the source Excel file exactly match what has been loaded to the SQL database
    ...

    Depending on what type of transformation is taking place in this Extract Transform Load process, you may not want what's loaded in the database to exactly match what is in the Excel file.

    Hi Eric, in this instance, we do. This is the Extract part of the whole process and we took the decision to keep third-party data here in as untouched a condition as is practical, so that we can always trace forward from what was provided to what was reported.
    The downstream processes perform all the necessary transformations and calculations.

    So, it's essentially a staging table that you're comparing back to Excel.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • To compare the contents of an Excel sheet to a staging table having a similar structure, you can leverage the OPENROWSET function like so:

    SELECT * FROM OPENROWSET(
    ‘Microsoft.ACE.OLEDB.12.0’
    ,‘Excel 12.0;Database=C:\Temp\OrderValues.xlsx;HDR=YES’,results)

    Then leverage the EXCEPT clause to union the Excel rowset with the staging table to return any rows that are different.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell - Tuesday, May 9, 2017 8:46 AM

    To compare the contents of an Excel sheet to a staging table having a similar structure, you can leverage the OPENROWSET function like so:

    SELECT * FROM OPENROWSET(
    ‘Microsoft.ACE.OLEDB.12.0’
    ,‘Excel 12.0;Database=C:\Temp\OrderValues.xlsx;HDR=YES’,results)

    Then leverage the EXCEPT clause to union the Excel rowset with the staging table to return any rows that are different.

    Except that uses the ACE driver, which is what was used to import the data in the first place. As I am sure you are aware, this driver has a tendency to alter data in transit because it thinks it knows best.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Eric M Russell - Tuesday, May 9, 2017 8:41 AM

    So, it's essentially a staging table that you're comparing back to Excel.

    Getting into semantics a little, but in an ETL process, 'Staging' usually refers to data after it has been transformed and 'Extract' to the raw data. So I am talking about a comparison of Extract (raw file) to Extract (database) ... to validate the import process.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Phil Parkin - Tuesday, May 9, 2017 8:58 AM

    Eric M Russell - Tuesday, May 9, 2017 8:46 AM

    To compare the contents of an Excel sheet to a staging table having a similar structure, you can leverage the OPENROWSET function like so:

    SELECT * FROM OPENROWSET(
    ‘Microsoft.ACE.OLEDB.12.0’
    ,‘Excel 12.0;Database=C:\Temp\OrderValues.xlsx;HDR=YES’,results)

    Then leverage the EXCEPT clause to union the Excel rowset with the staging table to return any rows that are different.

    Except that uses the ACE driver, which is what was used to import the data in the first place. As I am sure you are aware, this driver has a tendency to alter data in transit because it thinks it knows best.

    Yes, but subjective interpretation also applies when humans are reading an Excel document. That's why Excel sucks as a data exchange format. I prefer TAB delimited format.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell - Tuesday, May 9, 2017 9:14 AM

    Yes, but subjective interpretation also applies when humans are reading an Excel document. That's why Excel sucks as a data exchange format. I prefer TAB delimited format.

    No argument there. I'd prefer mirror-image quad-byte Kanji, never mind tab-delimited ASCII.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply