Data flow concept - Thoughts wanted

  • Looking for thoughts on this.

    I have an application where we receive data from customers via text file.  Unfortunately, we basically receive a different file format for each customer.  For example, we may get this from Customer A:

    • First Name
    • Last Name
    • Birth Date (mmddyyyy but some can be mdyyyy)
    • Cust Field 1
    • Cust Field 2

    and this from Customer B:

    • Name Last
    • Name First
    • MI
    • DOB (mm/dd/yyyy)
    • Cust Field 1
    • Cust Field 3
    • Cust Field 5

    Currently, the data is manually loaded to a staging table where each field is a string.  It's loaded to the transactional table via a stored procedure (one for each company).  Then it's processed by another procedure where a lot of data cleanup happens.  The data cleanup really kills the performance of the processing we want to do.  I know there are tools that can handle multiple text files and put them into a common format but I can only use what we have in house which is SSIS.  I can't force the customers to use a specific format.

    We want to automate and make this process more efficient.  To that end, I see the workflow as:

    • an SSIS package that takes the company's text file and loads it to a company specific 'source' table.  The structure of the source table matches the file layout and the columns are all strings.
    • an SSIS package that takes the data from the company source table to a staging table.  The structure of the staging table matches the schema of my transaction table.  Data cleanup and error checking happens here.
    • an SSIS package that moves the data from staging to my transactional table where it can be processed.

This way I can do data cleanup that is specific to each customer.  Once the data is in the staging table, I can perform a common set of validations and a fairly straightforward move to my production system.
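For illustration, a rough sketch of what I'm picturing (table and column names here are made up for the example; the real source tables would match each customer's file layout):

-- Company-specific source table: mirrors Customer A's file, every column a string
CREATE TABLE dbo.SRC_CustomerA
(
    FirstName  varchar(100) NULL,
    LastName   varchar(100) NULL,
    BirthDate  varchar(20)  NULL,   -- mmddyyyy or mdyyyy, converted later
    CustField1 varchar(100) NULL,
    CustField2 varchar(100) NULL
);

-- Common staging table: matches the transaction table's schema, plus load/error columns
CREATE TABLE dbo.STG_Member
(
    CustomerID  int          NOT NULL,
    FirstName   varchar(100) NULL,
    LastName    varchar(100) NULL,
    BirthDate   date         NULL,
    CustField1  varchar(100) NULL,
    CustField2  varchar(100) NULL,
    LoadDate    datetime2(0) NOT NULL DEFAULT SYSDATETIME(),
    ErrorFlag   bit          NOT NULL DEFAULT 0,
    ErrorReason varchar(500) NULL
);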

    I don't like the idea that, if we have 10 customers, I'd have 10 packages to load to source, 10 to load to staging and 1 to load to the transaction table.  I can't really think of a more viable alternative.

    Does this seem reasonable?  Any suggestions for improvement?

    Thanks.

  • This is a good solution architecture, IMO. If different customers have different file formats, there really is no way of getting round the fact that you need different packages to load them and different tables into which to extract them.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

Depending on your relationship with your customers, could you push back and ask them to provide the file in a specific format?

Failing that, you need some method to convert the data from the customer-specific format to a normalized form that your system can consume.  I think your method should work.

The other thought would be to have a very wide table with one column per possible input, plus a customer lookup table that gives the columns meaning.
Your data import table would have the maximum number of columns a customer may give you, plus one for the customer ID: CustomerID, Col1, Col2, Col3 and so on.
Then you would have a customer lookup table that maps a customer ID to a customer name.
Your SSIS package would look at the customer ID (or name, as that would make maintenance a bit easier) to determine how it parses the input from the data import table.
The advantage of this method is that as your customer base grows, you just need to build new SSIS packages to convert the data import into something you can consume in the staging table.  The downside is that you end up with a very messy-looking table.
Depending on what your staging table is doing, you may be able to turn it into a view, remove the transition from data import to staging, and go straight from data import to the final table by selecting from the view.
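As a rough, hypothetical sketch (all table and column names invented for the example), the wide import table and the lookup might look something like this:

-- One wide import table shared by every customer; the columns only have meaning per customer
CREATE TABLE dbo.DataImport
(
    CustomerID int          NOT NULL,
    Col1       varchar(500) NULL,
    Col2       varchar(500) NULL,
    Col3       varchar(500) NULL
    -- ...repeat up to the maximum number of columns any customer sends
);

-- Lookup that tells each customer's SSIS package what the generic columns mean
CREATE TABLE dbo.CustomerColumnMap
(
    CustomerID int          NOT NULL,
    ColumnName sysname      NOT NULL,   -- e.g. 'Col1'
    MapsTo     varchar(100) NOT NULL,   -- e.g. 'LastName', 'DOB'
    CONSTRAINT PK_CustomerColumnMap PRIMARY KEY (CustomerID, ColumnName)
);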

    It really depends on the data.

    The above is all just my opinion on what you should do. 
As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • bmg002 - Friday, July 7, 2017 12:47 PM


Can't push back on the customers just yet.  It's a future state my boss and I want to get to.  We may be able to push all new customers toward a specific format, though.

    The wide table and view are interesting ideas but I'll probably keep to the one table per customer and a staging table.  I like to keep things small and (relatively) simple.  Easier to troubleshoot in six months when I can't remember why I did what I did :).

  • Tom_Hogan - Friday, July 7, 2017 11:36 AM


I feel your pain; I have been in this situation myself. On some projects we were able to push back and issue a controlled XML schema that all parties had to adhere to. We struggled where some data suppliers did not have the BI or IT team skills to implement it correctly. As long as no field contains more than 4,000 characters, XML is a good choice for working with SSIS.

  • Do any of these file formats have an industry "standard"?  I ask because I used to work in a finance company and there were a fixed number (unfortunately more than one) of "interfaces" to move financial data as accounts were serviced and transferred.  When we became associated with another outside company, we used one of these standard formats to import and export data.

If the incoming files have different naming conventions or are dropped into customer-specific folders, you can use one SSIS package with a Foreach Loop Container for each customer.  The staging tables, data flows and stored procedures in each container will be somewhat different for each customer, but all of them will be located in one place for future maintenance.  Each container would check for the presence of any incoming files and process them if they exist.

    Your business process will determine whether the staged data from each incoming file will be moved into customer-specific tables, or if one table can be created to hold data from all customers.  In your example of Customers A and B, you could have specific staging tables for each file layout, and one result table with First, Last, MI, DOB and Cust fields 1 through 5.  Then the customer-specific stored procedures would move data from the staging table to the applicable fields, converting the incoming string date formats to real dates for storage.
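For example (the column name and formats here are assumptions based on your post, not tested against your real data), the date cleanup inside one of those procedures could look something like this, using TRY_CONVERT so bad values come back as NULL instead of failing the load:

-- Normalize the mixed string date formats (yyyy-mm-dd, mm/dd/yyyy, mmddyyyy, mdyyyy) to a real DATE
SELECT  v.DOB,
        CleanDOB =
            CASE
                WHEN v.DOB LIKE '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'
                    THEN TRY_CONVERT(date, v.DOB, 120)                        -- yyyy-mm-dd
                WHEN v.DOB LIKE '[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]'
                    THEN TRY_CONVERT(date, v.DOB, 101)                        -- mm/dd/yyyy
                WHEN LEN(v.DOB) = 8                                           -- mmddyyyy -> yyyymmdd
                    THEN TRY_CONVERT(date, RIGHT(v.DOB, 4) + LEFT(v.DOB, 4), 112)
                WHEN LEN(v.DOB) = 6                                           -- mdyyyy -> yyyymmdd
                    THEN TRY_CONVERT(date, RIGHT(v.DOB, 4) + '0' + LEFT(v.DOB, 1)
                                           + '0' + SUBSTRING(v.DOB, 2, 1), 112)
            END
FROM   (VALUES ('1977-05-07'), ('05/07/1977'), ('05071977'), ('171948')) AS v(DOB);

Rows that come back NULL in CleanDOB can then be flagged as errors in staging instead of breaking the whole run.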

    It's a little cumbersome to maintain multiple file formats.  But adding an additional format for another customer becomes easier because the basic structure is already there.

    You might consider having a "control settings" table that will store the file naming conventions, folder names, etc for each type of file that you receive.  Then the SSIS package can read that info from the table, rather than having it hard-coded into the SSIS.  Also consider some sort of file history table that can be updated as the files are processed.  You can also use an SSIS file system task to move the processed file from the incoming drop-box folder to an archive folder, if that is part of your business requirements.
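A sketch of what such control and history tables could hold (the names and columns are just illustrative, not a prescription):

-- Control settings: one row per incoming file type, read by the SSIS package at run time
CREATE TABLE dbo.FileImportConfig
(
    CustomerID      int          NOT NULL,
    FileNamePattern varchar(200) NOT NULL,      -- e.g. 'CustA_*.txt'
    IncomingFolder  varchar(500) NOT NULL,
    ArchiveFolder   varchar(500) NOT NULL,
    FieldDelimiter  varchar(5)   NULL,          -- NULL for fixed-width layouts
    IsActive        bit          NOT NULL DEFAULT 1,
    CONSTRAINT PK_FileImportConfig PRIMARY KEY (CustomerID, FileNamePattern)
);

-- File history: updated as each file is picked up, loaded and archived
CREATE TABLE dbo.FileImportHistory
(
    FileImportID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerID   int          NOT NULL,
    FileName     varchar(500) NOT NULL,
    ReceivedDate datetime2(0) NOT NULL DEFAULT SYSDATETIME(),
    RowsLoaded   int          NULL,
    Status       varchar(20)  NOT NULL DEFAULT 'Received'   -- Received / Loaded / Failed / Archived
);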

  • I agree with Phil.

You're getting too hung up on the fact that the data from each customer is being used for the same business case. You're treating it as one data source when you should, in fact, treat it as multiple data sources that require multiple data flows. If you try to treat them as the same, it's only a matter of time before one of those customers changes something dramatic that you can't control and breaks everything. Choose to look at them as separate data sources, where a change only breaks that customer, not the entire business.

If they are essentially the same, just with different field organization and so forth, each new package should really be a copy of a previous one, adjusted to that specific customer. You're not talking about a lot of overhead here compared with something like a completely new, unknown data source.

  • Tom_Hogan - Friday, July 7, 2017 11:36 AM


You posted examples, but I don't know whether the list of column names is received vertically, as posted in your original post, or horizontally.  Either way, though, this has the potential of being a no-brainer to pull off.  I just need you to provide some anonymized but otherwise real data files and the expected CREATE TABLE statement(s) for the target table(s).  The data files only need 5 or 10 rows of data each for horizontal data and at least 5 full "records" of data if they're vertical.  Note that it may take more than one line to create a full record in the vertical format.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

I really appreciate everyone who has taken (or is taking) the time to help, so please read this with a thankful tone in mind.

    I'm not sure if I'm not being clear or if people are reading too much into what I'm asking. I'm not really looking for help on the specifics but more on the general idea.

We have an application that receives data from customers on a periodic basis (monthly, quarterly, etc.) in a flat file format. Each customer provides us the data in a different format (fixed width, CSV, tab delimited, etc.). While there are common columns, they may be named differently in each file. The formats they use for dates are also different in each file (mmddyyyy, mdyyyy, yyyy-mm-dd, etc.). Also, a customer may have up to 30 additional "custom" fields. The data is loaded into our application, processed, stored in our systems, and an output file is sent back to the customer based on the processing.

Currently, someone manually (using the import / export tool) loads the data into the main transaction table. The first step of the process then "cleans" the data and puts it into "fixed" fields in the transaction table. This step takes a lot of time, partly because it has to apply a lot of extra logic when cleaning / validating the data. Since we need to automate these loads, I also think it makes sense to do the data cleansing further upstream.

    What I want to do is use a load pattern that is similar to what I do for a data warehouse load:
    FILE_A -> SRC_A -> STG -> TRAN
    FILE_B -> SRC_B ----^

• Each customer file would go into its own source table. The source table matches the format of the customer file.
• The customer data is then loaded to a common staging table. The schema of the staging table matches the schema of the transaction table, with some additional fields (related to error checking). Data type conversion and error checking (i.e. validation) happen here.
• The cleaned data is loaded to the transaction table (inserted / updated as appropriate).
    • Processing then happens.
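Purely as a sketch of that staging-to-transaction step (the table and key names are placeholders, not our real schema), I'm picturing an update-then-insert of the rows that passed validation:

-- Update existing rows from staging, then insert the new ones; only rows that passed validation
UPDATE  t
SET     t.FirstName = s.FirstName,
        t.LastName  = s.LastName,
        t.BirthDate = s.BirthDate
FROM    dbo.TransactionTable AS t
JOIN    dbo.StagingTable     AS s
          ON  s.CustomerID = t.CustomerID
          AND s.MemberKey  = t.MemberKey
WHERE   s.ErrorFlag = 0;

INSERT  dbo.TransactionTable (CustomerID, MemberKey, FirstName, LastName, BirthDate)
SELECT  s.CustomerID, s.MemberKey, s.FirstName, s.LastName, s.BirthDate
FROM    dbo.StagingTable AS s
WHERE   s.ErrorFlag = 0
  AND   NOT EXISTS (SELECT 1
                    FROM   dbo.TransactionTable AS t
                    WHERE  t.CustomerID = s.CustomerID
                       AND t.MemberKey  = s.MemberKey);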

At my last job we were able to use a third-party ETL tool (Jitterbit) to take the raw data files and spit them out into a common file format. Of course, this just moved some of the logic further upstream and we still had a separate package per customer.

I start second-guessing myself if something seems like too obvious a solution, which is why I posted here. Does anyone see any major gotchas or have any tips / tricks that I may be missing?

    Long term, we plan on having our customers use a specific format but we don't have that leverage now.

    Thanks.

    Tom

Example data (pretend it's in any format: CSV, fixed width, pipe delimited, etc.)
    FILE_A
Last_Name First_Name DOB        Address_1   Address_2 City          State  Zip   Custom
    Smith     John       1977-05-07 123 Main St           Nowheresville  TX    45612 Red

    FILE_B
    NameLast  NameFirst Address      City    State Zip   BirthDate
    Jones     Martha    45 Maple Ln  Anytown CA    55555 171948

  • Tom_Hogan - Monday, July 17, 2017 7:11 AM


    Heh... Ok... in general...

    Step 1:  Check with the people that are currently doing this manually and write a process/functional flow chart.
Step 2:  Write the code to import the first two lines as blobs, analyze what is there, compare against existing customer profiles in the form of name substitution and perceived file type, etc., and either use one of those or check against a mapping table to generate a new one, along with a report of what could and could not be auto-magically resolved, all based on the process flow chart.

    Let us know when you're ready for some specifics. 😉
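In the meantime, just to make the "customer profile" and "mapping table" part of step 2 concrete, here's a rough, hypothetical sketch (all names invented, not actual code for your system):

-- One profile per known header line; the first line read from an incoming file is matched against HeaderText
CREATE TABLE dbo.FileProfile
(
    ProfileID  int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerID int           NOT NULL,
    HeaderText varchar(2000) NOT NULL,   -- e.g. 'NameLast|NameFirst|Address|City|State|Zip|BirthDate'
    Delimiter  varchar(5)    NULL        -- NULL for fixed-width files
);

-- Per-profile mapping of incoming column names to the standard staging columns
CREATE TABLE dbo.FileProfileColumn
(
    ProfileID     int          NOT NULL REFERENCES dbo.FileProfile (ProfileID),
    SourceColumn  varchar(128) NOT NULL,   -- e.g. 'NameLast'
    StagingColumn varchar(128) NOT NULL,   -- e.g. 'LastName'
    CONSTRAINT PK_FileProfileColumn PRIMARY KEY (ProfileID, SourceColumn)
);

-- Given the header line of a new file, find a known profile or report that one needs to be created
DECLARE @HeaderLine varchar(2000) = 'NameLast|NameFirst|Address|City|State|Zip|BirthDate';
SELECT  ProfileID, CustomerID
FROM    dbo.FileProfile
WHERE   HeaderText = @HeaderLine;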

    --Jeff Moden



  • So how'd things work out for you?

    --Jeff Moden



Converting incoming CSV files to XML in a generic field-name/field-value format will help solve the integration problem for you quite quickly by allowing you to use one SSIS package to process all your data partners' files in this common XML format.

    Converting CSV files to a generic field-name/field-value XML format can be done with this one line of PowerShell:

    Import-Csv -Path "C:\@\1.csv" | Export-Clixml -Path "C:\@\1.xml";

    Over time, you can work with your more tech savvy data partners to optionally provide data to you in XML format so you don't even have to convert them.

This generic CSV file looks like this XML after converting with the above PowerShell:
    FirstName,LastName,Age
    John,Doe,35
    Jane,Doe,37
    William,Smith,74

    <Objs Version="1.1.0.1" xmlns="http://schemas.microsoft.com/powershell/2004/04">
    <Obj RefId="0">
      <TN RefId="0">
      <T>System.Management.Automation.PSCustomObject</T>
      <T>System.Object</T>
      </TN>
      <MS>
      <S N="FirstName">John</S>
      <S N="LastName">Doe</S>
      <S N="Age">35</S>
      </MS>
    </Obj>
    <Obj RefId="1">
      <TNRef RefId="0" />
      <MS>
      <S N="FirstName">Jane</S>
      <S N="LastName">Doe</S>
      <S N="Age">37</S>
      </MS>
    </Obj>
    <Obj RefId="2">
      <TNRef RefId="0" />
      <MS>
      <S N="FirstName">William</S>
      <S N="LastName">Smith</S>
      <S N="Age">74</S>
      </MS>
    </Obj>
    </Objs>

Note that if your data partners provide you files with no column headers, or with inconsistent ones, you may need to fix up the files slightly before converting to XML so that your tag names line up with what your SSIS package will expect.
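If it helps further, one possible way (a sketch only, matched to the sample XML above; adjust the element names and data types to each feed) to shred that field-name/field-value XML back into columns on the SQL Server side is with nodes() and value():

-- Shred the PowerShell Export-Clixml output back into relational columns
DECLARE @x xml = N'
<Objs Version="1.1.0.1" xmlns="http://schemas.microsoft.com/powershell/2004/04">
  <Obj RefId="0">
    <MS>
      <S N="FirstName">John</S>
      <S N="LastName">Doe</S>
      <S N="Age">35</S>
    </MS>
  </Obj>
</Objs>';

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/powershell/2004/04')
SELECT  FirstName = o.value('(MS/S[@N="FirstName"])[1]', 'varchar(100)'),
        LastName  = o.value('(MS/S[@N="LastName"])[1]',  'varchar(100)'),
        Age       = o.value('(MS/S[@N="Age"])[1]',       'varchar(10)')
FROM    @x.nodes('/Objs/Obj') AS t(o);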

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Jeff Moden - Tuesday, August 22, 2017 8:50 PM

    So how'd things work out for you?

    My priorities have been shifted.

While poking around I found that our SQL Servers were really neglected: minimal resources allocated to them, no patching, and no monitoring.  I came back with a series of recommendations for the server I was working on.  Come to find out, the infrastructure team sets up the SQL Servers and our DBA group has minimal SQL Server experience; they have DB2 backgrounds (used by the core business line).  The good news is that the DBAs want to learn SQL Server and have been really great to work with.  I'm now working with the two groups to establish best practices, as well as roles and responsibilities for who supports what (a fair amount of company politics is involved).  There's also been the question of whether it makes more sense for me to be involved with improving our other applications first (i.e. the more lucrative ones).  In the meantime, we have an intern manually loading the text files to staging tables.

All of my group's applications receive data files from customers which then need to be staged, validated, and processed.  The results are usually then sent back to the customers.  The problem is that we don't receive a consistent format from customer to customer for most of the applications, and for some of them we can't push back to the customer if a file doesn't match the requested format (long story).

I think we'll have to build one package per customer file that gets the data to a common staging table, where it is cleaned and validated, then picked up for processing.  I've been looking at PowerShell and some third-party tools to see if they have easier options for manipulating the text files to put them into a common format, which could then be picked up by SSIS.  Any thoughts or experiences are still welcome.

  • Tom_Hogan - Wednesday, August 23, 2017 7:11 AM


    Tom, Not sure if you saw my post but I provided a concise solution involving PowerShell for consideration. I think it might save you from having to maintain 200+ Packages and creating n more over time as you onboard more data partners.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Orlando Colamatteo - Wednesday, August 23, 2017 1:12 PM

    Tom, Not sure if you saw my post but I provided a concise solution involving PowerShell for consideration. I think it might save you from having to maintain 200+ Packages and creating n more over time as you onboard more data partners.

    I started playing with PowerShell to put the data into a common format based on your suggestion.  I might go pipe delimited rather than XML for SSIS use.  Pretty cool that we can do that type of conversion with only a couple of lines of code.

    Thanks.
