Importing XML Files into a flat table

  • I'm working on a process to import XML files into flat table using SSIS.  I've been trying to make use of the Built in XML Datasource.  I have it working to an extent but it seems somewhat slow and cumbersome due to a number of sorts I need to add in order to get the related elements joined and into the correct format to write it to my flat table.

    I've seen a number of people online suggesting other third party Tools (CozyRoc, ZappySys etc) to make this development effort easier and to work around some of the limitations of the Built in functionality.  

    Has anyone used any of these plugins and what were the pros/cons you found while using them.

    Thanks,
    Luke.

    To help us help you read this[/url]For better help with performance problems please read this[/url]

  • Luke L - Friday, July 13, 2018 10:51 AM

    I'm working on a process to import XML files into flat table using SSIS.  I've been trying to make use of the Built in XML Datasource.  I have it working to an extent but it seems somewhat slow and cumbersome due to a number of sorts I need to add in order to get the related elements joined and into the correct format to write it to my flat table.

    I've seen a number of people online suggesting other third party Tools (CozyRoc, ZappySys etc) to make this development effort easier and to work around some of the limitations of the Built in functionality.  

    Has anyone used any of these plugins and what were the pros/cons you found while using them.

    Thanks,
    Luke.

    Have you tried considered the data into a 'standard' SQL staging table and then using XQuery in T-SQL to process the data the way you want?

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • It's next on my list as a POC, but since the XML Datatype is limited to 2 GB and we may have files that are larger than that, it may be an issue down the road.  I figured out put out feelers to people on the Third party plugs ins while I worked on that.

    Also,the data is stored in a number of complex nested types which make it somewhat difficult to deal with and I'm having to make up the XSD myself since the source system doesn't provide one.  It's been fun so far...  I'm trying to make it easy for the next guy to maintain even if that next guy is me!

    Thanks,
    -Luke.

    To help us help you read this[/url]For better help with performance problems please read this[/url]

  • Luke L - Friday, July 13, 2018 12:08 PM

    It's next on my list as a POC, but since the XML Datatype is limited to 2 GB and we may have files that are larger than that, it may be an issue down the road.  I figured out put out feelers to people on the Third party plugs ins while I worked on that.

    Also,the data is stored in a number of complex nested types which make it somewhat difficult to deal with and I'm having to make up the XSD myself since the source system doesn't provide one.  It's been fun so far...  I'm trying to make it easy for the next guy to maintain even if that next guy is me!

    Thanks,
    -Luke.

    If your C# is up to it, you could also consider creating a Script Component source. That would give you total control, without the need for third-party add-ons.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Phil Parkin - Friday, July 13, 2018 12:15 PM

    Luke L - Friday, July 13, 2018 12:08 PM

    It's next on my list as a POC, but since the XML Datatype is limited to 2 GB and we may have files that are larger than that, it may be an issue down the road.  I figured out put out feelers to people on the Third party plugs ins while I worked on that.

    Also,the data is stored in a number of complex nested types which make it somewhat difficult to deal with and I'm having to make up the XSD myself since the source system doesn't provide one.  It's been fun so far...  I'm trying to make it easy for the next guy to maintain even if that next guy is me!

    Thanks,
    -Luke.

    If your C# is up to it, you could also consider creating a Script Component source. That would give you total control, without the need for third-party add-ons.

    Phil, C# and I don't really get along that well.  I only know enough to be dangerous and make my Biml scripts do what I need to for my ETL frameworks.  I may give it a go later on if all else fails.

    Thanks, 
    -Luke.

    To help us help you read this[/url]For better help with performance problems please read this[/url]

  • Luke L - Friday, July 13, 2018 12:28 PM

    Phil Parkin - Friday, July 13, 2018 12:15 PM

    Luke L - Friday, July 13, 2018 12:08 PM

    It's next on my list as a POC, but since the XML Datatype is limited to 2 GB and we may have files that are larger than that, it may be an issue down the road.  I figured out put out feelers to people on the Third party plugs ins while I worked on that.

    Also,the data is stored in a number of complex nested types which make it somewhat difficult to deal with and I'm having to make up the XSD myself since the source system doesn't provide one.  It's been fun so far...  I'm trying to make it easy for the next guy to maintain even if that next guy is me!

    Thanks,
    -Luke.

    If your C# is up to it, you could also consider creating a Script Component source. That would give you total control, without the need for third-party add-ons.

    Phil, C# and I don't really get along that well.  I only know enough to be dangerous and make my Biml scripts do what I need to for my ETL frameworks.  I may give it a go later on if all else fails.

    Thanks, 
    -Luke.

    If it comes to that & you get stuck, I can probably get a simple package up and running pretty quickly to help you get started.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Phil Parkin - Friday, July 13, 2018 12:40 PM

    Luke L - Friday, July 13, 2018 12:28 PM

    Phil Parkin - Friday, July 13, 2018 12:15 PM

    Luke L - Friday, July 13, 2018 12:08 PM

    It's next on my list as a POC, but since the XML Datatype is limited to 2 GB and we may have files that are larger than that, it may be an issue down the road.  I figured out put out feelers to people on the Third party plugs ins while I worked on that.

    Also,the data is stored in a number of complex nested types which make it somewhat difficult to deal with and I'm having to make up the XSD myself since the source system doesn't provide one.  It's been fun so far...  I'm trying to make it easy for the next guy to maintain even if that next guy is me!

    Thanks,
    -Luke.

    If your C# is up to it, you could also consider creating a Script Component source. That would give you total control, without the need for third-party add-ons.

    Phil, C# and I don't really get along that well.  I only know enough to be dangerous and make my Biml scripts do what I need to for my ETL frameworks.  I may give it a go later on if all else fails.

    Thanks, 
    -Luke.

    If it comes to that & you get stuck, I can probably get a simple package up and running pretty quickly to help you get started.

    I would very much appreciate that.  I'll give the XQuery bit a try and see if I can make that work and go from there.
    Thanks Phil!

    To help us help you read this[/url]For better help with performance problems please read this[/url]

  • Luke L - Friday, July 13, 2018 12:08 PM

    It's next on my list as a POC, but since the XML Datatype is limited to 2 GB and we may have files that are larger than that, it may be an issue down the road.  I figured out put out feelers to people on the Third party plugs ins while I worked on that.

    Also,the data is stored in a number of complex nested types which make it somewhat difficult to deal with and I'm having to make up the XSD myself since the source system doesn't provide one.  It's been fun so far...  I'm trying to make it easy for the next guy to maintain even if that next guy is me!

    Thanks,
    -Luke.

    I guess my question is... knowing those limits and the expected sizes, why are you using XML at all for this?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Luke L - Friday, July 13, 2018 10:51 AM

    I'm working on a process to import XML files into flat table using SSIS.  I've been trying to make use of the Built in XML Datasource.  I have it working to an extent but it seems somewhat slow and cumbersome due to a number of sorts I need to add in order to get the related elements joined and into the correct format to write it to my flat table.

    I've seen a number of people online suggesting other third party Tools (CozyRoc, ZappySys etc) to make this development effort easier and to work around some of the limitations of the Built in functionality.  

    Has anyone used any of these plugins and what were the pros/cons you found while using them.

    Thanks,
    Luke.

    Quick advice, if you expect your XML to exceed the 2Gb limit, then use alternative methods such as .Net stream reader etc. Handling these sizes in T-SQL and XQuery is doable but it is a complex exercise.
    😎

  • Jeff Moden - Friday, July 13, 2018 8:55 PM

    Luke L - Friday, July 13, 2018 12:08 PM

    It's next on my list as a POC, but since the XML Datatype is limited to 2 GB and we may have files that are larger than that, it may be an issue down the road.  I figured out put out feelers to people on the Third party plugs ins while I worked on that.

    Also,the data is stored in a number of complex nested types which make it somewhat difficult to deal with and I'm having to make up the XSD myself since the source system doesn't provide one.  It's been fun so far...  I'm trying to make it easy for the next guy to maintain even if that next guy is me!

    Thanks,
    -Luke.

    I guess my question is... knowing those limits and the expected sizes, why are you using XML at all for this?

    Sorry... I looking back on my post, that sounds like I'm being a bit short about it and that's not my intention.

    As Eirikur states, XML file sizes over 2GB are difficult to handle even with some of the right tools.  Is there no way to get the provider of the data to provide the data using a less verbose file format (as you know, most XML files are conservatively 8 to 16 times larger than the data the actually contain) or break the file up from their end?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Saturday, July 14, 2018 7:16 AM

    Jeff Moden - Friday, July 13, 2018 8:55 PM

    Luke L - Friday, July 13, 2018 12:08 PM

    It's next on my list as a POC, but since the XML Datatype is limited to 2 GB and we may have files that are larger than that, it may be an issue down the road.  I figured out put out feelers to people on the Third party plugs ins while I worked on that.

    Also,the data is stored in a number of complex nested types which make it somewhat difficult to deal with and I'm having to make up the XSD myself since the source system doesn't provide one.  It's been fun so far...  I'm trying to make it easy for the next guy to maintain even if that next guy is me!

    Thanks,
    -Luke.

    I guess my question is... knowing those limits and the expected sizes, why are you using XML at all for this?

    Sorry... I looking back on my post, that sounds like I'm being a bit short about it and that's not my intention.

    As Eirikur states, XML file sizes over 2GB are difficult to handle even with some of the right tools.  Is there no way to get the provider of the data to provide the data using a less verbose file format (as you know, most XML files are conservatively 8 to 16 times larger than the data the actually contain) or break the file up from their end?

    Were working with the vendor to create a custom table (its coming from db2) where i can copy the data from via oledb connections or whatnot.  This would be the best solution, but is not garunteed.  Ive been asked to explore other options if that wont work, or if it cant be done by our deadline.  Unfortunately thier application only provides data files as xml, there is no garuntee on the size of the xml files but given the information they are exchanging and our volume of data i could see the files getting big quickly.  Im trying to future proof this process in case the other method falls through.

    I need to implement a solution that will last rather than something that will choke a few weeks down the road.

    Thanks for the replies.
    Luke.

    To help us help you read this[/url]For better help with performance problems please read this[/url]

  • Luke L - Saturday, July 14, 2018 8:17 AM

    Jeff Moden - Saturday, July 14, 2018 7:16 AM

    Jeff Moden - Friday, July 13, 2018 8:55 PM

    Luke L - Friday, July 13, 2018 12:08 PM

    It's next on my list as a POC, but since the XML Datatype is limited to 2 GB and we may have files that are larger than that, it may be an issue down the road.  I figured out put out feelers to people on the Third party plugs ins while I worked on that.

    Also,the data is stored in a number of complex nested types which make it somewhat difficult to deal with and I'm having to make up the XSD myself since the source system doesn't provide one.  It's been fun so far...  I'm trying to make it easy for the next guy to maintain even if that next guy is me!

    Thanks,
    -Luke.

    I guess my question is... knowing those limits and the expected sizes, why are you using XML at all for this?

    Sorry... I looking back on my post, that sounds like I'm being a bit short about it and that's not my intention.

    As Eirikur states, XML file sizes over 2GB are difficult to handle even with some of the right tools.  Is there no way to get the provider of the data to provide the data using a less verbose file format (as you know, most XML files are conservatively 8 to 16 times larger than the data the actually contain) or break the file up from their end?

    Were working with the vendor to create a custom table (its coming from db2) where i can copy the data from via oledb connections or whatnot.  This would be the best solution, but is not garunteed.  Ive been asked to explore other options if that wont work, or if it cant be done by our deadline.  Unfortunately thier application only provides data files as xml, there is no garuntee on the size of the xml files but given the information they are exchanging and our volume of data i could see the files getting big quickly.  Im trying to future proof this process in case the other method falls through.

    I need to implement a solution that will last rather than something that will choke a few weeks down the road.

    Thanks for the replies.
    Luke.

    We did an experiment a while back at work.  They won't give me and the DB2 guy that worked with me on the experiment time to bring it to fruition but we successfully transferred 2 million wide rows from DB2 to a file and then simply BULK INSERTed the data and it only took several minutes for the whole operation.  There is a way for SQL Server to trigger the process on the DB2 boxes (with the correct privs being established) but they didn't give us the time to get that far.  The other fellow and I are considering doing it on our "own time" just to finish the POC so that we can stop the current silliness of hours-long transfers that unnecessarily "fill the pipe" and "stress the servers".

    Because we did the start of the POC more than 2 years ago, I don't remember the exquisite details but I can resurrect them for you if there's a chance of change for the better.  Therein lies the majority of the problem and the problem for both thee and me is overcoming the "culture" that perpetuates the horror of using XML and the "good enough" and "we've always done it this way" and "we've never done it that way" and "this is how everyone else does it" lunacy that permeates corporate systems and the attitudes of the people working for the corporation.  They end up couching all that nonsense by encapsulating it all in the words "Well, 'Best Practices' state that we should do it this way" and nothing could be further from the truth.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply