Validating meta data of xls file

  • Actually, skip all of that.  Most of it just doesn't matter.  I'll be back soon with a demo from the presentation I give on the subject of "Automating "Excel Hell"".

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ok... Demo time.

    The first thing is to unzip and copy the spreadsheets from the attached SpreadSheets.zip file to the directory of your choice.  If the directory is something other than C:\JBMTest, you'll need to change the demo code to match.

    Also understand that I give this demo on my laptop.  That essentially means that the files are "on the database server".  I've setup to do this to pull from other machines using UNCs but not for this demo.

    I also have to assume that you've installed the ACE drivers.  I use an older version of Excel because I think the new versions make some sucking sounds and so I refer to Microsoft.ACE.OLEDB.12.0 and Excel 12.0 in the code.  You may have to change those before you run the demonstration.

    I wrote this to get some work done for one of the companies I work for and to make my life easier in the future.  The spreadsheets from the zip file only have 3 things in common...

    1.  They all have 3 header rows of various configurations.  The first 2 rows do a bit of grouping according to month and year in most cases but that's also not necessary in the "totally different" spreadsheet.  The 3rd row must contain the labels for then actual attributes of the data.  You'll see what I mean after you open them to see what I mean.  We don't actually care, though, what then first header rows have in them.  They just used for "grouping", which can vary by quite a bit as you'll see.  We figure that out all automatically and we just don't care how many there are or how many columns there are in the rectangular "active area", which is also "self determining".
    2. All of the spreadsheets have a "common" area on the left to identify rows.  Some spreadsheets have two columns, some have 3.  We don't actually care because we figure that out on the fly.
    3. The data will automatically stop importing at the first row that doesn't have the common area labels.

    The reason why we don't care if people add columns to the data is because I wrote the code to handle that automatically and without having to putz around with code at the end of every month or when some adds another column in a grouping or whatever.  You also notice that I also don't care where the top left corner of the active area starts.   The ACE drivers actually take care of that for us.

    The results are stored in an EAV table.  Then columns describing the rows will automatically vary in number according to the common area in the spreadsheets.  Truly, PFM.

    I've also attached the code for the stored short stored procedure that susses all of this out and a script that demonstrates importing all of the spreadsheets with only a change in the file name.  I don't actually created the imports in permanent tables so you don't have to worry about all that.  Obviously, you want to ensure that you've created the stored procedure before you run the example usage demo.

    Have fun and understand why I say I don't actually care when people add or delete columns or rows. I didn't make it totally "AI" but it does a whole lot auto-magically. A tweak or two could make it handle a whole lot more without have to adjust a package, etc, etc, every time someone sneezes on a spreadsheet.  It does use a couple of Global Temp tables but  I'm working on that, as well.

    And now you know one of the many reasons I say I don't need SSIS for most things. 😀

    Ah... almost forgot.  If you really feel the urge to validate column names and all, that what one of the reasons why I dump this into a temp table is.  I never shoot data into the final resting place during an import.  It's just not safe to do so.  It's not like it's a Sunkist orange where you know what's inside. 😀

    Attachments:
    You must be logged in to view attached files.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • p.s.  If the OP has a spreadsheet with just one header row, we'll need to do a little work but not much.  We'll make it so it's self-healing just like in the demo sheets/code.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • p.p.s.  Almost forgot.  Sheet names that are weird or begin with a numeric digit must be encapsulated in single quotes along with the normal square brackets.  For example the sheet for 2020 would be named ['2020$'].

    There's also some pretty cool stuff you can do if you open the sheet as a linked server or if you use ADODB from SQL server including create new spreadsheets on the fly.  Add in a hint of DirTree and maybe xp_CmdShell and it's amazing what you can do with spreadsheets from SQL Server without SSIS.  I've gotta save a little of it for a short series on the subject, though. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff, thank you for taking the time to write all of this stuff out.

    Importing from Excel using SSIS is almost always a massive pain, so I am going to check this method out.

    The biggest problems I have tend to be related to data types, or the lack of them. If what you are doing gets round those issues, I'll be converted.

    Thanks again!

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Phil Parkin wrote:

    Importing from Excel using SSIS is almost always a massive pain, so I am going to check this method out.

    On this, I couldn't agree more. I honestly despise the "ACE" drivers; they are certainly not an "ace" at anything other than inferring the wrong data types and then errorring when it does. If I can ever direct an external provider to giving us a CSV instead of a xls(x) file, I almost always will.

    Will certainly have a look at this later too, Jeff, and appreciate the time you've taken. Might be a little bit before I do though.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • This is why I moved away from SSIS and ACE for spreadsheets with mixed data (ie numeric and alphanumeric) in the same column, unless I can guarantee the data. Even a single column with numeric data with leading zeros will be changed by the drivers.

    To get round this, and I will not put office on my SQL server, I wrote a program(run on an application server) to export the first sheet to tab delimited text file without changing the data, then I could import the data by any method I choose.

    • This reply was modified 4 years, 10 months ago by  David Burrows.

    Far away is close at hand in the images of elsewhere.
    Anon.

  • David Burrows wrote:

    This is why I moved away from SSIS and ACE for spreadsheets with mixed data (ie numeric and alphanumeric) in the same column, unless I can guarantee the data. Even a single column with numeric data with leading zeros will be changed by the drivers.

    To get round this, and I will not put office on my SQL server, I wrote a program(run on an application server) to export the first sheet to tab delimited text file without changing the data, then I could import the data by any method I choose.

    That's interesting. How does your program access the XLSX data, if not via the awful 'ACE' driver?

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Phil Parkin wrote:

    David Burrows wrote:

    This is why I moved away from SSIS and ACE for spreadsheets with mixed data (ie numeric and alphanumeric) in the same column, unless I can guarantee the data. Even a single column with numeric data with leading zeros will be changed by the drivers.

    To get round this, and I will not put office on my SQL server, I wrote a program(run on an application server) to export the first sheet to tab delimited text file without changing the data, then I could import the data by any method I choose.

    That's interesting. How does your program access the XLSX data, if not via the awful 'ACE' driver?

    XLSX is in the OOXML format, not too hard to work with and there are several other spreadsheet programs using the same format. The good thing about OOXML is that one only has to have some understanding of XML to digest the files into any RDBMS, no ACE drivers or other such needed.

    😎

     

  • Thanks, Eirikur, I've had a look into that in the past, now that you mention it.

    If anyone has a handy chunk of code, working something like this (with suitable param defaults), I would love to hear from them!

    ExtractDataFromXLSX 
    (SourceFilePath = ,
    SheetName = ,
    StartRow = ,
    EndRow = ,
    StartCol = ,
    EndCol = ,
    ExtractFilePath = ,
    ColSeparator = ,
    ColQualifier = )

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Hi Thom, I am trying to understand your question . What do you mean by Validation mechanism?

    I declared a variable with some default value 'the value is generated by script task' . This means that I use this variable in the send mail task . I thought that it shows the default value (during run time this will be replaced the file path) which will be useful for other colleagues in future when they want to know how does the file is attached to the send mail task. But Send mail task does not accept providing this comment as a default value but some valid path or no value. I don't know what is happening in here. But for now, I have given a path as default value.

    Thanks,
    Charmer

  • Phil Parkin wrote:

    David Burrows wrote:

    This is why I moved away from SSIS and ACE for spreadsheets with mixed data (ie numeric and alphanumeric) in the same column, unless I can guarantee the data. Even a single column with numeric data with leading zeros will be changed by the drivers.

    To get round this, and I will not put office on my SQL server, I wrote a program(run on an application server) to export the first sheet to tab delimited text file without changing the data, then I could import the data by any method I choose.

    That's interesting. How does your program access the XLSX data, if not via the awful 'ACE' driver?

    Not as clever as you think we had an app server that had Office installed for other reasons, so the program used Excel COM to access the file, it was an easy solution at the time.

    Far away is close at hand in the images of elsewhere.
    Anon.

  • Phil Parkin wrote:

    Thanks, Eirikur, I've had a look into that in the past, now that you mention it.

    If anyone has a handy chunk of code, working something like this (with suitable param defaults), I would love to hear from them!

    ExtractDataFromXLSX 
    (SourceFilePath = ,
    SheetName = ,
    StartRow = ,
    EndRow = ,
    StartCol = ,
    EndCol = ,
    ExtractFilePath = ,
    ColSeparator = ,
    ColQualifier = )

    I have a simple framework for this, prepping a talk and some articles on the subject, will keep you in the loop 😉

    😎

     

  • Eirikur Eiriksson wrote:

    Phil Parkin wrote:

    David Burrows wrote:

    This is why I moved away from SSIS and ACE for spreadsheets with mixed data (ie numeric and alphanumeric) in the same column, unless I can guarantee the data. Even a single column with numeric data with leading zeros will be changed by the drivers.

    To get round this, and I will not put office on my SQL server, I wrote a program(run on an application server) to export the first sheet to tab delimited text file without changing the data, then I could import the data by any method I choose.

    That's interesting. How does your program access the XLSX data, if not via the awful 'ACE' driver?

    XLSX is in the OOXML format, not too hard to work with and there are several other spreadsheet programs using the same format. The good thing about OOXML is that one only has to have some understanding of XML to digest the files into any RDBMS, no ACE drivers or other such needed.

    😎

    You will still need to unzip the file first.

    Far away is close at hand in the images of elsewhere.
    Anon.

  • David Burrows wrote:

    Eirikur Eiriksson wrote:

    Phil Parkin wrote:

    David Burrows wrote:

    This is why I moved away from SSIS and ACE for spreadsheets with mixed data (ie numeric and alphanumeric) in the same column, unless I can guarantee the data. Even a single column with numeric data with leading zeros will be changed by the drivers.

    To get round this, and I will not put office on my SQL server, I wrote a program(run on an application server) to export the first sheet to tab delimited text file without changing the data, then I could import the data by any method I choose.

    That's interesting. How does your program access the XLSX data, if not via the awful 'ACE' driver?

    XLSX is in the OOXML format, not too hard to work with and there are several other spreadsheet programs using the same format. The good thing about OOXML is that one only has to have some understanding of XML to digest the files into any RDBMS, no ACE drivers or other such needed.

    😎

    You will still need to unzip the file first.

    Unzipping is the simple part, only needs a path to a command line program in the path variable, much simpler than installing a driver. I normally use 7zip, works like a charm every time.

    😎

     

Viewing 15 posts - 16 through 30 (of 58 total)

You must be logged in to reply to this topic. Login to reply