Importing - best approach to compare Excel row counts against the table created

  • Hi

    I'm using a data flow task to convert a number of csv source files into a number of SQL tables

    Ideally, I want to ensure that the number of rows match

    What's the best approach?

    I can use rowcount on the Excel file and compare that to a 'select' from the destination table

    Then write a message to a low table if they differ

    That works using variables

    Problem is, I have a number of these comparisons and I don't want to keep adding to my list of variables

    Let's say I have tableA.xls, tableB.xls and tableC.cls writing to tableA, tableB and tableC

    How would I (for example as there may be a better way) pass table<X> has a variable to the data flow

    Note, all table structures differ so I can't just create a Forech Loop container (at least I think I can't)

    Thanks

    Damian.

    Thanks

    Damian.

    - Damian

  • DamianC (10/20/2016)


    Hi

    I'm using a data flow task to convert a number of csv source files into a number of SQL tables

    Ideally, I want to ensure that the number of rows match

    What's the best approach?

    I can use rowcount on the Excel file and compare that to a 'select' from the destination table

    Then write a message to a low table if they differ

    That works using variables

    Problem is, I have a number of these comparisons and I don't want to keep adding to my list of variables

    Let's say I have tableA.xls, tableB.xls and tableC.cls writing to tableA, tableB and tableC

    How would I (for example as there may be a better way) pass table<X> has a variable to the data flow

    Note, all table structures differ so I can't just create a Forech Loop container (at least I think I can't)

    Thanks

    Damian.

    Thanks

    Damian.

    First point to note is that a CSV file is a far far better thing to be importing from than an Excel file. CSV files are usually referred to as 'flat files'.

    I would put this into a foreach loop, with a logging task writing to some sort of logging table (RunId, Filename, RowsInFile, RowsImported sort of thing). An ExecuteSQL task run after the data flow should be able to populate the table with all of the necessary data.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Phil Parkin (10/20/2016)


    First point to note is that a CSV file is a far far better thing to be importing from than an Excel file.

    "It Depends". If someone exported a spreadsheet as a CSV file, you have all of the same problems plus the addition problem of dealing with only occasional quoted identifiers on text-based items that have a comma in them. Huge PITA compared to reading from the Excel file directly using ACE Drivers.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/20/2016)


    Phil Parkin (10/20/2016)


    First point to note is that a CSV file is a far far better thing to be importing from than an Excel file.

    "It Depends". If someone exported a spreadsheet as a CSV file, you have all of the same problems plus the addition problem of dealing with only occasional quoted identifiers on text-based items that have a comma in them. Huge PITA compared to reading from the Excel file directly using ACE Drivers.

    Actually, even then, you don't have all the same problems.

    Knowing your SSIS proficiency level, you'll have to trust me on this:-) An important difference is that, with flat files in SSIS, you can set the datatype of all the underlying columns manually, rather than have the Excel driver sometimes insist that you are importing a numeric() when you know it should be a varchar(), then throw an error when it hits the (valid) string on row 5,000. That Excel driver is a steaming purulent carbuncle.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Thanks for the response

    How would I use a foreach Loop if the files are structurally different?

    - Damian

  • DamianC (10/20/2016)


    Thanks for the response

    How would I use a foreach Loop if the files are structurally different?

    Good point. I did not read your initial post closely enough!

    But it is still possible: imagine the scenario where you have three files and three different data flows.

    Your FEL can populate a variable containing the 'current' file name (ie, the one relevant to the current iteration of the FEL). Use precedence constraints within the FEL to control which of the Data Flows gets executed, then, finally, update the log, as shown.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Phil Parkin (10/20/2016)


    Jeff Moden (10/20/2016)


    Phil Parkin (10/20/2016)


    First point to note is that a CSV file is a far far better thing to be importing from than an Excel file.

    "It Depends". If someone exported a spreadsheet as a CSV file, you have all of the same problems plus the addition problem of dealing with only occasional quoted identifiers on text-based items that have a comma in them. Huge PITA compared to reading from the Excel file directly using ACE Drivers.

    Actually, even then, you don't have all the same problems.

    Knowing your SSIS proficiency level, you'll have to trust me on this:-) An important difference is that, with flat files in SSIS, you can set the datatype of all the underlying columns manually, rather than have the Excel driver sometimes insist that you are importing a numeric() when you know it should be a varchar(), then throw an error when it hits the (valid) string on row 5,000. That Excel driver is a steaming purulent carbuncle.

    The IMEX parameter is the solution for that particular problem.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/20/2016)


    Phil Parkin (10/20/2016)


    Jeff Moden (10/20/2016)


    Phil Parkin (10/20/2016)


    First point to note is that a CSV file is a far far better thing to be importing from than an Excel file.

    "It Depends". If someone exported a spreadsheet as a CSV file, you have all of the same problems plus the addition problem of dealing with only occasional quoted identifiers on text-based items that have a comma in them. Huge PITA compared to reading from the Excel file directly using ACE Drivers.

    Actually, even then, you don't have all the same problems.

    Knowing your SSIS proficiency level, you'll have to trust me on this:-) An important difference is that, with flat files in SSIS, you can set the datatype of all the underlying columns manually, rather than have the Excel driver sometimes insist that you are importing a numeric() when you know it should be a varchar(), then throw an error when it hits the (valid) string on row 5,000. That Excel driver is a steaming purulent carbuncle.

    The IMEX parameter is the solution for that particular problem.

    IMEX = 1? Still doesn't work if your string data is way down the file, unless you hack the registry to increase Excel's 'number of rows to check' sample size. Or maybe you know something I don't.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Thanks Phil, on with your suggestion now

    - Damian

  • Phil Parkin (10/20/2016)


    Jeff Moden (10/20/2016)


    Phil Parkin (10/20/2016)


    Jeff Moden (10/20/2016)


    Phil Parkin (10/20/2016)


    First point to note is that a CSV file is a far far better thing to be importing from than an Excel file.

    "It Depends". If someone exported a spreadsheet as a CSV file, you have all of the same problems plus the addition problem of dealing with only occasional quoted identifiers on text-based items that have a comma in them. Huge PITA compared to reading from the Excel file directly using ACE Drivers.

    Actually, even then, you don't have all the same problems.

    Knowing your SSIS proficiency level, you'll have to trust me on this:-) An important difference is that, with flat files in SSIS, you can set the datatype of all the underlying columns manually, rather than have the Excel driver sometimes insist that you are importing a numeric() when you know it should be a varchar(), then throw an error when it hits the (valid) string on row 5,000. That Excel driver is a steaming purulent carbuncle.

    The IMEX parameter is the solution for that particular problem.

    IMEX = 1? Still doesn't work if your string data is way down the file, unless you hack the registry to increase Excel's 'number of rows to check' sample size. Or maybe you know something I don't.

    I'll have to try longer spreadsheets because I've not run into that problem using IMEX=1, which is claimed to force everything in the spreadsheet to character-based values.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/21/2016)


    Phil Parkin (10/20/2016)


    Jeff Moden (10/20/2016)


    Phil Parkin (10/20/2016)


    Jeff Moden (10/20/2016)


    Phil Parkin (10/20/2016)


    First point to note is that a CSV file is a far far better thing to be importing from than an Excel file.

    "It Depends". If someone exported a spreadsheet as a CSV file, you have all of the same problems plus the addition problem of dealing with only occasional quoted identifiers on text-based items that have a comma in them. Huge PITA compared to reading from the Excel file directly using ACE Drivers.

    Actually, even then, you don't have all the same problems.

    Knowing your SSIS proficiency level, you'll have to trust me on this:-) An important difference is that, with flat files in SSIS, you can set the datatype of all the underlying columns manually, rather than have the Excel driver sometimes insist that you are importing a numeric() when you know it should be a varchar(), then throw an error when it hits the (valid) string on row 5,000. That Excel driver is a steaming purulent carbuncle.

    The IMEX parameter is the solution for that particular problem.

    IMEX = 1? Still doesn't work if your string data is way down the file, unless you hack the registry to increase Excel's 'number of rows to check' sample size. Or maybe you know something I don't.

    I'll have to try longer spreadsheets because I've not run into that problem using IMEX=1, which is claimed to force everything in the spreadsheet to character-based values.

    Jeff, please read this thread, particularly Todd McDermid's response, in which he explains the problem with IMEX=1.

    The only near-bulletproof way I have found of doing this is to ensure that

    a) IMEX=1 appears in the connection string, and

    b) I read the column headers as data

    Doing this forces Excel to treat all of the columns as text (assuming your column headers are not numeric!).

    This brings its own overheads, of course (have to dispose of column headers, have to name columns manually based on ordinal, have to specifically convert to numeric/date where appropriate). But at least all of the data actually gets pulled in this way.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Phil Parkin (10/21/2016)


    Jeff Moden (10/21/2016)


    Phil Parkin (10/20/2016)


    Jeff Moden (10/20/2016)


    Phil Parkin (10/20/2016)


    Jeff Moden (10/20/2016)


    Phil Parkin (10/20/2016)


    First point to note is that a CSV file is a far far better thing to be importing from than an Excel file.

    "It Depends". If someone exported a spreadsheet as a CSV file, you have all of the same problems plus the addition problem of dealing with only occasional quoted identifiers on text-based items that have a comma in them. Huge PITA compared to reading from the Excel file directly using ACE Drivers.

    Actually, even then, you don't have all the same problems.

    Knowing your SSIS proficiency level, you'll have to trust me on this:-) An important difference is that, with flat files in SSIS, you can set the datatype of all the underlying columns manually, rather than have the Excel driver sometimes insist that you are importing a numeric() when you know it should be a varchar(), then throw an error when it hits the (valid) string on row 5,000. That Excel driver is a steaming purulent carbuncle.

    The IMEX parameter is the solution for that particular problem.

    IMEX = 1? Still doesn't work if your string data is way down the file, unless you hack the registry to increase Excel's 'number of rows to check' sample size. Or maybe you know something I don't.

    I'll have to try longer spreadsheets because I've not run into that problem using IMEX=1, which is claimed to force everything in the spreadsheet to character-based values.

    Jeff, please read this thread, particularly Todd McDermid's response, in which he explains the problem with IMEX=1.

    The only near-bulletproof way I have found of doing this is to ensure that

    a) IMEX=1 appears in the connection string, and

    b) I read the column headers as data

    Doing this forces Excel to treat all of the columns as text (assuming your column headers are not numeric!).

    This brings its own overheads, of course (have to dispose of column headers, have to name columns manually based on ordinal, have to specifically convert to numeric/date where appropriate). But at least all of the data actually gets pulled in this way.

    Good stuff, Phil. Thanks for the link. That explains why I've never had a problem with such a thing. I do both of those things most of the time and, when I don't, the spreadsheets are well-formed and don't have mixed column types except for the header.

    As with (apparently) many folks, I had a basic misunderstanding of what IMEX actually does. According to Todd's line (which is broken and I had to do a little search on DougBert's site to find it), IMEX does NOT treat everything like text. It only treats mixed-type columns as text and suffers from the same 8 row sample limit as everything else.

    As I recalled, there is a connection string setting to change the sample row limit (MaxScanRows), Supposedly, it can be set from 1-16 but some claim that if you set it to 0, that it will sample all rows in the spreadsheet. Of course, that will also take some extra time because that would be the first of two complete reads of the spreadsheet. Unfortunately, some claim the MaxScanRows extended is broken even in the current version of the driver. You've just got to know I'll be testing that. 😀

    There's also the claim that the ACE driver has a problem with spreadsheets over 32k rows. Most of that claim is has to do with VB scripts where the driver doesn't recognize an EOF when using a VB loop to read the rows and, therefor, loops forever. Of course, I'll check that out, as well.

    Again, thanks for the heads-up about even IMEX=1 not working because of the default MaxScanRows ("TypeGuesRows" setting in the registry). I've given a couple of presentations on how to auto-magically overcome the problem of importing some pretty complex spreadsheets that have columns and rows added every month (including overcoming the problem of having to manual name columns). I'll add my test findings to that and try to get an article out on all of it.

    As you might also guess, it also demonstrates that SSIS might be the wrong choice for importing spreadsheets. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/21/2016)


    Phil Parkin (10/21/2016)


    Jeff Moden (10/21/2016)


    Phil Parkin (10/20/2016)


    Jeff Moden (10/20/2016)


    Phil Parkin (10/20/2016)


    Jeff Moden (10/20/2016)


    Phil Parkin (10/20/2016)


    First point to note is that a CSV file is a far far better thing to be importing from than an Excel file.

    "It Depends". If someone exported a spreadsheet as a CSV file, you have all of the same problems plus the addition problem of dealing with only occasional quoted identifiers on text-based items that have a comma in them. Huge PITA compared to reading from the Excel file directly using ACE Drivers.

    Actually, even then, you don't have all the same problems.

    Knowing your SSIS proficiency level, you'll have to trust me on this:-) An important difference is that, with flat files in SSIS, you can set the datatype of all the underlying columns manually, rather than have the Excel driver sometimes insist that you are importing a numeric() when you know it should be a varchar(), then throw an error when it hits the (valid) string on row 5,000. That Excel driver is a steaming purulent carbuncle.

    The IMEX parameter is the solution for that particular problem.

    IMEX = 1? Still doesn't work if your string data is way down the file, unless you hack the registry to increase Excel's 'number of rows to check' sample size. Or maybe you know something I don't.

    I'll have to try longer spreadsheets because I've not run into that problem using IMEX=1, which is claimed to force everything in the spreadsheet to character-based values.

    Jeff, please read this thread, particularly Todd McDermid's response, in which he explains the problem with IMEX=1.

    The only near-bulletproof way I have found of doing this is to ensure that

    a) IMEX=1 appears in the connection string, and

    b) I read the column headers as data

    Doing this forces Excel to treat all of the columns as text (assuming your column headers are not numeric!).

    This brings its own overheads, of course (have to dispose of column headers, have to name columns manually based on ordinal, have to specifically convert to numeric/date where appropriate). But at least all of the data actually gets pulled in this way.

    Good stuff, Phil. Thanks for the link. That explains why I've never had a problem with such a thing. I do both of those things most of the time and, when I don't, the spreadsheets are well-formed and don't have mixed column types except for the header.

    As with (apparently) many folks, I had a basic misunderstanding of what IMEX actually does. According to Todd's line (which is broken and I had to do a little search on DougBert's site to find it), IMEX does NOT treat everything like text. It only treats mixed-type columns as text and suffers from the same 8 row sample limit as everything else.

    As I recalled, there is a connection string setting to change the sample row limit (MaxScanRows), Supposedly, it can be set from 1-16 but some claim that if you set it to 0, that it will sample all rows in the spreadsheet. Of course, that will also take some extra time because that would be the first of two complete reads of the spreadsheet. Unfortunately, some claim the MaxScanRows extended is broken even in the current version of the driver. You've just got to know I'll be testing that. 😀

    There's also the claim that the ACE driver has a problem with spreadsheets over 32k rows. Most of that claim is has to do with VB scripts where the driver doesn't recognize an EOF when using a VB loop to read the rows and, therefor, loops forever. Of course, I'll check that out, as well.

    Again, thanks for the heads-up about even IMEX=1 not working because of the default MaxScanRows ("TypeGuesRows" setting in the registry). I've given a couple of presentations on how to auto-magically overcome the problem of importing some pretty complex spreadsheets that have columns and rows added every month (including overcoming the problem of having to manual name columns). I'll add my test findings to that and try to get an article out on all of it.

    As you might also guess, it also demonstrates that SSIS might be the wrong choice for importing spreadsheets. 😛

    Now that sounds like a worthwhile article!

    I have a revision to your final sentence: It also demonstrates that spreadsheets are certainly might be the wrong choice for transferring data between systems.

    Having said that, I have developed a metadata-driven solution in SSIS for importing spreadsheets with varying sheet names, column names and number of header rows. Can't say it was fun to write, but it's good now that it's finished.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Just to be sure, ol' friend... please don't ever take my sometimes (usually?) zealous contrarian attitude towards SSIS personally. Without boring you with the ugly details, I had a really bad experience on the subject with some folks in the past and I've apparently not gotten over that particular bit of history. In the right hands, just like any tool, I imagine that SSIS/SSDT is quite the tool.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/21/2016)


    Just to be sure, ol' friend... please don't ever take my sometimes (usually?) zealous contrarian attitude towards SSIS personally. Without boring you with the ugly details, I had a really bad experience on the subject with some folks in the past and I've apparently not gotten over that particular bit of history. In the right hands, just like any tool, I imagine that SSIS/SSDT is quite the tool.

    Not at all, sir. I enjoy our SSIS banter. And I'm staying well away from the hand/tool innuendo.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply