validate imported spreadsheet data against the original spreadsheet

  • hi All,

    Was asked this at at a job intvw ...left me stumped...doh

    Import an excel spreadsheet into sql server using the import wizard into a target table. How do you

    sanity check/validate the data in the table against the data in the spreadsheet. Is

    there a way apart from eyeballing it row by row ?

    Thx

    Robin

  • Not sure what answer the interviewer was looking for but a couple of things come to my mind. First open the spreadsheet and compare row count from the spreadsheet to the table you imported to. Another and probably more robust idea is to open the spreadsheet and run queries against the excel file. This can be done with OpenQuery or you can even make a direct oledb connection. There are probably some better answers out there but both of these jumped to me.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • If the spreadsheet had a checksum for each row, it would be easy to check that against the imported data by checksumming what comes in vs that.

    Pretty much anything else is going to be eyeballs. If the import engine fails to pull data in correctly, anything that queries the original is probably (almost certainly) going to have the same flaws.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Sean Lange (1/27/2012)


    Not sure what answer the interviewer was looking for but a couple of things come to my mind. First open the spreadsheet and compare row count from the spreadsheet to the table you imported to. Another and probably more robust idea is to open the spreadsheet and run queries against the excel file. This can be done with OpenQuery or you can even make a direct oledb connection. There are probably some better answers out there but both of these jumped to me.

    OpenRowset and a linked server to Excel will give the same flaws in IMEX situations as the import engine in SSIS, since they all use the same data drivers. You'd get a false positive on that check.

    Rowcount will give the number of rows, but not the data quality within the rows.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Good point Gus. At least my first suggestion wasn't flawed. 😛

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Thx guys

    I was caught cold on it. Someone throwing a curved ball my way...may be getting me to think out the box. At least its not an answer from the land of the bleeding obvious...

  • Probably the answer being looked for was "query the Excel file and compare to the table". Most people would, I think, make the same mistake Sean did and assume that would work. It's not obvious that it won't, and seems intuitive that it would. So that's probably the answer the interviewer was looking for, or something similar to that.

    Sean knows the subject of SQL Server pretty well. Something that trips him up is more than likely going to trip up an interviewer.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • In terms of validation: you can run aggregate functions in both the excel file and against the table to make suer you end up with consistent results. This won't give you a perfect result, but it would highight any major issues fairly quickly (i.e. what I would have defined as a sanity check).

    For example - run counts against each column in the spreadsheet; run totals+average against any numeric columns (or date columns for that matter), etc.... You might end up having to compare the two sets of summaries against each other by hand, but it certainly is a lot faster than row by row comparisons.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (#4) (1/27/2012)


    In terms of validation: you can run aggregate functions in both the excel file and against the table to make suer you end up with consistent results. This won't give you a perfect result, but it would highight any major issues fairly quickly (i.e. what I would have defined as a sanity check).

    For example - run counts against each column in the spreadsheet; run totals+average against any numeric columns (or date columns for that matter), etc.... You might end up having to compare the two sets of summaries against each other by hand, but it certainly is a lot faster than row by row comparisons.

    You could do that, but you'd need to do that in Excel, not by using OpenRowset or a linked server.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (1/30/2012)


    Matt Miller (#4) (1/27/2012)


    In terms of validation: you can run aggregate functions in both the excel file and against the table to make suer you end up with consistent results. This won't give you a perfect result, but it would highight any major issues fairly quickly (i.e. what I would have defined as a sanity check).

    For example - run counts against each column in the spreadsheet; run totals+average against any numeric columns (or date columns for that matter), etc.... You might end up having to compare the two sets of summaries against each other by hand, but it certainly is a lot faster than row by row comparisons.

    You could do that, but you'd need to do that in Excel, not by using OpenRowset or a linked server.

    Agreed. It just gets back to validating using an independent process: if you're worried about the Excel-To-SQLDatabase connector screwing up, you want to build a validation that works without the component you're worried about. (I've been burned by input files with blank rows in the middle of the data, so I have had to resort to manually verifiying at times).

    If you are more concerned about some other part of the process than the excel aspect, then openrowset would absolutely be a strong way to go.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (#4) (1/30/2012)


    GSquared (1/30/2012)


    Matt Miller (#4) (1/27/2012)


    In terms of validation: you can run aggregate functions in both the excel file and against the table to make suer you end up with consistent results. This won't give you a perfect result, but it would highight any major issues fairly quickly (i.e. what I would have defined as a sanity check).

    For example - run counts against each column in the spreadsheet; run totals+average against any numeric columns (or date columns for that matter), etc.... You might end up having to compare the two sets of summaries against each other by hand, but it certainly is a lot faster than row by row comparisons.

    You could do that, but you'd need to do that in Excel, not by using OpenRowset or a linked server.

    Agreed. It just gets back to validating using an independent process: if you're worried about the Excel-To-SQLDatabase connector screwing up, you want to build a validation that works without the component you're worried about. (I've been burned by input files with blank rows in the middle of the data, so I have had to resort to manually verifiying at times).

    If you are more concerned about some other part of the process than the excel aspect, then openrowset would absolutely be a strong way to go.

    True enough. And in an interview setting (which is what this came from), pointing that out might get positive points with the interviewer.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I like to use MS Access to do things like this (if it's available). Access can "link" tables to a SQL source as well as the source Excel sheet. Then do a select where not in query going each way. Not perfect, but quick and easy.

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply