EXCEL HELL!!!

  • Jeff Moden (3/2/2015)


    sqldriver (3/2/2015)


    This is a great day! I get to tell people I'm as good at something in SQL as Jeff Moden, even if it is SSIS. 😀

    😀

    I'll never be as good as Jeff at SSIS. I looked at it briefly a couple of years ago, and my reaction was something like "Why on earth would anyone ever want to use this pile of *&^% ?"

    Tom

  • TomThomson (3/8/2015)


    Jeff Moden (3/2/2015)


    sqldriver (3/2/2015)


    This is a great day! I get to tell people I'm as good at something in SQL as Jeff Moden, even if it is SSIS. 😀

    😀

    I'll never be as good as Jeff at SSIS. I looked at it briefly a couple of years ago, and my reaction was something like "Why on earth would anyone ever want to use this pile of *&^% ?"

    BWAAAA-HAAAA!!!! Don't hold back, Tom. Say what you really mean! 😛

    With due respect to people that do use it, my only experiences with it are writing stored procs to take care of what it can't do (or so they said... I don't personally know for sure) and converting packages to T-SQL so people can take the SSIS server they have offline because they ultimately found it easier just to do what they needed with Stored Procedures.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • TomThomson (3/8/2015)


    Jeff Moden (3/2/2015)


    sqldriver (3/2/2015)


    This is a great day! I get to tell people I'm as good at something in SQL as Jeff Moden, even if it is SSIS. 😀

    😀

    I'll never be as good as Jeff at SSIS. I looked at it briefly a couple of years ago, and my reaction was something like "Why on earth would anyone ever want to use this pile of *&^% ?"

    It's like someone looked at Visual Studio and thought to themselves "needs more Visio; not enough frustrating line connectors."

    That being said, it would be nice if either tool visualized T-SQL into control-flow diagrams, especially for inherited code that looks like someone made ANSI spaghetti.

  • Jeff Moden (3/9/2015)


    TomThomson (3/8/2015)


    Jeff Moden (3/2/2015)


    sqldriver (3/2/2015)


    This is a great day! I get to tell people I'm as good at something in SQL as Jeff Moden, even if it is SSIS. 😀

    😀

    I'll never be as good as Jeff at SSIS. I looked at it briefly a couple of years ago, and my reaction was something like "Why on earth would anyone ever want to use this pile of *&^% ?"

    BWAAAA-HAAAA!!!! Don't hold back, Tom. Say what you really mean! 😛

    With due respect to people that do use it, my only experiences with it are writing stored procs to take care of what it can't do (or so they said... I don't personally know for sure) and converting packages to T-SQL so people can take the SSIS server they have offline because they ultimately found it easier just to do what they needed with Stored Procedures.

    I use both personally.

    My ETL system is designed entirely with stored procedures. I did the same work in SSIS when I was learning how to make packages just to see for myself if there were benefits outside of the articles I read online.

    I created the same ETL system in SSIS faster than I did with stored procedures. This was mainly because I saved time dragging and dropping elements onto a canvas, along with being able to see my workflow visually from start to finish. On top of that, it's easier for me to add constraints to each object visually, along with point-and-click redirects on warnings/failures.

    Does that mean SSIS is faster for everything or the better option? Na. I still like having stored procedures drive the entire ETL system because it's easier to manage and update versus a damn package that has to be opened with a separate tool and compiler. I also like the fact that it's all in one language.

    But, make no mistake, I do fear future enhancements to my ETL system that SSIS handles better and makes easier, like the additional services such as email, FTP and so forth.

  • Jeff

    I apologize for the too short and provocative statements.

    As it has been a while since I last used an ACE driver, I did a test with a simple Excel file on a laptop running Windows 7, Excel 2013 and SQL Server 2014 Express. I will not give all the details. It did not work from the beginning, and without Google I doubt I would have reached the goal. I may be a nut, but who knows that one has to change the default authorization on the SQLEXPRESS service even when you are logged on as an administrator?

    When I save a sheet as a tab-separated text file I do not get any extra characters around a comma. A comma can cause problems when it is used as the decimal separator and you insert a number.

    Excel has to be used with care. I am not saying that a common user should use any homemade macro to insert data into a database.

    Back to Excel HELL. Jeff's example triggered me to do some coding in VBA. Don't see my example as a ready application but as a working prototype. In reality the database is not hosted on the same PC as the Excel installation where the application runs. The data is put in an Excel table, which gives you high flexibility as you can insert a new row for a student and a new column for a new month/subject. (In reality I would not design the user interface like this.) Also included are two basic reports. (PowerPivot is a better tool.)

    If you would like to test the code, open the attached Excel file in Excel 2013 (not tested in Excel 2010, but it should work). Create a module in the VBA project and paste in the code. You need to add a reference (Tools > References) to the Microsoft ActiveX Data Objects Library; I use 2.8.

    If you have problems please contact me. I will be away for the next 5 days without a computer.

    Gosta M

    Usually I don't include SQL statements in the VBA code; I call stored procedures instead. This is just simple demo code.

    As I cannot attach an .xlsm file (with a macro) I include the code:

    Option Explicit

    Sub Test()

        Dim conn As New ADODB.Connection
        Dim cmd As New ADODB.Command
        Dim rs As New ADODB.Recordset
        Dim ConnString, name, id, status, subject, year, month, period As String
        Dim k, l, score As Integer
        Dim dataarray As Variant

        'Change Data Source to your own server
        'Change Initial Catalog to your own database
        ConnString = "Provider=SQLOLEDB;Data Source=GOSTAMUNKTELL\SQLEXPRESS;" & _
                     "Initial Catalog=Rapport;" & _
                     "Integrated Security=SSPI;"
        conn.Open ConnString
        Set cmd.ActiveConnection = conn

        'Create the Test table (one shot; leave it commented out after the first run)
        cmd.CommandText = " CREATE TABLE [dbo].[Test]( " & _
                          "[Name] [nvarchar](50) NULL, " & _
                          "[ID] [nvarchar](50) NULL, " & _
                          "[STATUS] [nvarchar](50) NULL, " & _
                          "[PERIOD] [nvarchar](50) NULL, " & _
                          "[SUBJECT] [nvarchar](50) NULL, " & _
                          "[SCORE] [int] NULL " & _
                          ") ON [PRIMARY]"
        'cmd.Execute    'Inactivated

        'Empty the table before each upload
        cmd.CommandText = "TRUNCATE TABLE test"
        cmd.Execute

        'Create an array with as many rows as there are columns in "Table" and two "columns":
        'one for the month (row 1 of "Table") and one for the subject (row 2 of "Table")
        ReDim dataarray(Range("Table").Columns.Count, 2)
        For k = 4 To Range("Table").Columns.Count
            If Range("Table").Cells(1, k).Value <> "" Then
                month = Range("Table").Cells(1, k).Value
            End If
            dataarray(k, 0) = month
            dataarray(k, 1) = Range("Table").Cells(2, k).Value  'Subject
        Next k

        'Walk the data rows and insert one row per student/month/subject.
        'Note: values are concatenated straight into the SQL text; that is fine for a
        'prototype, but a parameterised command or stored procedure is safer.
        For k = 3 To Range("Table").Rows.Count
            name = Range("Table").Cells(k, 1).Value
            id = Range("Table").Cells(k, 2).Value
            status = Range("Table").Cells(k, 3).Value
            For l = 4 To Range("Table").Columns.Count
                period = dataarray(l, 0) & " " & Range("Year").Cells(1, 1).Value
                subject = dataarray(l, 1)
                score = Range("Table").Cells(k, l).Value
                cmd.CommandText = "INSERT INTO test VALUES " & _
                    "('" & name & "','" & id & "','" & status & "','" & period & "','" & subject & "'," & score & ")"
                cmd.Execute
            Next l
        Next k

        'Create a total report per month + subject
        cmd.CommandText = "SELECT [PERIOD] + ' ' + [SUBJECT], " & _
                          "SUM(score) " & _
                          "FROM [test] " & _
                          "GROUP BY [PERIOD] + ' ' + [SUBJECT] " & _
                          "ORDER BY [PERIOD] + ' ' + [SUBJECT] "
        rs.Open cmd
        Range("Total_score_per_month_subject").CopyFromRecordset rs
        rs.Close

        'Create a total report per name and subject
        cmd.CommandText = "SELECT [NAME], [SUBJECT], " & _
                          "SUM(score) " & _
                          "FROM [test] " & _
                          "GROUP BY [NAME], [SUBJECT] " & _
                          "ORDER BY [NAME], [SUBJECT] "
        rs.Open cmd
        Range("Total_score_per_name").CopyFromRecordset rs
        rs.Close

        conn.Close
        Set conn = Nothing
        Erase dataarray   'dataarray is a Variant array, so Erase rather than Set ... = Nothing

    End Sub
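
    If you prefer the stored-procedure route I mention above, a minimal sketch of what such a procedure could look like for the Test table is below. The procedure name and parameter list are only a suggestion for this prototype, not something the attached workbook uses:

    -- Minimal sketch: a parameterised insert procedure for the dbo.Test table above.
    -- The name dbo.Test_Insert is only a suggestion; adjust it to your own conventions.
    CREATE PROCEDURE dbo.Test_Insert
        @Name    nvarchar(50),
        @ID      nvarchar(50),
        @Status  nvarchar(50),
        @Period  nvarchar(50),
        @Subject nvarchar(50),
        @Score   int
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT INTO dbo.Test ([Name], [ID], [STATUS], [PERIOD], [SUBJECT], [SCORE])
        VALUES (@Name, @ID, @Status, @Period, @Subject, @Score);
    END;

    From the VBA side the call would then go through an ADODB.Command with CommandType = adCmdStoredProc and Parameters.Append, which also avoids the quoting problems of building the INSERT string by hand.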

  • Thanks for your efforts, Gosta. I really appreciate it.

    If we take a look at the following section of the code...

    cmd.CommandText = " CREATE TABLE [dbo].[Test]( " & _

    "[Name] [nvarchar](50) NULL, " & _

    "[ID] [nvarchar](50) NULL, " & _

    "[STATUS] [nvarchar](50) NULL, " & _

    "[PERIOD] [nvarchar](50) NULL, " & _

    "[SUBJECT] [nvarchar](50) NULL, " & _

    "[SCORE] [int] NULL " & _

    ") ON [PRIMARY]"

    Not that that's bad but it defines the very problem that I have with all such code. Quite literally, you have to at least look at the Excel spreadsheet, determine these names, and then hard code them making the code good for only one spreadsheet. You'd have to write code for every spreadsheet you want to import. Again, not necessarily a bad thing but inconvenient because of the requirement to write code for every spreadsheet and, if the user(s) add a column to the "common" columns of Name, ID, and Status, you'd also have to at least modify existing code to handle the change. Such an addition would also require a minor change to the FOR loops in the code.

    What I've been able to do (and I'm working on getting an article published on the subject) is I've been able to make all of that auto-magic. The import code doesn't need to be changed at all just because the user makes such a change.

    As for the commas, that must be a change in what they've done in Excel 2013 because when I export in Excel 2010, I get the double quotes around any "cell" that has a comma in it, as well as leading "blank lines" filled with tabs if the top left corner of the user range isn't cell A1. What that means is that I want my code to handle all those possibilities and the ACE driver does handle that auto-magically.

    There's also the extra problem of teaching a user how to export as a TSV and then remembering to actually do it and to not save over their Excel file by mistake. A lot of extra work can be saved simply by avoiding the export to a file and just having the user send their Excel file. That will also aid in any troubleshooting and recommendations I might have for the user because I'm looking at exactly the same thing that they're looking at.

    Last but not least, there's the continuing problem of the proverbial "Tower of Babel" because in order to use VBA, you have to understand how to program in it. Most people that work with SQL Server all understand T-SQL and so if all such things are written in T-SQL, the folks writing such code need no special skills other than writing some good T-SQL. And, it's not difficult. The T-SQL code I've written has 4 small sections...

    1. Import the data directly from the spreadsheet into a staging table.

    2. Populate a meta-data table with the main column names that are listed in the spreadsheet.

    3. "Smear" the "data partition names" over the meta-data so that they are available to each "column" listed in the meta-data.

    4. Do an unpivot of the data automatically, including any of the "common" column data on each row, to build a full EAV without the programmer ever having to look at the spreadsheet (a rough sketch of this step appears below).

    The neat thing here is the generic stored procedure will do this for ANY spreadsheet that follows the 2 row partition name convention (even if they're not date related) and I'm working on making the code figure that out automatically, as well.

    I'd post the code here but I'd rather wait until the article comes out.
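
    Just to give a rough idea of the shape of step 4, though, a simplified sketch is below. This is NOT the code from the article... the staging table and month/subject column names are made up for illustration, and the generic procedure builds the CROSS APPLY list dynamically from the meta-data table instead of hard coding it:

    -- Illustration only: hard-coded names stand in for what the generic procedure
    -- generates dynamically from the meta-data table.
    SELECT  s.[Name], s.[ID], s.[Status],
            ca.Period, ca.Subject, ca.Score
      FROM  dbo.Staging s
     CROSS APPLY (VALUES ('Jan 2015', 'Math'   , s.Jan_Math)
                        ,('Jan 2015', 'English', s.Jan_English)
                        ,('Feb 2015', 'Math'   , s.Feb_Math)
                 ) ca (Period, Subject, Score)
     WHERE  ca.Score IS NOT NULL;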

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Gosta Munktell (3/10/2015)


    Jeff

    I apologize for the too short and provocative statements.

    On that subject, you're a good man for coming back on that. Thanks a huge amount, Gosta.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff

    "Not that that's bad but it defines the very problem that I have with all such code. Quite literally, you have to at least look at the Excel spreadsheet, determine these names, and then hard code them making the code good for only one spreadsheet. You'd have to write code for every spreadsheet you want to import. Again, not necessarily a bad thing but inconvenient because of the requirement to write code for every spreadsheet and, if the user(s) add a column to the "common" columns of Name, ID, and Status, you'd also have to at least modify existing code to handle the change. Such an addition would also require a minor change to the FOR loops in the code."

    I only included that part to make it simple for you to create the table (that I use) in this prototype.

    Tables will never, at least by me, be created from Excel VBA.

    Gosta M

  • "What I've been able to do (and I'm working on getting an article published on the subject) is I've been able to make all of that auto-magic. The import code doesn't need to be changed at all just because the user makes such a change."

    My purpose with the example was to solve the initial task. I take off my hat to your effort to make a more general solution. I look forward to seeing your article.

    I will not argue about VBA versus T-SQL. In this forum most people use T-SQL and want to see a complete solution in T-SQL. But will/can T-SQL replace VBA? I have more examples where I think a combination of T-SQL and VBA is the most efficient way to solve the task.

    Gosta M

  • TomThomson (3/8/2015)


    Jeff Moden (3/2/2015)


    sqldriver (3/2/2015)


    This is a great day! I get to tell people I'm as good at something in SQL as Jeff Moden, even if it is SSIS. 😀

    😀

    I'll never be as good as Jeff at SSIS. I looked at it briefly a couple of years ago, and my reaction was something like "Why on earth would anyone ever want to use this pile of *&^% ?"

    I think that's a bit harsh! Sure, SSIS has its shortcomings (one of which is the almost indecipherable error messages), but there are a huge number of Data Warehouse and ETL developers around the world using it every day to accomplish complex tasks. The problem comes when people insist on using SSIS simply to move data from one place to another with no changes, and they do that because it's what they know. As I've said before, use the right tool for the right job.

    Regards

    Lempster

  • Interesting topic, with interesting contributions

    We too get .xls, .xlsx and .csv files in ever-changing formats and need to feed them into a SQL Server DB.

    It's healthcare data, coming in from multiple suppliers as multiple submission types (e.g. referrals, admissions, tests, billing specifications, performance statistics, etc.)

    A few remarks from that experience:

    1.

    You need to convince your suppliers to adopt some standardisation per submission type, at least per supplier, but even after that you shouldn't be surprised if suddenly new columns appear and other columns disappear.

    On top of that you may get a mix (even a changing mix) of monthly and YTD (year-to-date) data, with occasionally nothing and then a few months in the same file, potentially causing duplications if they already submitted that month previously.

    Multiple submissions for a month aren't necessarily the same; suppliers extract this from live systems, and data may get updated long after a month is 'closed' (talking about healthcare data).

    Totally unbiased, I usually choose the version that will give the most advantageous results (best performance, lowest/highest possible bill) for the side I'm working for. ;^)

    But maybe you'll have to consult the suppliers to find out what data they consider right; you don't have to agree with their choice, but at least you'll be talking about differences from the same baseline.

    2.

    Converting to CSV with Excel VBA is a bit tricky, to say the least, as you may lose precision due to the cell formatting in Excel.

    The format [#.####################;-#.####################;0;@] (with the square brackets) seems to work for me.

    After that you'll have to rebuild your numbers, dates and times in SQL.

    Also you may have to standardise first, e.g. flatten pivots, demerge cells, remove empty rows and empty columns, etc., even transform from .xls to .xlsx first to enable flattening of pivots with 65535+ lines.

    After that import is easy and fast with bcp or BULK INSERT, but you still need to 'semi-automate' column alignment across months

    Maybe there are other considerations, like how to handle dates and times in data from different time zones

    3.

    Using spreadsheets as a linked server (or using OPENROWSET) is not always possible, as some sysadmins don't allow the required configuration features.

    (In which case you better return the assignment, as SSIS just causes even more hell, when formats change monthly, as some of you already noted above)

    And it is a bit tricky too, for the same reasons as above under 2. ;^)

    Additionally you may need to add some dummy rows at the top to ensure the provider doesn't recognise a column as a number or date (and then NULL everything that doesn't match).

    That number of rows depends on a default in the registry, which you can change quite safely, if your sysadmins will allow it.

    In short you may have to use VBA (or maybe powershell) to automate 'standardisation' of the files first.

    Combined you probably can reduce the manual processing to exception handling, and eventually you may even reject the exceptions and ask the supplier for submissions in the right formats.

    My conclusion thus far: you can go a long way using OPENROWSET and the Excel files as they arrive, but it all depends, and if I have to open the files first anyway to standardise them, I might as well save them as CSV and use BULK INSERT. That also saves me wasting time handling over-zealous sysadmins.
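
    For the record, the two routes I'm comparing look roughly like this; the file paths, sheet and staging table names are placeholders, and the first route needs 'Ad Hoc Distributed Queries' enabled plus the ACE OLE DB provider installed:

    -- Route 1: query the workbook in place via the ACE provider
    -- (IMEX=1 reduces, but does not eliminate, the type-guessing problems mentioned above).
    SELECT *
      FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
                      'Excel 12.0;Database=C:\Inbox\Submission.xlsx;HDR=YES;IMEX=1',
                      'SELECT * FROM [Sheet1$]');

    -- Route 2: save the sheet as CSV first, then bulk load it into a staging table.
    BULK INSERT dbo.Staging
      FROM 'C:\Inbox\Submission.csv'
      WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);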

    I would be pleasantly surprised if anyone, anywhere has cracked this problem and came up with a 'universal' solution.

    Just my 2c

  • pharryecoenen (6/18/2015)


    My conclusion thus far: you can go a long way using openrowset, and the excel files as they arrive, but it all depends, and if I have to open the files first anyway to standardise them, I might as well save them as CSV and use BULK INSERT. Also saves me wasting time on handling over-zealous sysadmins.

    I would be pleasantly surprised if anyone, anywhere has cracked this problem and came up with a 'universal' solution.

    Part of the reason I posted my original question was to make sure that there wasn't some brilliantly magical self-healing automatic method in SSIS before I started to write a presentation on "Automating Excel Hell". I've actually created a bit of T-SQL where not only do you NOT need to pre-standardize a spreadsheet, but you don't even need to know the column header names and it can be used on many "types" of spreadsheets.

    Based on the fact that there wasn't a magical SSIS method for all that (which I learned from this thread), I wrote one in T-SQL and made a "trial" presentation of the technique for my local PASS chapter. I'm converting that to an article.

    This first blush isn't truly "universal" but it eliminates a whole lot of pain on many spreadsheets that have a similar form. It's even self-healing when they add new months with multiple columns for each month as well as easily handling the dropping or addition of new columns for each month.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff

    Thanks for your reply.

    I've seen your stuff at detroit.sqlpass.org and as usual learned a few tricks from it, before trying it.

    Works great on 2008 R2 after replacing (local) with <<ServerName>>, in turn pulled from @@SERVERNAME.

    It breaks on 2014, as that needs explicit WITH RESULT SETS, so I'll have to figure out how that can be made to work with dynamic SQL using a varying number of columns.
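
    For anyone else who hits this, the kind of declaration 2014 insists on looks roughly like the sketch below; the procedure and column names are just an example, and the real difficulty is that the column list changes from spreadsheet to spreadsheet:

    -- Example only: declaring the result shape explicitly so the newer metadata
    -- discovery (sp_describe_first_result_set) is satisfied. Hypothetical names.
    SELECT *
      FROM OPENROWSET('SQLNCLI', 'Server=MyServer;Trusted_Connection=yes;',
                      'EXEC dbo.ImportSpreadsheet @FileName = ''C:\Inbox\Scores.xlsx''
                       WITH RESULT SETS ((StudentName nvarchar(50),
                                          Period      nvarchar(50),
                                          Subject     nvarchar(50),
                                          Score       int));');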

    For your and others' scenarios the 'long way' gets you where you want to be; your work is great and probably all they need.

    Cheers

    Harry

  • I looked at it briefly a couple of years ago, and my reaction was something like "Why on earth would anyone ever want to use this pile of *&^% ?"

    I used DTS for years. SSIS was a steep learning curve, but having learned a great deal about it, I'd only go back to DTS kicking and screaming. SSIS has a couple of glaring shortcomings, but I've been able to do quite a bit with it.

  • pharryecoenen (6/19/2015)


    Hi Jeff

    Thanks for your reply.

    I've seen your stuff at detroit.sqlpass.org and as usual learned a few tricks from it, before trying it.

    Works great on 2008 R2 after replacing (local) with <<ServerName>>, in turn pulled from @@SERVERNAME.

    It breaks on 2014, as that needs explicit WITH RESULT SETS, so I'll have to figure out how that can be made to work with dynamic SQL using a varying number of columns.

    For your and others' scenarios the 'long way' gets you where you want to be; your work is great and probably all they need.

    Cheers

    Harry

    Heh... yeah. It's amazing how "improvements" keep breaking things. I'm one of those that "live far from the edge" so I've not even loaded up 2014 on my PC yet. Thank you for the heads up.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
