It's the little things that kill...

  • Ever notice how the smallest things can make the biggest difference? My most recent example involves a process that moves orders from one database to another. It used a table to track which orders it had processed. I made a change last Friday, which required a change to the query that inserted rows into that table. Inadvertantly, most likely out of habit, I used a LEFT JOIN rather than the OUTER JOIN I should have. As a result, orders that when through one step of the process, but not the main step, where not being logged to this table. Approximately 1,200 orders were created over the weekend from what was actually 10 orders. A huge mess, the root of which was fixed by simply changing a LEFT JOIN to an OUTER JOIN.

    Another recent example involved a shipping quoting package in which I neglected a set of parenthesis, causing an order-of-precedence error to miscalculate rates for a whole classification of packages (if it had done it for everything, I would have noticed much sooner. That error was fixed by adding just two characters: ( and ).

    I've had a few other such examples that I don't remember as well, but I've noticed the pattern that most of the errors I've fixed recently involve trivial syntax errors - with monumentously erroneous results.

    It reminds me of when I used to program a lot of C++, where accidenlty typing 1 instead of i had the potential to cause an infinite loop, that was next to impossible to find unless you loooked very, very closely (which is consequently why I no longer use "i" as a iterator variable in C++ - it just plain looks too much like 1).

    These type of errors seem to be on the rise for me - which I'm not sure if that's a good thing or a bad thing. On the good side, it could mean that I'm not getting as many severe logic errors. On the bad side, it could mean I need to pay attention a little bit more. It's hard to say, really, but I doubt even the most careful examination will eliminate every one of these. All of the errors I spoke of previously actually made it through testing before failing in production. Due diligence was paid, but no program will ever be bug free. I guess there can be far worse problems than a bug that can be fixed in 10 minutes or less...

    Anyone have any such examples? Have you had a process or program brought to its knees by a few misplaced characters?

    --J

  • I'm a bit confused. A LEFT JOIN is an OUTER JOIN. The full syntax would be LEFT OUTER JOIN.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Very sorry. I typed OUTER when I meant FULL. You're right, they were both OUTER joins, but I needed data from the right table, regardless of the presence of data in the left table.

    --J

  • Going from a Left Outer Join to a Full outer join is not a trivial change and is a completely different piece of work.

    Post the Execution plan and DDL for further advice.



    Clear Sky SQL
    My Blog[/url]

  • Well the query was 6 lines long, creating a table with 3 columns. I did have to add a COALESCE on the primary key so that it got the value from the right table if it wasn't in the left. So I guess it was slightly more than a "trivial" change.

    Perhaps it's a bad example...

    --J

  • I've been thinking a lot about how these sorts of things end up happening, and in both circumstances I actually created the situation that allowed it to happen.

    In the former case, the buggy code was actually the only code that was never truly fully tested. Naturally, that's what breaks first. I had actually modified a running process, so for a short while part of the process was running against production while the other part was running against a sandbox. In that case, each part of the process had to log their executions against a table in their own databases. When I brought it all together for production, I had to combine those two seperate steps into one step that logged to the same table. This was done more as an after-thought than anything else. One of the two queries was annexed with a simple (but incorrect) join, which I tested only under normal circumstances. As soon as an abnormal circumstance occurred, it started behaving unpredictably. Lesson learned.

    In the latter case, the query had grown so complex that it was nearly impossible to spot syntax errors (or even logic flaws). Part of it was the scope (which probably could have been managed a little better), but the stored procedure had grown to over 500 lines (although about 100 lines were comments). Nevertheless, spotting an error in that code, and even finding one when it did occur, was a daunting task. I am actually still in the process of factoring that code. But, since it actually does work now, I don't really have the opportunity to change anything without possibly making it worse. If it breaks again (and it probably will), I'll have to do most of the refactoring then.

    So I guess it's really the big things that kill, but the details certainly can not be neglected...

    --J

  • No one is perfect. The most impotent thing for anyone developing code is to test your code completely to make sure it does what you expect.

  • Maybe what is needed is to utilize a method that will allow you to programatically test all the permetations of a procedure. By saving this test, you will be able to easily test that everything works correctly in the future, when changes are made (after you've updated it for new test data if necessary). I saw something in the SQL Saturday I went to recently, but can't think of what it was right now.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • I definitely make no claims towards perfection. In fact, I regularly set the expectation that programs fail, bugs happen, and things happen which aren't exactly predictable. My goal has never been to make the perfect program. That would take far too long, and never truly happen. My goal is to minimize the impact of these bugs, and to reduce the risk of truly catastrophic bugs. With every program, I analyze the "worst-case scenario", supposing the program goes complete haywire, what is the worst that can happen. I focus most of my attention on those kind of bugs.

    With my most recent program, the creation of 1,200 duplicate orders was actually far from the worst-case scenario. The worst-case scenario actually involved scheduling hundreds of orders with the wrong date, then sending order acknowledgement to these customers. This would be a great disservice to our customers, and an incredible burden on our sales team. Compared to that, cancelling 1,200 orders was quite manageable; because I had taken great effort to ensure that an error such as that didn't cause notifications to be sent to customers. Our customers will never know we had such an error. The fellow employees who were affected by this error were actually happy to see me jump on the error and fix it without really involving them. By managing their expectations, they were actually greatful that I swooped in and fixed it as quickly as I did, rather than being angry that the error occurred in the first place. The process was down for about 2 hours during a non-critical time, and besides the presence of 1,200 orders in our ERP with a cancel flag set, the impact was neglegible.

    Still, I'd like to use this as a learning experience, and I think it has been...

    --J

  • WayneS (6/22/2010)


    Maybe what is needed is to utilize a method that will allow you to programatically test all the permetations of a procedure. By saving this test, you will be able to easily test that everything works correctly in the future, when changes are made (after you've updated it for new test data if necessary). I saw something in the SQL Saturday I went to recently, but can't think of what it was right now.

    There's not really one super test that you can test all programs with. However, I typically subscribe to test-driven development, where the acceptance criteria is determined first, and then a test is built to determine that the acceptance criteria is met. Only then do you actually begin coding. In the end, the test should succeed.

    This was however a special case in which I was unable to deploy a single scrap of code to the sandbox. This scrap of code had to be used at the very end to tie the production peice to the peice that had been in the sandbox, which was now ready for production. This would have been fine had I tested it a little bit further right then and there, but it was Friday afternoon and truthfully I was looking forward to leaving on time. Had I taken the time to notice, I would have realized that it was already failing before I left on Friday. I could have cauterized the wound then and there, but neither I, nor any of the parties involved, noticed until Monday morning. Again, lesson learned.... I definitely will not be so hasty next time, and should I need to add code during deployment, I will take more time to test it before walking away.

    --J

  • We live in a world where a misplaced arithmetic operator can cause a fifty million dollar satellite to be destroyed before it ever reaches orbit, and where I don't think anyone can ever fully predict all the possible invalid inputs that users will manage to put into software, so ensuring something will work 100% of the time is near impossible. You can just ensure it works as well as you can and hope nothing really data-destructive makes it past testing!

    To my mind, the main mitigation programmers can apply is to make sure their code is as clean and readable as possible. I'm not even talking comments here, although those are important; I mean lay out the actual code in such a way it's easier to follow. Your suggestion of using something other than "i" as a loop variable is a good one--I never got into that habit, fortunately (although some might argue that my own tendency to use "loop" is not much better, since it doesn't describe what the loop does!).

  • what have you learnt?

    never make any changes to production systems after lunchtime or after thursday

    build LOTS of constraints / checks / safe guards into your apps / db

    make a testing plan for each change

    execute the testing plan on QA platform with a non-technical expert user

    monitor the effects of changes

    put delays between processes to give you time to react to problems

    have a plan if things do go wrong

    treat your systems as if a bug will cost £1,000,000,000 and you will be shot

    one day you might be working on such a system - may as well learn how sooner than later

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply