SQLServerCentral Article

Packaged-Application Database Nightmares - A Horror Story

,

[WARNING: This content may not be suitable for normalization-loving Database professionals. Viewer caution is advised.]

I'm about to lay out a scenario so horrifying that most DBA will scramble for the cream-puff cuddly warmth of a major-character-dies episode of The Walking Dead.  Indeed, after you read, you may prefer being eaten by ravenous zombies to maintaining the database that is described.  The story you are about to read is or is not true.  The names have been changed to protect the résumés.

The Horror Begins.

Company X purchases a packaged application.  You are the DBA and are fairly new to the company.  After the package is purchased, you stand up an empty database, and the application writes in the schema. When connecting to the server to include the new database into transaction log and daily full backups, you just happen to glance at the new schema. This is what you see:

  • 2000 tables.
  • No defined Foreign Key relationships.
  • No clustered indexes - every table is a heap!
  • No defined Primary keys, instead tables use a nonclustered unique constraint to enforce PK.
  • No indexes on natural primary keys used to join tables.
  • Every column defined as varchar except for some dates that are DATETIME.
  • Some tables with an IDENTITY bigint and a non-clustered unique constraint on the identity.
  • But! Even with the Identity, most of those varchar(25) keys that are sequential numbers are populated out of a table full of "last number used" values..
  • A keyword indexing table for a table full of BLOBS (keyed on a GUID), with 20 keyword fields defined per row.  To add sauce to this goose, this table's 20 columns are used for different values depending on the nature of the BLOB table row to which the rows are linked.
  • second indexing table (written for an enhancement requested by company X) that keeps the exact same schema as the index table above.
  • A table-valued function that uses a UNION (not a UNION ALL) to join the results into a result set; roughly a million rows in one table, and 2 million rows in the other.
  • The app is a package - there is probably nothing you can do to improve the existing tables.
  • The size and configuration of the SQL servers that will host the application's database are presented to you as a fait accompli.

According to Codd and Date, this database is in 0NF - absolutely not normalized.  E. F. Codd resurrects and his zombie offers to eat you out of mercy, and you are sorely tempted to accept.

It gets worse!

The application is highly customized and then goes into production.  You think that you have achieved some comfort in that the new tables that your teammates on the Development side created are third-normal-form and use (mostly) the correct data type for the data in the column.. You are allowed to reconfigure the new tables (and many of the packaged tables) to have clustered indexes on the Identity column(s), and define surrogate primary keys to replace the unique constraints.  But defined FK relationships?  Only on a piece of paper.  

Go-Live comes and hundreds of users and numerous batch processes hit the application and immediately run into crushing performance problems.  How did performance testing miss this?  The trusty Visio flowchart, where every problem scenario terminates at the process named "Blame the database and call the DBAs," is brought out.  You are paged 30 times per day, and start fighting fires.  All. The. Time. Your head hits the pillow at night, and you are seized with fear, and your nightmares of disaster are incessant.

Some may call this "normal application stabilization;" you know better.  You run Pinal Dave's missing index script and identify 100 indexes that would improve query speed by 99 percent or more; over 50 of these indexes point to those two indexing tables described earlier.  You desperately put up a release that applies them all.  It works; performance is somewhat improved.  However, for your top ten tables by size, your index size now exceeds your data size - by a significant amount.  And even though you don't see it, your insert performance is degraded.

Even with the indexes, your nightmare worsens.  Every time that your end-users run a query by keyword, that TVF described above creates a 75GB table in tempdb (the porta-potty of SQL Server, as Brent Ozar so correctly notes) and then runs millions of checks to enforce the UNION clause in the TVF.  It happens numerous times per day. No index on Earth can fix this disaster!  You have no options.

Desperation sets in...

You repeatedly add indexes, growing your database.  Your SAN is stressed because it wasn't designed for this workload.  You frankly tell your managers that your options are limited, so they fill the SAN with SSDs (while angrily muttering about the budget).  Your entire database, except for a few BLOB tables, is stored on SSDs. So is TEMPDB, which is now inordinately large compared to the size of the database itself. These measures improve performance noticeably.  You get to relax a little, until the SAN bus overloads because of the amount of traffic, much of which is now handling the terrifying amount of index I/O for inserts and updates.  Indexes become stale so fast that you have nightly reorganize/rebuild jobs.  Query plans go bad so fast that you have a nightly DBCC FREEPROCCACHE in a job.  You are doing every single worst-practice you've ever heard of, trying to keep the plates spinning.

But the nightmare gets more intense.  Why?  Because now, your database size is growing 2 1/2 times faster than the original capacity plans predicted -- because of the indexes.  Now, Company X needs a new SAN, bigger, faster and more expensive.  In an effort to get in front of the I/O you propose solutions.  You suggest normalizing the keyword tables, which would shrink them dramatically and make indexes more effective.  You're shot down over the coding effort.  You work furiously off-hours on views and triggers to attempt to make the normalized structure invisible to the application and still reap the benefits of the performance.  Your plan is to show management a complete solution.

However, other problems materialize and divert you away from your efforts.  The application itself wasn't written well, and blocking ensues.  Not just the transient blocks that are normal to SQL Server, but long blocks that hold up as many as a dozen other processes.  The workstation client of the app also crashes often in mid-transaction, leaving sleeping blockers out there that you have to kill in order to unclog the pipes of SQL Server.  No defined/enforced foreign keys.  You worry about widow and orphan rows, but you are powerless.

You desperately look for a way to make everything work.  You draft a set of recommendations to make changes to eliminate heaps and cluster most tables on a bigint identity, and to make other indexing changes to reduce space consumption and speed up SELECT and INSERT/UPDATE traffic. 

Scapegoat? Or a Ray of Hope?

One night, there is a spectacular failure.  A job fills up TEMPDB.  The job receives a SQL error that TEMPDB is full and crashes.  You receive a page after the fact, as the monitoring tool catches the error; when you investigate, you find out that something hit the fan!  Someone in the "war room" for the app launch decides, without checking anything, to restart SQL Server. But the SQL instance is in a failover cluster, and this person stops and restarts the SQL Server from the services console and not the cluster administrator.  You walk into work the next morning with a fiery angry note from the CIO in your inbox, blaming everything on you.  It is your fault that TEMPDB (already larger than a normal TEMPDB by 2.5 times) filled up, and you should have told them of the impending issue.

You spend the first hour of your morning going over the monitoring tool reports, so angry that your co-workers have to spend time to calm you down and keep you from resigning in a rage over being blamed falsely.  Your findings:

  1. The offending job filled up TEMPDB from empty to full (200+ GB) in 10 minutes
  2. The problem was runaway code in the program - an infinite loop.  How on Earth did this get into production?
  3. At the time that the person restarted the SQL Server instance (improperly), the job had crashed and TEMPDB was empty again.  There was no need to go down and no need to restart SQL Server.
  4. The SQL failover cluster is in an inconsistent state because of the improper restart, requiring another downtime to put the cluster into a proper state.
  5. And - later - you find out that the task that had the runaway code in it was superflous and does nothing; it should have been removed from the code prior to go-live!

You prepare a detailed report with numerous proofs of the facts above, but the CIO pooh-poohs the report.  The CIO orders an outside consulting firm to be brought in to review everything in the database area - you rightly take that as a vote of no-confidence in your work.  The crew of consultants descends on the offices, and reviews things.  As the review progresses, you find them surprisingly sympathetic.  You express your frustrations to the consultants, and explain in detailed terms the issues you've faced.

The consultants report that you did absolutely nothing wrong and in fact are heroically fighting uphill against a problem that will eventually result in disaster.  The report squarely critizes the code quality of the application and the packaged-app database design.  The recommendations the consultants make are not just somewhat like yours, they are identical to yours.  Some managers bristle at the report, but one, in particular, reads and understands.  This manager turns out to be your manager.  You see a tiny ray of hope start to shine.

A Voice is Heard!

It's as if a magic door opened.  Developers who used to come to you just for more and more indexes start asking you for help to figure out the better way to write new code and to repair existing things.  Horribly-performing stored procedures and views are redesigned - with your input and even some of your SQL.  Positive results are immediately seen.  Your recommendations are suddenly taken seriously and things change.  While you still have many of the issues (such as the keyword table), your outlook begins to improve.  You find out from co-workers that your manager has been going to bat for you and your recommendations.

You become a little emboldened, and you propose things that you know will make the database server's performance as good as possible, and your ideas are taken seriously.  Many are implemented.  The senior managers who once came down on you see your value and begin to trust you.  The application becomes stable as the bugs are written out of the code and the database is optimized as much as possible.  Your work week drops from 100 hours per week to 70, then 60, then 50.  Your spouse starts to remember your name.

You continue to take leadership, and begin to propose proactive measures to improve database performance across the enterprise.  Server modernization projects are approved, and not just for the packaged application, but for many other applications.  You begin to find relief.

A New Horror or a New Hope?

You hear big news.  A major new product, requiring a major new application, is coming.  You begin to fret, and hope that the database that you are given is at least passable.  Then something happens that can knock you over with a small feather:  The development manager consults you about database size and capacity planning.  The Application Architect then stuns you:  At the beginning of the project, before line one of code is written, you receive a SQL script containing the entire proposed schema!  Is this a new horror story?

You nervously execute the schema script, and then run a diagramming tool.  Tears of joy fill your eyes are you are looking at a perfectly designed fourth-normal-form diagram!  In disbelief you look for something amiss.  You can't find anything out of place!  The standards document you wrote when you were first hired was followed to the letter.  Even your naming standards for tables, indexes, columns, procedures and functions are perfectly followed!

Because the database is so well-designed, the new application develops ahead of plan, and your capacity projections seem to be quite accurate.  Your index needs during development and QA are minimal.  You have new hope!  You get your performance review and it's outstanding!  

When it's time to configure the production SQL servers, your recommendations are overdesigned a little bit.  However, they are approved without pushback.  The new SAN that the company purchased is a monster with great performance.  Your DBA team (Yes - a Team) is hired according to your recommendations; you have DBA support around the clock.  They report to you!

After another great day of work - it's hard work but it's fun work - you come home at a reasonable hour and rest with family, and think of the horrors of the past.  You laugh as you realize that your hard work during that time period, in which you achieved the highly improbable, is now repaid with great respect.  Your hair turned grey during that ordeal, but it was worth it.  At day's end, you lay your head on your pillow with a smile.  You sleep comfortably.  And the pager doesn't go off.

Epilog

This horror story is fictionalized, but is intended to resemble the nightmare scenario foisted upon many DBA.  This story had a happy ending.  True, not all of them will end with such an outcome.  However, if you work hard and keep steady on the course, you can and likely will receive a similar result.

-------

John F. Tamburo is the Chief Database Administrator for Landauer, Inc., the world's leading authority on radiation measurement, physics and education. John can be found at @SQLBlimp on Twitter.  John also blogs at www.sqlblimp.com.

Rate

4.53 (40)

You rated this post out of 5. Change rating

Share

Share

Rate

4.53 (40)

You rated this post out of 5. Change rating