March 1, 2007 at 8:23 pm
If you've been following the weather in the US this year, and reading these editorials, you probably know we've had lots of snow this winter. From the week before Christmas until last week, Denver and the surrounding areas were nicely buried in snow. While I can see my driveway now, there are still 3- and 4-foot piles of snow in some places.
A few weeks ago upstate NY got over 100 inches of snow in a day or two, and ice hit all across the Eastern seaboard. Other airlines were affected, but not to the extent of the much smaller JetBlue. In the aftermath, you may have read that JetBlue cancelled a number of flights, but what was more interesting was a piece about their IT systems.
The issues were mostly in their reservation and call systems, and it was interesting that they built a new database to help their scheduling team improve their ability to multi-task. However, it's rare that I see a company mention its systems being overwhelmed. I know it still happens, and we plan for it, but it's not big news anymore.
Or maybe people really have gotten better. With all the video being distributed now, perhaps we've simply learned how to build large-scale systems. But it's Friday, so I wanted to see if you had any good stories.
When have your IT systems maxed out?
I have to say that I've rarely had my systems maxed. We've planned for it and architected all kinds of stuff, but rarely maxed anything out. One of my startup employers had all sorts of contingency plans, but the closest we came to serious system stress was in testing. Another small company had some servers stressed, but the overall system, with multiple front ends, was OK.
However, I did have some SQL Servers get pretty maxed out. At a large company, one group had designed a new internal web system. They spec'd the new servers, set up and configured the CMS, and ran a test. A few hours later they came downstairs to complain that the database wasn't performing as expected and was slowing down the system.
The server was maxed out. It was a 2-CPU box, so we could have gone bigger, but budget issues and time constraints were a problem. So I ran a trace, set the files to roll over at 50MB, and let it go. As soon as I started the trace I saw results fly by, and checking the file system, I was getting a new file every 5-10 minutes. After stopping things and going back to analyze the trace, we calculated that between 11 and 12 thousand queries a second were going through.
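For anyone who hasn't set one up, that was just a server-side Profiler trace with file rollover. Roughly like this, though the path and event list here are simplified stand-ins, not the exact ones I used:

-- Server-side trace, SQL Server 2000/2005 era.
-- Option 2 = TRACE_FILE_ROLLOVER; @maxfilesize is in MB; .trc is appended automatically.
DECLARE @TraceID int;
EXEC sp_trace_create
    @traceid     = @TraceID OUTPUT,
    @options     = 2,
    @tracefile   = N'C:\Traces\cms_load',
    @maxfilesize = 50;

-- Capture SQL:BatchCompleted (event 12): TextData (column 1) and Duration (column 13).
DECLARE @on bit;
SET @on = 1;
EXEC sp_trace_setevent @TraceID, 12, 1, @on;
EXEC sp_trace_setevent @TraceID, 12, 13, @on;

-- Start it (1 = start, 0 = stop, 2 = close and delete the definition).
EXEC sp_trace_setstatus @TraceID, 1;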
My report back was that there wasn't anything else we could do. Most of the traffic was from a few queries the CMS used to figure out what to display, the tables were well indexed, and the queries were well structured. We just could only run 11,000 of them a second 🙂
Steve Jones
March 1, 2007 at 9:37 pm
The only time we've maxed out at my current company, and we have, was when we built a really bad database that pegged the CPUs on a 4-way server. Some of the procedures were so badly written that they would take 15 minutes or more to recompile. It was a horror show.
Take that out of the picture... Nada.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
March 2, 2007 at 2:17 am
I worked for a company that ran one of those customer loyalty-type programs (points per £/$ spent and the like). They used third-party software as the 'loyalty engine', which processed flat files containing the points-issuing details for each period. The database was so badly constructed that putting just two small files into the system would grind it almost to a halt. The database was off-site, the third party had two DBAs and developers working on it, and I was not allowed to make any changes to the core system; yet as the company's sole DBA I was constantly being jumped on by all levels of management (this place was definitely too many cowboys, not enough Indians) because the system was failing.
Sound familiar, anyone?
Anyway, I ended up building an add-on queuing system so that files could be submitted concurrently but would not be processed until the previous one had finished.
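The guts of it were simple. Something along these lines, though the names here are invented and the real system had more bookkeeping around failures and retries:

-- Files land in a queue table instead of going straight to the loyalty engine.
CREATE TABLE dbo.FileQueue (
    QueueID     int IDENTITY(1,1) PRIMARY KEY,
    FileName    nvarchar(260) NOT NULL,
    SubmittedAt datetime NOT NULL DEFAULT GETDATE(),
    Status      char(1) NOT NULL DEFAULT 'Q'  -- Q = queued, P = processing, D = done
);

-- A polling job hands the engine the oldest queued file,
-- but only when nothing else is mid-flight.
SELECT TOP 1 QueueID, FileName
FROM dbo.FileQueue
WHERE Status = 'Q'
  AND NOT EXISTS (SELECT 1 FROM dbo.FileQueue WHERE Status = 'P')
ORDER BY QueueID;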
March 2, 2007 at 6:11 am
One of the great advantages of having a top-notch ability to troubleshoot SQL Server performance challenges is being able to conclude correctly when a system really is maxed out. Among other benefits, it prevents the holes in the wall that come from beating one's head against it.
On a contrary note, as I go from site to site I'm amazed at the folks who fail to comply with myriad best practices yet wonder why their systems fail to perform optimally.
March 2, 2007 at 6:28 am
Had a case of that recently. One of the business users complained that the faxing was always slow (we use a faxing system that saves the faxes in the db). I took a quick look at the server but couldn't see anything other than a couple of badly written queries, so I suggested he get on to the vendor and see what we're allowed to do to the db (3rd-party app).
A couple of days later I heard from the big boss that the faxing isn't working and must be fixed, so I took a closer look at the server (complete with a Profiler trace and perfmon stats).
I discovered that the 'server' was running two 450MHz processors and 1GB of memory.
I dread next month-end, as I doubt we'll have a new server by then, and we'll have the same mess when the box can't handle the load.
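For what it's worth, if the instance is on SQL Server 2005, part of that closer look can come straight from the plan cache rather than a trace. A rough sketch, not the exact queries I ran:

-- Top CPU consumers from the plan cache (SQL Server 2005 and later).
-- total_worker_time is in microseconds.
SELECT TOP 10
    qs.total_worker_time / qs.execution_count AS avg_cpu_microseconds,
    qs.execution_count,
    SUBSTRING(st.text, qs.statement_start_offset / 2 + 1,
        (CASE WHEN qs.statement_end_offset = -1
              THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
         END - qs.statement_start_offset) / 2 + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_cpu_microseconds DESC;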
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
March 2, 2007 at 6:33 am
At a company I worked for, we had a queue processor go haywire once. Way back when the company started, some custom code was written to change how customer orders printed out, and that code broke the index on the data queue. No one noticed because there wasn't much in the file, and even a year later things were still performing relatively well.
Then about 3 years in, the queue stopped being responsive and started running about an hour's lag during our busy seasons. Our DBA/RPG coder at the time said there was nothing that could be done; it was just the volume doing us in. Until he left about a year later, he kept the problem in check by deleting data from the queue. Over the next year the problem grew progressively worse, until there was a 5- or 6-hour lag and the processor was spiked most of the day.
That's about when I took over, and I eventually tracked the problem down. We compiled a new indexed view so the program didn't spend all of its time scanning huge numbers of records looking for the right one, pointed the program at the view instead of the file, and the processing lag disappeared. Average processor usage dropped something like 20-30% (and that includes two shifts' worth of minimal processing), and the processor was no longer pegged during the bulk of the work day: it fell from 90%+ on average to under 20%.
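That system was RPG on midrange hardware, but the SQL Server analogue of the fix is an indexed view over the queue, so lookups hit a small materialized slice instead of scanning the whole file. A sketch with invented names:

-- Hypothetical queue table standing in for the original data queue file.
CREATE TABLE dbo.OrderQueue (
    OrderID  int NOT NULL PRIMARY KEY,
    Status   varchar(10) NOT NULL,
    QueuedAt datetime NOT NULL
);
GO
-- Indexed views require SCHEMABINDING and two-part names.
CREATE VIEW dbo.vPendingOrders
WITH SCHEMABINDING
AS
SELECT OrderID, QueuedAt
FROM dbo.OrderQueue
WHERE Status = 'PENDING';
GO
-- The unique clustered index is what materializes the view;
-- the program then reads dbo.vPendingOrders instead of the base table.
CREATE UNIQUE CLUSTERED INDEX IX_vPendingOrders
ON dbo.vPendingOrders (OrderID);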
the difference a little optimizing can make
March 2, 2007 at 8:25 am
We have a problem now, not with SQL Server or any other database, but with our reporting engine. We use Crystal Reports on our server to generate reports in an ASP.NET application. For reasons that none of us understand, those CR reports will somehow max out. At some point users can't run any more reports, although they can continue to use all other functions of the website just fine. It seems like either a licensing issue or a resource issue. We've tried to address the resource side by making sure all CR resources are disposed of after they are used, but that still doesn't fix the problem. And, if we understand CR's licensing correctly, we're OK there, too.
So, we don't really know what's going on.
Kindest Regards,
Rod
Connect with me on LinkedIn.
March 2, 2007 at 9:27 am
The airline and travel industry mostly runs on software that was built in the 70s. Some of the hardware, too. JetBlue tried to use something else (their "new database"), and they got hosed. The problem is that while the old software is pretty bad, it's been around so long that it's relatively bug-free. If someone could modernize the software and keep it compatible with the old versions, they could make a killing.
I like the fact that JetBlue's CEO basically came out and said "we screwed up. We'll fix it," instead of giving us a load of spin. Now there's a guy I'd want to work for.
March 2, 2007 at 9:54 am
At my current company, we have a SQL Server that maxes out and actually locks up every two weeks (it started off on a monthly cycle). When this happens we cannot access the database and cannot even log on to the server; we need to do a cold reboot. There are no indications in the Event Viewer or the SQL Server logs. We thought it was a hardware issue and replaced the server (the replacement has the exact same configuration, since we had bought it to set up mirroring), but it did not help.
Configuration: AMD Opteron (2200 MHz), Windows 2003 R2 Standard (64-bit), SQL Server 2005 (64-bit), 8 GB of RAM, RAID 5.
Before we upgraded, we had 32-bit SQL Server 2000 on a 32-bit OS and never had any issues. There has been a small increase in load, but with the configuration we opted for, we expected better results, not worse.
The issues we see on the server are:
We tried changing the SQL Server max memory setting to 6 GB, thinking we could reduce the contention between the OS and SQL Server. That helped us maintain up to 10% free RAM, but did not help the overall situation.
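For reference, that change amounts to the following (sp_configure takes the value in MB):

-- Cap SQL Server's buffer pool at 6 GB, leaving the rest for the OS.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 6144;
RECONFIGURE;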
The next thing we plan to do is set the "Lock Pages in Memory" option, a Microsoft recommendation for 64-bit systems.
If that does not work, we want to try applying SQL Server 2005 Service Pack 2, which is supposed to improve cache handling.
We are taking a piecemeal approach to get to the bottom of the issue. Any suggestions on how to tackle it are highly appreciated. Please help!
Also, can you suggest any monitoring tools that would help in the long run?
Thank You in advance.
Bobby
March 2, 2007 at 10:14 am
The airlines sell time, so their systems need time-interval arithmetic in both algebra and calculus. But their schedules and logistics are not regulated, so they will not hire the mathematicians needed to hand clean math to their application developers. The developers therefore create schedules using only time-interval algebra, which is rigid and brittle, with no contingency built in. I assume they consider schedules and logistics an expense, which is a very wrong assumption.
Kind regards,
Gift Peddie
March 2, 2007 at 11:27 am
I work in healthcare, where the VAX is still used heavily. We max out every day at midnight when we have to do day-end processing. Even during the day we max out, and we have long waits in our registration and billing processes. We have a 4-node VMS cluster, but only one node controls the posters and DB managers, and it gets behind all the time. My SQL Servers are a different story: I had some locking issues a few months ago with big reports running while users were on the system, but that's fixed now with some locking hints.
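For the curious, the report fix amounted to read hints along these lines. The table names here are invented, and NOLOCK does mean dirty reads, which is an acceptable trade-off for our summary reports:

-- Hypothetical reporting tables; the real schema isn't shown here.
CREATE TABLE dbo.Patients (PatientID int PRIMARY KEY, Name varchar(100));
CREATE TABLE dbo.Charges  (ChargeID int PRIMARY KEY, PatientID int, Amount money);

-- The big report reads with NOLOCK so it doesn't block registration and billing writes.
SELECT p.PatientID, p.Name, SUM(c.Amount) AS TotalCharges
FROM dbo.Charges AS c WITH (NOLOCK)
JOIN dbo.Patients AS p WITH (NOLOCK) ON p.PatientID = c.PatientID
GROUP BY p.PatientID, p.Name;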
March 26, 2007 at 1:00 pm
Santosh, can you describe your storage? You mention RAID 5; who is the storage provider? Internal or external disk?
May 7, 2007 at 12:12 pm
Santosh, if you have not already done so, check to see what jobs are running every two weeks and run traces against those jobs. Recycle the SQL Server logs more frequently (this helped our plant systems). Has the Windows server admin checked the paging and the amounts of waits and reads? As Rob suggested, what are you using for storage?
SpiceDBA
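In case it helps, "recycle the logs" doesn't require a restart; it's a one-liner you can schedule as an Agent job:

-- Close the current SQL Server error log and start a new one,
-- keeping each log file small and quick to open.
EXEC sp_cycle_errorlog;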