September 11, 2019 at 8:58 am
I don't think that you can get a 1 minute diagnosis on a newish system.
Continuous improvement and engineering excellence are the way to get there. The cycle I would expect to go through would look something like the one below.
Even the best of us get surprised by the way things manage to go wrong in unanticipated ways. The important thing is to do the root cause analysis and feed that into the 4-step process above.
In my experience continuous improvement naturally leads to refactoring and simplification. This makes systems less likely to go wrong in the first place and much quicker to diagnose when they do.
There is a lot to learn from The Clean Coder by Robert C. Martin.
September 11, 2019 at 1:32 pm
HADR used to be difficult to set up in previous versions of SQL Server and Windows. However, the multitude of newer HADR configurations keeps improving and getting easier to administer, not just on-premises but also in the cloud. Further, HADR topologies combined with integration and migration into non-Microsoft HADR solutions keep pushing towards better and better designs whilst minimising/automating the administration burden and therefore maximising uptime. In short, five 9s are more than achievable on today's HADR ecosystems by diverting high administration costs into automated management and monitoring.
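To give a flavour of what that automated monitoring can look like, here is a rough sketch using the Always On DMVs. It just flags replicas that are unhealthy or disconnected; the alerting around it (and any thresholds) is left out and would be whatever your own setup uses.
-- Rough AG health check sketch: lists replicas that are not connected or not healthy.
-- Wire the output into an Agent job/alert of your choice; that part is not shown here.
SELECT  ag.name                         AS availability_group,
        ar.replica_server_name,
        ars.role_desc,
        ars.connected_state_desc,
        ars.synchronization_health_desc
FROM    sys.availability_groups AS ag
JOIN    sys.availability_replicas AS ar
        ON ar.group_id = ag.group_id
JOIN    sys.dm_hadr_availability_replica_states AS ars
        ON ars.replica_id = ar.replica_id
WHERE   ars.synchronization_health_desc <> 'HEALTHY'
   OR   ars.connected_state_desc <> 'CONNECTED';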
September 11, 2019 at 1:36 pm
I'm unconvinced whether AOG and other replication or scale-out technologies increase or decrease database availability. Designing the applications (and monitoring) to be more fault tolerant can increase system availability in terms of end-user perception and uptime reporting.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
September 11, 2019 at 2:51 pm
I don't think that you can get a 1 minute diagnosis on a newish system.
Even the best of us get surprised by the way things manage to go wrong in unanticipated ways.
Agreed, but I have come across a few scenarios where a DBA has advised a developer that "this is a huge mistake waiting to happen" and been overruled.
On these occasions you have your monitoring in place and can prove the issue in minutes (hopefully a good DBA would also have the rollback plan ready to go).
I'm running a server consolidation project at the minute with lots of linked servers involved... there's no way it's going live without every scenario I can conceive being tested and lots of scripts ready to protect us.
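For example (names made up), the sort of smoke test I keep ready for each linked server looks roughly like this, with [RemoteSrv] standing in for whatever the real linked server is called:
-- Linked server smoke test sketch; [RemoteSrv] is a placeholder name.
BEGIN TRY
    -- Trivial remote query: if connectivity, security or the remote instance is broken, this throws.
    EXEC ('SELECT TOP (1) name FROM master.sys.databases;') AT [RemoteSrv];
    PRINT 'Linked server [RemoteSrv] looks reachable.';
END TRY
BEGIN CATCH
    PRINT 'Linked server [RemoteSrv] failed: ' + ERROR_MESSAGE();
END CATCH;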
fingers crossed we can respond in 1 minute (but I think I just jinxed us)
MVDBA
September 11, 2019 at 2:54 pm
HA/DR can lower uptime if it adds complexity. Loose coupling and independence can help. We could argue that time spent on a broken secondary vs. a broken primary might increase downtime, but in most situations I think uptime is as high or higher.
September 11, 2019 at 3:01 pm
I'm unconvinced whether AOG and other replication or scale-out technologies increase or decrease database availability. Designing the applications (and monitoring) to be more fault tolerant can increase system availability in terms of end-user perception and uptime reporting.
To my mind, data availability is just as important as database availability. Obviously your DB has to be up and running, but if your scaled out DB has just replicated a delete without a where clause then you are just as stuffed as if the scaled out cluster went down.
I know that a lot of people are sadder and wiser for having experienced the horrors of BASE rather than ACID.
September 13, 2019 at 1:12 pm
if your scaled out DB has just replicated a delete without a where clause then you are just as stuffed as if the scaled out cluster went down.
4 words that need to be used in any delete situation
begin tran
rollback tran (or commit tran if your rowcount is good)
I never put a commit tran in a script until I know my where clause is good. I keep a special cupboard where we lock the naughty developers who forget this 🙂
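Spelled out (the table and the WHERE clause are made up for illustration), the pattern looks like this:
-- Safe delete pattern sketch; dbo.Orders and the date predicate are placeholders.
BEGIN TRAN;

DELETE FROM dbo.Orders
WHERE  OrderDate < '2015-01-01';

SELECT @@ROWCOUNT AS rows_deleted;  -- sanity-check the count before deciding

-- Then run exactly ONE of these by hand:
-- ROLLBACK TRAN;  -- the count looks wrong
-- COMMIT TRAN;    -- the count looks right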
MVDBA