July 3, 2011 at 4:18 pm
Hello,
There is a huge amount of data, Quest screenshots, application and server logfiles and other data to go with this problem....but so far myself, my boss and the initial conversation with Microsoft has not highlighted an area to concentrate on and am hoping someone can point me in the right direction 🙂
The short(ish) story:
We have a large OLTP database running on SQL 2005 cluster on Windows 2003. Was running an archiving stored procedure, from an SMSS session from a remote server, running collections of simple insert, delete, commit/rollback statements between 2 databases in the same instance.
Archiving started around 11pm and have proof of last execution being successfully executed, and therefore committed according to the code, at 3:26am. The SMSS session was not closed and the RDP session went idle. Following morning the SPID from SMSS was AWAITING COMMAND. From Quest history this session went to sleep once it committed but stayed sleeping and at some point switched to awaiting command.
The SPID was killed and went into rollback. Due to issues advised by Microsoft relating to parallel query rollbacks using a single thread....and the SLA for the customer in danger....it was decided to fail the server over. The database came back suspect and automatic repair failed due to index corruption. Was unable to drop the index so ran DBCC to confirm damage.
Was also restoring database from full/diff taken at 4am that morning. Original database was then forced into emergency mode and was pretty much useless from that point. Customer decided restore was preferable to repair with data loss. This is where it gets weird.
Forget the application data for now....this is just about the archiving session mentioned earlier. At a request of the customer due to time restrictions we restored the full/diff and the full/diff/tlog database and ran a check to see the difference. Apart from some simple jobs running there was no difference....which was to be expected. The application connection pool was completely blocked by the sleeping/awaiting command archiving session.....that had committed....but not closed some of its locks. We are still trying to get this lock data from Quest but due to the server load it was unable to log all data during this time we think.
However....a more worrying find was that there was no evidence of the archive process running after 00:50....which is the same time we had the last entry from the application. We have a screenshot taken at 8:17am when the engineer woke up showing the query executed successfully and therefore committed. We had successful tlog backups taken just as the differential started at 4am and all through till 10am and no matter what version of them are restored we can find no evidence of the archiving data being moved....it is not in the source and not in the destination database.
This seems to fly in the face of a transactional database structure and cannot think of a reason where a committed transaction which should have been written and backed up....not to be there.
The only thing i can think of is that multiple archiving executions were run from the same session and SPID consecutively and when the SPID went into rollback when killed it started to undo the runs....therefore nothing was really committed as the session was left sleeping....but that would be huge amounts of data in memory....hours worth of data insert and delete....and a false successful statement to the user through SMSS.
Anyone got any ideas what to look for or where to concentrate our efforts??
Thanks in advance 🙂
July 3, 2011 at 4:57 pm
Did the transaction commit? A session can be sleeping (awaiting command) with an open transaction. If the transaction doesn't commit, SQL will roll it back on restart, disconnect or restore.
You mentioned 2 databases, were they restored to the same point in time?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 3, 2011 at 5:00 pm
hi gail,
yeah due to the way the sp is written there is a commit or a try/catch block with rollback....then a confirmation written. the query executed successfully and the confirmation written so it must have committed.
didnt restore the second database as we could always remove the data written and re-run the archiving that was missing....there doesnt seem to be any real data loss....more a question over the transaction integrity which is a little worrying!
thanks
July 3, 2011 at 5:23 pm
If the commit was reached then the transaction is durable and the effects are persistant. Sure there's absolutely no way that the transaction may not have committed (nested transaction you didn't account for, error)?
Did you restore to a time after the commit occurred?
It's hard to say what happened looking at a summary, but one thing SQL guarantees (the durable of ACID) is that a committed transaction is permanent.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 3, 2011 at 5:39 pm
as i didnt run this myself and am relying on what i have been told i cannot say 100% there was no nested transaction as this could happen if trying to run multiple runs....i think there may have been multiple calls for the sp using different variables in the same query window to speed up the process but each section would have been self contained with individual commit/try/catch/rollback statements i would hope as i cant see a need for a nested transaction
i understand that SQL is based on transactional activity and it does seem that something is missing/being witheld....but without all the evidence i am trying to find a sql symptom which could be pointed back to an problem with the procedure....as far as SQL server was concerned there were no errors logged......just blocked sessions and application deadlocks.....if i could find out the lock that was the lead blocker i could answer some of these questions 🙁 the system was in a real mess.....even some of the SA sessions were blocked!!
one thing i am considering is that the application has its own deadlock manager which kills application sessions when they timeout or get blocked for a certain period of time....but the application had issues of its own and was down at this point....i am still waiting for the application analysis but as the archiving is local to SQL i cannot see how an application deadlock could affect a SQL session if it is killed....and if it did the archive session reports back as being killed....the archive procedure returned all the correct messages in SSMS.
we cannot really blame SQL for losing a transaction and that is probably why Microsoft arent being very helpful.....but it is a little confusing!!
i doubt anyone will be able to say "yes is was probably ...." but i dont like being without a theory let alone an answer!!
July 3, 2011 at 5:47 pm
The problem is that the explanation is not clear enough to be able to say definitively that X happened, from the sound of things no one really does know clearly what was going on.
All you need is for there to have been one more BEGIN TRAN run than COMMIT and the transaction would have been considered uncommitted and hence would have been rolled back.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 3, 2011 at 6:03 pm
my boss who ran this told me there was no uncommitted transaction as that would mean the sp rolled back without any notification....and the archving confirmation would never have been written.
however, that doesnt seem possible.
im not in a position to do much about this....and dont have any responsibility in this case....but i know something isnt right. the implications arent worth making unless i can find some evidence....and cleverly there isnt much i can find. im their first proper dba and only been there a year versus a sql/oracle developer and 15 year company veteran who helped code the original product....
thanks for your help....maybe ill ask to run them in future where they can be monitored and not left running whilst being asleep :p
if only developers could be more like dba's.....and dba's more like developers :p
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply