DB restore stucks after recovery step

Question

DB restore stucks after recovery step

Viewing 15 posts - 16 through 30 (of 32 total)

You must be logged in to reply to this topic. Login to reply

Mr. Brian Gale SSC-Insane Points: 24800 More actions · Answer 1

You say you have 18,000 data files. are those all on different disks? If not, you will not really get a performance benefit and I can see that causing a lot of slowdown in the restore as you will need to jump around a LOT on disk with each write to the datafile.
Is your backup and log on the same disk as your datafiles? This will slow them down as well.

As for that "PostRestoreContainerFixups: fixups complete" message, I am not 100% sure what that does and information online is a bit lacking. I tried looking for information on it as well and came up empty. Except in all cases that I did find mention of it, it looked like it completed rather quickly (2-5 seconds in a 3700 second restore).

I have worked on databases that have multiple data files, but never anything with the numbers that you are, and am thinking that might be your issue?

The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there is negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

sgt500 Default port Points: 1413 More actions · Answer 2

bmg002 - Monday, March 13, 2017 8:45 AM
You say you have 18,000 data files. are those all on different disks? If not, you will not really get a performance benefit and I can see that causing a lot of slowdown in the restore as you will need to jump around a LOT on disk with each write to the datafile.
Is your backup and log on the same disk as your datafiles? This will slow them down as well.
As for that "PostRestoreContainerFixups: fixups complete" message, I am not 100% sure what that does and information online is a bit lacking. I tried looking for information on it as well and came up empty. Except in all cases that I did find mention of it, it looked like it completed rather quickly (2-5 seconds in a 3700 second restore).
I have worked on databases that have multiple data files, but never anything with the numbers that you are, and am thinking that might be your issue?

yes, dfs and backup file are on the same disk, like said before, this is a test server, but restore/recovery phase ends in about 6-7h, I dnot understand what he is doing in the other 7 hours.
Maybe the number of df cause this behavior

Mr. Brian Gale SSC-Insane Points: 24800 More actions · Answer 3

That'd be my next guess. I would think that SQL would need to do some form of sanity check to make sure that the data pages are all properly in the data files and that it knows how to link the different data files.

Since this is a test DB, can you dump all of the data from the extra data files into the 1 data file, do a backup and then restore and see if there is any improvement?
Sucky part about doing that is that you are going to have to do 17,999 SHRINKFILE with the EMPTYFILE option and ALTER DATABASE to remove the extra mdf's which may not be easy to code up. I'd also look at pre-growing the primary mdf as you go. You won't be able to pre-grow it to the full 9 TB at once (as you don't have disk space for that), but maybe grow it by 1 TB, then kill off however many secondary mdf's as you can until you free up 1 TB then pre-grow the primary again and repeat until you are at 1 mdf.

For replicated code that has an easy counter variable, it is likely easiest to use dynamic SQL to generate the code. Otherwise you are going to have (17,999 * 3) + 1 lines of code to type up.

Since everything is all on the same disk, there isn't as much benefit having multiple data files. I believe if they are SSD's you get a little bit of a performance increase having more data files, but it is a trial and error thing. And I highly doubt that 18,000 data files would give you any performance increase. Especially since the max threads would be limited by the number of CPU's minus ones that are in use by other processes.

I suppose if you were running this on a FAT32 partition, you'd need a LOT of data files as I believe that has a 4 GB per file limitation (with tricks to get around it), but if you are running it on NFTS (which you should be if not exFAT), I can't think of any benefit of having 18,000 data files. Plus if you are running it on FAT32, you are going to have a whole slew of problems (slowness on the DB due to the page/extent size vs the block size of the disk for example). Although I think FAT32 has a max size of 2 TB, so pretty sure you are running it on NTFS or exFAT or some unsupported version of FAT32.

The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there is negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

Sue_H SSC Guru Points: 90860 More actions · Answer 4

sgt500 - Monday, March 13, 2017 9:52 AM
yes, dfs and backup file are on the same disk, like said before, this is a test server, but restore/recovery phase ends in about 6-7h, I dnot understand what he is doing in the other 7 hours.
Maybe the number of df cause this behavior

Interestingly enough, there is one reference I found to PostContainerStoreFixups. It states that this step resolves the post-restore configuration of sysaltfiles/sysdbfiles:
When doing a database restore, what actions is the engine taking after it gets past the 100%?

Sue

sgt500 Default port Points: 1413 More actions · Answer 5

Sue_H - Monday, March 13, 2017 11:43 AM
sgt500 - Monday, March 13, 2017 9:52 AM
yes, dfs and backup file are on the same disk, like said before, this is a test server, but restore/recovery phase ends in about 6-7h, I dnot understand what he is doing in the other 7 hours.
Maybe the number of df cause this behavior
Interestingly enough, there is one reference I found to PostContainerStoreFixups. It states that this step resolves the post-restore configuration of sysaltfiles/sysdbfiles:
When doing a database restore, what actions is the engine taking after it gets past the 100%?
Sue

Nice!
This confirms the hypothesis the number of datafiles affects the timing of this step. I may have further confirmation restoring anothe dbr of about 8TB , but with fewer datafiles (5K)

This may also explain the lock that I get in this phase by running "select * from sys.sysdatabases" (see 1th post)
I update you in the next few days on the second db result
ty all

sgt500 Default port Points: 1413 More actions · Answer 6

select from sysaltfile blocked too 🙂
session_id,wait_type,sql_text,blocked_by
52 LCK_M_S select * from sys.sysaltfil 55
55 NULL RESTORE DATABASE [dbxxxx] FROM DISK... 0

Mr. Brian Gale SSC-Insane Points: 24800 More actions · Answer 7

sgt500 - Tuesday, March 14, 2017 9:34 AM
select from sysaltfile blocked too 🙂
session_id,wait_type,sql_text,blocked_by
52 LCK_M_S select * from sys.sysaltfil 55
55 NULL RESTORE DATABASE [dbxxxx] FROM DISK... 0

For getting that data, try adding in a WITH (NOLOCK) option on the table. That should get you data from the tables that are locked.
That is:
SELECT * FROM sys.sysaltfil WITH(NOLOCK)

The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there is negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

Sue_H SSC Guru Points: 90860 More actions · Answer 8

sgt500 - Tuesday, March 14, 2017 9:34 AM
select from sysaltfile blocked too 🙂
session_id,wait_type,sql_text,blocked_by
52 LCK_M_S select * from sys.sysaltfil 55
55 NULL RESTORE DATABASE [dbxxxx] FROM DISK... 0

Yup...so that does seem to explain the time spent in PostContainerStoreFixups. I'd leave it as it is for now and keep checking the locks on that table. I bet the locks are there for the duration. Did you do the restore with the verbose trace flag again?

Sue

Perry Whittle SSC Guru Points: 234013 More actions · Answer 9

sgt500 - Wednesday, March 8, 2017 7:57 AM
MOVE ....
.... (18K+)

Just so i know i'm not misreading, could you please clarify what is meent by the 18K+

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉

Mr. Brian Gale SSC-Insane Points: 24800 More actions · Answer 10

Perry Whittle - Tuesday, March 14, 2017 10:36 AM
sgt500 - Wednesday, March 8, 2017 7:57 AM
MOVE ....
.... (18K+)
Just so i know i'm not misreading, could you please clarify what is meent by the 18K+

I asked that previously. He means repeat for all 18,000 data files! I can't imagine why a system that would need that many data files... his "smaller" system has 5,000 data files... I think our largest is 2.

The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there is negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

Perry Whittle SSC Guru Points: 234013 More actions · Answer 11

bmg002 - Tuesday, March 14, 2017 10:53 AM
Perry Whittle - Tuesday, March 14, 2017 10:36 AM
sgt500 - Wednesday, March 8, 2017 7:57 AM
MOVE ....
.... (18K+)
Just so i know i'm not misreading, could you please clarify what is meent by the 18K+
I asked that previously. He means repeat for all 18,000 data files! I can't imagine why a system that would need that many data files... his "smaller" system has 5,000 data files... I think our largest is 2.

That's what i thought :w00t:
Dare I ask how many filegroups? :Whistling:

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉

sgt500 Default port Points: 1413 More actions · Answer 12

Perry Whittle - Tuesday, March 14, 2017 11:13 AM
bmg002 - Tuesday, March 14, 2017 10:53 AM
Perry Whittle - Tuesday, March 14, 2017 10:36 AM
sgt500 - Wednesday, March 8, 2017 7:57 AM
MOVE ....
.... (18K+)
Just so i know i'm not misreading, could you please clarify what is meent by the 18K+
I asked that previously. He means repeat for all 18,000 data files! I can't imagine why a system that would need that many data files... his "smaller" system has 5,000 data files... I think our largest is 2.
That's what i thought :w00t:
Dare I ask how many filegroups? :Whistling:

🙂
is the app works by design. It is multitenant. and each tenant creates x FileGroups.
I inherited this situation, there are no real reasons. I asked to change this behavior, but the old continuranno tenants to have all these fg

Perry Whittle SSC Guru Points: 234013 More actions · Answer 13

sgt500 - Wednesday, March 15, 2017 2:20 AM
Perry Whittle - Tuesday, March 14, 2017 11:13 AM
bmg002 - Tuesday, March 14, 2017 10:53 AM
Perry Whittle - Tuesday, March 14, 2017 10:36 AM
sgt500 - Wednesday, March 8, 2017 7:57 AM
MOVE ....
.... (18K+)
Just so i know i'm not misreading, could you please clarify what is meent by the 18K+
I asked that previously. He means repeat for all 18,000 data files! I can't imagine why a system that would need that many data files... his "smaller" system has 5,000 data files... I think our largest is 2.
That's what i thought :w00t:
Dare I ask how many filegroups? :Whistling:
🙂
is the app works by design. It is multitenant. and each tenant creates x FileGroups.
I inherited this situation, there are no real reasons. I asked to change this behavior, but the old continuranno tenants to have all these fg

Expect any restore to take considerable amount of time, such as the timeframe you are seeing.
Are you taking filegroup backups?

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉

sgt500 Default port Points: 1413 More actions · Answer 14

Perry Whittle - Wednesday, March 15, 2017 4:40 AM
sgt500 - Wednesday, March 15, 2017 2:20 AM
Perry Whittle - Tuesday, March 14, 2017 11:13 AM
bmg002 - Tuesday, March 14, 2017 10:53 AM
Perry Whittle - Tuesday, March 14, 2017 10:36 AM
sgt500 - Wednesday, March 8, 2017 7:57 AM
MOVE ....
.... (18K+)
Just so i know i'm not misreading, could you please clarify what is meent by the 18K+
I asked that previously. He means repeat for all 18,000 data files! I can't imagine why a system that would need that many data files... his "smaller" system has 5,000 data files... I think our largest is 2.
That's what i thought :w00t:
Dare I ask how many filegroups? :Whistling:
🙂
is the app works by design. It is multitenant. and each tenant creates x FileGroups.
I inherited this situation, there are no real reasons. I asked to change this behavior, but the old continuranno tenants to have all these fg
Expect any restore to take considerable amount of time, such as the timeframe you are seeing.
Are you taking filegroup backups?

you're right, these tests have been useful to have proof of this fact.
Now the backup policy is a weekly full and daily diff, before was scheduled only a backup full every two days

sgt500 Default port Points: 1413 More actions · Answer 15

hi,
I would like to give you an update about the restore tests. I restored the smallest one,7TB with 4600 files.
It took a lot less time. the high number of files definitely impacts on the timing.

2017-03-28 09:45:45.210    spid58    RestoreDatabase: Finished
2017-03-28 09:45:45.180    Backup    RESTORE DATABASE successfully processed 62086684 pages in 3003.526 seconds (161.494 MB/sec).
2017-03-28 09:45:45.180    spid58    Restore: Done with MSDB maintenance
2017-03-28 09:44:14.340    spid58    Writing backup history records
2017-03-28 09:44:07.600    Backup    Database changes were restored. Database: db_xxxxxx2, creation date(time): 2011/12/19(12:47:05), first LSN: 11396148:9851828:16, last LSN: 11396148:10901616:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {'X:\xxxxxxx.diff'}). This is an informational message. No user action is required.
2017-03-28 09:44:07.600    spid58    Restore: Writing history records
2017-03-28 09:44:07.350    spid58    Resuming any halted fulltext crawls
2017-03-28 09:44:07.090    Backup    Restore is complete on database 'db_xxxxxx2'. The database is now available.
2017-03-28 09:44:07.090    spid58    Restore: Database is restarted
2017-03-28 09:44:06.850    spid58    PostRestoreReplicationFixup for db_xxxxxx2 complete
2017-03-28 09:44:04.660    spid58    PostRestoreReplicationFixup for db_xxxxxx2 starts
2017-03-28 09:44:04.630    spid58    CHECKDB for database 'db_xxxxxx2' finished without errors on 2017-02-18 08:04:33.693 (local time). This is an informational message only; no user action is required.
2017-03-28 09:44:04.200    spid58    PostRestoreContainerFixups: fixups complete
2017-03-28 09:43:50.980    spid58    Recovery completed for database db_xxxxxx2 (database ID 5) in 7 second(s) (analysis 429 ms, redo 1812 ms, undo 4387 ms.) This is an informational message only. No user action is required.
2017-03-28 09:43:50.820    spid58    Recovery is writing a checkpoint in database 'db_xxxxxx2' (5). This is an informational message only. No user action is required.
2017-03-28 09:43:45.490    spid58    Zeroing completed on X:\Data\db_xxxxxx2\db_xxxxxx2_log.ldf
2017-03-28 09:43:45.480    spid58    Zeroing X:\Data\db_xxxxxx2\db_xxxxxx2_log.ldf from page 60514567 to 60515047 (0x736c20e000 to 0x736c5ce000)
2017-03-28 09:43:22.950    spid58    Starting up database 'db_xxxxxx2'.
2017-03-28 09:43:19.540    spid58    Restore: Restarting database for ONLINE
2017-03-28 09:43:19.480    spid58    Restore: Transitioning database to ONLINE
2017-03-28 09:43:19.430    spid58    Restore: Done with fixups
2017-03-28 09:43:18.950    spid58    Rollforward complete on database db_xxxxxx2
2017-03-28 09:43:08.630    spid58    OfflineRollforward: StopLsn/LastLsn(PruId: 0): 0xade434:0xa65870:0x1
2017-03-28 09:42:46.480    spid58    RunOfflineRedo logIter.Init(): LastLsn(PruId: 0): 0xade434:0xa65870:0x1
2017-03-28 09:42:46.480    spid58    RunOfflineRedo logIter.Init(): FirstLsn(PruId: 0): 0xade434:0x9653b4:0x10
2017-03-28 09:42:46.380    spid58    Restore VLF headers completed on database db_xxxxxx2
2017-03-28 09:42:45.020    spid58    Restore VLF headers for 209 headers on database db_xxxxxx2
2017-03-28 09:42:42.240    spid58    Restore-Redo begins on database db_xxxxxx2
2017-03-28 09:42:42.040    spid58    The database 'db_xxxxxx2' is marked RESTORING and is in a state that does not allow recovery to be run.
2017-03-28 09:42:28.130    spid58    Starting up database 'db_xxxxxx2'.
2017-03-28 09:42:25.050    spid58    Restore: Backup set restored
2017-03-28 09:42:24.130    spid58    Restore: Data transfer complete on db_xxxxxx2
2017-03-28 09:42:24.130    spid58    FileHandleCache: 0 files opened. CacheSize: 140
2017-03-28 09:41:45.800    spid58    SetTargetRestoreAge: 0
2017-03-28 09:41:41.050    spid58    Restore: LogZero complete
2017-03-28 09:41:41.040    spid58    Zeroing completed on X:\Data\db_xxxxxx2\db_xxxxxx2_log.ldf
2017-03-28 09:12:06.050    spid58    Restore: Waiting for log zero on db_xxxxxx2
2017-03-28 09:12:06.050    spid58    Completed MSDA in stream 0, thread 6436
2017-03-28 08:52:20.310    spid58    Starting MSDA in stream 0, thread 6436, length 0x764bc00000
2017-03-28 08:52:20.310    spid58    Restore: Transferring data to db_xxxxxx2
2017-03-28 08:52:20.310    spid58    Restore: Restoring backup set
2017-03-28 08:52:20.010    spid58    Restore: Containers are ready
2017-03-28 08:52:19.990    spid58    Zeroing X:\Data\db_xxxxxx2\db_xxxxxx2_log.ldf from page 1 to 91290496 (0x2000 to 0xae1f700000)
2017-03-28 08:51:29.000    spid58    Restore: PreparingContainers
2017-03-28 08:51:24.910    spid58    Restore: BeginRestore (offline) on db_xxxxxx2
2017-03-28 08:51:24.910    spid58    Restore: Planning complete
2017-03-28 08:51:24.910    spid58    Dismounting FullText catalogs
2017-03-28 08:51:12.230    spid58    Restore: Planning begins
2017-03-28 08:51:12.230    spid58    Restore: Backup set is open
2017-03-28 08:51:12.230    spid58    Restore: Configuration section loaded
2017-03-28 08:51:11.850    spid58    SetTargetRestoreAge: 0
2017-03-28 08:51:11.510    spid58    Opening backup set
2017-03-28 08:51:11.460    spid58    X-locking database: db_xxxxxx2
2017-03-28 08:51:11.310    spid58    RestoreDatabase: Database db_xxxxxx2
2017-03-28 08:51:11.310    spid58    DBCC TRACEON 3605, server process ID (SPID) 58. This is an informational message only; no user action is required.
2017-03-28 08:51:11.310    spid58    DBCC TRACEON 3604, server process ID (SPID) 58. This is an informational message only; no user action is required.
2017-03-28 08:51:11.310    spid58    DBCC TRACEON 3014, server process ID (SPID) 58. This is an informational message only; no user action is required.
2017-03-28 08:51:11.310    spid58    DBCC TRACEON 3004, server process ID (SPID) 58. This is an informational message only; no user action is required.
(this is differential restore log only)

this confirms what previously said about number of files,
ty all
bye bye