May 4, 2012 at 5:24 pm
I have a 3TB database. I understand that SQL Server can back it up; the problem is not only backing it up but, more importantly, restoring it. We do weekly fulls, daily diffs, and log backups every 15 minutes. The full is the challenge here. For the full, we break the backup into 15 files spread over 4 separate local drives on RAID 0 with a 64K block size. Still, it takes roughly 48 hours. We tried a third-party tool, but much of the data in this database is BLOB and we can't seem to get much of a compression ratio out of it; worse, the tool froze up, so we are back to native SQL backups.
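For reference, the striped full backup is essentially this shape (database name, paths, and file count are trimmed down for the post; we actually stripe 15 files over the 4 drives, and the BUFFERCOUNT / MAXTRANSFERSIZE values are illustrative knobs to experiment with, not settings we currently use):

BACKUP DATABASE [MyBigDB]   -- placeholder name
TO  DISK = 'E:\Backup\MyBigDB_01.bak',
    DISK = 'F:\Backup\MyBigDB_02.bak',
    DISK = 'G:\Backup\MyBigDB_03.bak',
    DISK = 'H:\Backup\MyBigDB_04.bak'   -- the real job stripes 15 files over these 4 drives
WITH CHECKSUM, STATS = 5,
     BUFFERCOUNT = 64,           -- more buffers can help keep all stripes busy
     MAXTRANSFERSIZE = 4194304;  -- 4 MB transfers; worth testing against the IO path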
How do you back up yours? Thanks.
May 5, 2012 at 3:45 am
How many file groups do you use for the database?
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
May 5, 2012 at 8:10 am
I have a 1.8TB database that is currently using 1.6TB of that space. The backups take about 2 hours to complete using LiteSpeed, and I restore that backup to a reporting system every night, which takes about 1 hour to complete.
I have the database spread across 2 LUNs right now and will spread it out further soon. I back up to a single LUN that I can split and present to the reporting server for restore.
What kind of SAN do you have?
Do you have more than 1 HBA and how are they configured?
What is the connection speed of the HBAs?
To improve performance, you need a faster SAN and IO subsystem. You will probably also need multiple HBAs so you can dedicate IO channels to the backup and to each LUN.
You can get improved performance by spreading the data files across multiple LUNs and striping the backup across multiple LUNs - but if you don't have the additional IO channels to support that configuration, it won't actually help much, because you will saturate the IO subsystem and cause performance issues for both the backup process and normal user access.
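As a rough sketch of the data-file side (database name, drive letters, and sizes are hypothetical), spreading out means adding files on the additional LUNs; existing data is not rebalanced automatically, so this mostly helps new allocations unless you rebuild indexes afterwards:

-- Hypothetical sketch: add data files on additional LUNs so IO is spread
-- across more paths. Names, drive letters, and sizes are made up.
ALTER DATABASE [MyBigDB]
ADD FILE ( NAME = N'MyBigDB_Data2',
           FILENAME = N'H:\Data\MyBigDB_Data2.ndf',
           SIZE = 200GB, FILEGROWTH = 10GB );

ALTER DATABASE [MyBigDB]
ADD FILE ( NAME = N'MyBigDB_Data3',
           FILENAME = N'I:\Data\MyBigDB_Data3.ndf',
           SIZE = 200GB, FILEGROWTH = 10GB );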
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
May 5, 2012 at 9:36 am
Thanks for the replies. I have only two filegroups, primary and secondary, and about 20 database files on the SAN. The 4 local backup storage devices I have are MSA drives.
When our server admin checked the disk queue length of the MSA drives, they didn't see any overload.
I noticed the backup starts out at a good speed, then gets slower and slower, and is crawling towards the end. Are the files too big to write to?
Other than IO, are there things in the database that could slow down data being sent to the drives?
Something is not functioning right, as 3TB shouldn't take days to back up... even if it is a native backup.
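In case it is useful to anyone diagnosing the same symptom, this is roughly how the tapering can be watched from another session while the backup runs (standard DMVs; nothing here is specific to my setup):

-- Watch a running backup: does percent_complete stall relative to elapsed time?
SELECT r.session_id,
       r.command,
       r.percent_complete,
       r.wait_type,
       DATEDIFF(MINUTE, r.start_time, GETDATE()) AS elapsed_minutes,
       r.estimated_completion_time / 60000       AS est_minutes_remaining
FROM sys.dm_exec_requests AS r
WHERE r.command LIKE 'BACKUP%';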
May 7, 2012 at 2:15 am
@OP,
If possible, try to divide the database into multiple filegroups/files. It will help you in the future, as a sudden/urgent restore might otherwise put you in danger :-).
-------Bhuvnesh----------
I work only to learn Sql Server...though my company pays me for getting their stuff done;-)
May 8, 2012 at 12:11 am
Hmm... perhaps you should engage your SAN vendor about their best practices for snapshots, etc. Databases are well understood by the vast majority of enterprise storage vendors - you might be very surprised by the level of expertise/understanding that your SAN vendor can bring to bear...
Joe
May 8, 2012 at 6:36 am
You cannot manage TB-sized databases on a crappy IO subsystem. Do a file IO stall analysis on all of the components hit during your backup, and I am certain you will find the cause of your slowness.
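A minimal sketch of that kind of IO stall check is below (the numbers are cumulative since the last restart, so compare snapshots taken before and after the backup window):

-- File IO stall sketch using sys.dm_io_virtual_file_stats.
-- High average stall (ms per read/write) on the data files or on the backup
-- target volume during the backup window points at the slow component.
SELECT DB_NAME(vfs.database_id)                               AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)    AS avg_read_stall_ms,
       vfs.num_of_writes,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0)   AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
     ON mf.database_id = vfs.database_id
    AND mf.file_id     = vfs.file_id
ORDER BY avg_write_stall_ms DESC;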
Oh, one more thing - have you checked your OS file system for fragmentation? If you are like many of the clients I have come across, you just might have left the default 1MB growth increment and have hundreds of thousands or even millions of tiny fragments all over the disks. 🙂
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
May 8, 2012 at 4:01 pm
Jeffrey, Joe and Kevin hit all of the relevant points.
We have a 1.5 TB database filled with 1.2 TB of data. Using SQL LiteSpeed, a good SAN, multiple HBAs, and vendor best practices (SAN, Windows, HBA, and SQL Server), we create a full database backup in 20-25 minutes!
Oh, we have not even gone to multiple filegroups or multiple files within a filegroup yet - just one .mdf, one .ldf, and one .bak file!
Regards,
Rudy Komacsar
Senior Database Administrator
"Ave Caesar! - Morituri te salutamus."
May 8, 2012 at 8:17 pm
I think Jeffrey is correct in saying you need more I/O.
Remember: to do a full backup, every page on disk must be read into memory, and then every page must be written out from memory.
The questions you should be asking are: how fast can I read, and on what channel/bus? How fast can I write, and on what channel/bus? And how long does the data stay in memory as the buffer (how much memory is in the server)? I used to back up a 6 TB database in a few hours, but for writing I had 4 dedicated tape drives on 4 dedicated SCSI cards, for reading I had 4 HBA cards, and I had 96 GB of memory.
Have you tried just backing up to the NUL: device to test how fast you can read?
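Roughly like this, with your database name in place of the placeholder (COPY_ONLY keeps the test from resetting your differential base, and the completion message reports throughput in MB/sec):

-- Read-speed test: back up to the NUL device so no real write happens.
BACKUP DATABASE [MyBigDB]
TO DISK = 'NUL'
WITH COPY_ONLY, STATS = 5;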
Have you tried to benchmark how fast you write? What is the HBA queue depth setting? What is the size of the cache on the SAN you are writing to? Remember, if the SAN cache is 6 GB split 50% reads / 50% writes and you are reading 2 TB, the read cache isn't going to help. I am assuming you are reading and writing to the same SAN.
Maybe you want to change the SAN cache split for a test?
I hate to ask, but are the disks aligned? If they are not, you could be suffering exactly as you are, with horrible write performance.
May 9, 2012 at 8:49 am
Tim-153783 (5/8/2012)
Have you tried just backing up to the NUL: device to test how fast you can read? Have you tried to benchmark how fast you write?
I hate to ask but are the disks aligned? If they are not, you could be suffering exactly as you are with horrible write performance.
1) Calculate some theoretical "guess best case" numbers - if you have 3,145,728MB of incompressible data and your storage throughput caps out at (a purely made-up number) 300MB/s, then a full write of your data should take about 3 hours (not including verify). Note that I'd expect to get 300MB/s of sequential write or read speed on a single file on a mere handful of drives, either on a 4Gbps SAN or on local storage. You've got at least one serious problem, perhaps more.
2) Calculate your current "actual case" numbers - from the OP, 3,145,728MB in 48 hours is about 18MB/s. You don't happen to have two bonded/teamed active/active 100Mbps NICs in the mix somewhere, do you :-P?
3) Check for the usual problems on every single piece of equipment involved:
3a) OS-level partition alignment ('wmic partition' or 'wmic /node:"servername or ip" partition get /all', see StartingOffset)
3b) Deeper level SAN/MSA/etc. partition issues
3c) Network throughput/misconfiguration
3d) SQL Server index fragmentation... or heaps
3e) Filesystem level fragmentation, of either files or free space
3f) Excessive VLFs (DBCC LOGINFO - see the VLF-count sketch after the SQLIO script below)
3g) Full general hardware checks - RAIDs in degraded mode?
3h) Check for shared spindles - document everything that could be reading or writing to any given spindle at one time.
3h i) 15 files to a single 4-spindle set means each file is competing with 14 others at the same time.
3i) Check for drive issues - run vendor diagnostics. Any drive diagnostics that run in less than a few minutes aren't doing a surface scan - get better diagnostics (perhaps offline diags).
4) Benchmark each of your sets of storage, perhaps with SQLIO. Perhaps you can't read fast enough from the main database or from the backup, perhaps you can't write fast enough to the backup or to the restore location.
4a) Benchmark the network paths as well; copy files manually, perhaps.
5) Examine your architecture - it sounds like it's overly complicated, writing 15 files to a 4 disk RAID 0. Perhaps try writing only one file to your 4 disk RAID 0, or if you really want to keep things complex, break the RAID entirely and write 1 file per spindle. You could also try writing a single file to a RAID level that has some redundancy, like RAID 5 - RAID 0 does not always provide very good performance on enterprise RAID controllers, based on my benchmarking - this is likely due to levels of optimization in the controller and its drivers.
Example SQLIO script; make sure you're testing against the same number of files each storage device will normally see... and then test against one file per storage device:
rem 2 Outstanding IOs
sqlio -kW -s1200 -frandom -o2 -b8 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -frandom -o2 -b64 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -fsequential -o2 -b8 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -fsequential -o2 -b64 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -fsequential -o2 -b256 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -fsequential -o2 -b1024 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -frandom -o2 -b8 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -frandom -o2 -b64 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -fsequential -o2 -b8 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -fsequential -o2 -b64 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -fsequential -o2 -b256 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -fsequential -o2 -b1024 -LS -F%1
timeout /T 1400
rem 8 Outstanding IOs
sqlio -kW -s1200 -frandom -o8 -b8 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -frandom -o8 -b64 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -fsequential -o8 -b8 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -fsequential -o8 -b64 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -fsequential -o8 -b256 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -fsequential -o8 -b1024 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -frandom -o8 -b8 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -frandom -o8 -b64 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -fsequential -o8 -b8 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -fsequential -o8 -b64 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -fsequential -o8 -b256 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -fsequential -o8 -b1024 -LS -F%1
timeout /T 1400
rem 16 Outstanding IOs
sqlio -kW -s1200 -frandom -o16 -b8 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -frandom -o16 -b64 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -fsequential -o16 -b8 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -fsequential -o16 -b64 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -fsequential -o16 -b256 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -fsequential -o16 -b1024 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -frandom -o16 -b8 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -frandom -o16 -b64 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -fsequential -o16 -b8 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -fsequential -o16 -b64 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -fsequential -o16 -b256 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -fsequential -o16 -b1024 -LS -F%1
timeout /T 1400
rem 32 Outstanding IOs
sqlio -kW -s1200 -frandom -o32 -b8 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -frandom -o32 -b64 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -fsequential -o32 -b8 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -fsequential -o32 -b64 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -fsequential -o32 -b256 -LS -F%1
timeout /T 1400
sqlio -kW -s1200 -fsequential -o32 -b1024 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -frandom -o32 -b8 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -frandom -o32 -b64 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -fsequential -o32 -b8 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -fsequential -o32 -b64 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -fsequential -o32 -b256 -LS -F%1
timeout /T 1400
sqlio -kR -s1200 -fsequential -o32 -b1024 -LS -F%1
timeout /T 1400
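And for item 3f above, a rough way to count VLFs for the current database (DBCC LOGINFO returns one row per VLF; the temp table shape is an assumption tied to the version, as noted in the comment):

-- Rough VLF count for item 3f: DBCC LOGINFO returns one row per VLF.
-- Column list below matches SQL Server 2005/2008; SQL Server 2012 adds a
-- RecoveryUnitId column in front, so add that column first if needed.
CREATE TABLE #loginfo
(
    FileId      INT,
    FileSize    BIGINT,
    StartOffset BIGINT,
    FSeqNo      BIGINT,
    [Status]    INT,
    Parity      INT,
    CreateLSN   NUMERIC(38, 0)
);

INSERT INTO #loginfo
EXEC ('DBCC LOGINFO');

SELECT COUNT(*) AS vlf_count
FROM #loginfo;

DROP TABLE #loginfo;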
May 12, 2012 at 8:20 pm
Thank you all for your responses. I am very grateful for the amount and level of detail in all the replies, and especially for your time. There is nowhere else like SSC.
Quick status: I am backing up to 4 MSA drives over the network, from the active node to a passive node of a SQL cluster, with no luxury of a compression tool (LiteSpeed, HyperBac, etc.). That takes 48 hours. I realize that backing up over the network is definitely a major issue contributing to the 48-hour horror. I'm waiting for a scheduled shutdown in a week so I can fail the instance over to the node where all 4 MSA drives are currently located. It used to take around 13 hours with that setup (SQL and the MSAs all on one node).
Reading the responses, it sounds like even 13 hours with a native backup is too much for 3TB. I will try all the suggestions provided here and report back on any improvement. Thanks.