September 28, 2016 at 11:04 pm
Comments posted to this topic are about the item The Quiet Zone
September 29, 2016 at 2:09 am
I remember working in one place where the fire alarm going off would cause the server room doors to lock - reasonable, in a way, as you don't want people wandering into a blazing inferno. There was a key you were supposed to turn before going into the server rooms, and turn again when leaving, to ensure that the Halon systems didn't discharge while there were people in the room.
I was told not to worry if the alarm went off while I was in the 3rd floor server / network room - as the fire suppressant would dump so fast, the internal partition wall would blow out... as would the windows. Not a comforting thought. Fortunately, my stuff was mostly housed in a different room, which had other problems...
Thomas Rushton
blog: https://thelonedba.wordpress.com
September 29, 2016 at 2:42 am
ThomasRushton (9/29/2016)
I was told not to worry if the alarm went off while I was in the 3rd floor server / network room - as the fire suppressant would dump so fast, the internal partition wall would blow out... as would the windows.
I can't stop laughing at the thought that somebody thought this was a reassuring thing to say :-D.
September 29, 2016 at 4:44 am
130 dB? I am an audio tech on the side and I have trouble imagining that level of sound. As your link says, think jet engines at close range.
I have a vision etched in my memory of, back in the 1970s, watching through a window while the Halon system in our campus computer center dumped (false trigger), with one poor guy still in the machine room pressing the (non-functional) override switch with all his might. The system was wired incorrectly, it turns out.
We were down for quite a while. No humans were injured; just shaken up a bit. The machine room was a mess. Every bit of dust was redeposited somewhere else. It was loud, but not 130 dB loud!
(Edit: that was the '70s, not the '80s!)
September 29, 2016 at 7:05 am
Megan Brooks (9/29/2016)
130 dB? I am an audio tech on the side and I have trouble imagining that level of sound. As your link says, think jet engines at close range.
I recall hiking out in a desert area that had this one big hill. I was near the top when a jet from our local AF base zoomed by. The sound was so loud I thought I was about to be hit.
September 29, 2016 at 11:00 am
Interesting comment about how SSDs might not be susceptible to this.
SSDs have a limited number of writes. Hard drive life is measured differently, by MTBF (mean time between failures). I would be interested in hearing how others have found SSDs to perform in terms of reliability, either in this thread or another one. I think enough organizations have been using them for long enough now to draw some conclusions about reliability.
I know there are cost issues, and that unusually loud sounds or vibration are not common, and there are a lot of other factors to consider, but long term I do see most, if not all, storage going that route.
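To put rough numbers on the difference, here's a minimal back-of-the-envelope sketch in Python; every figure in it (capacity, P/E cycles, write amplification, daily write volume) is an assumption picked purely for illustration, not a number from any particular drive:

# Illustrative SSD endurance estimate -- every number here is an assumption.
capacity_gb = 1000          # 1 TB drive
pe_cycles = 3000            # rated program/erase cycles per cell (varies by NAND type)
write_amplification = 2.0   # controller overhead: flash writes per host write

# Total host data the drive can absorb before hitting its rated endurance (TBW).
tbw = capacity_gb * pe_cycles / write_amplification / 1000   # terabytes written
daily_writes_gb = 200       # hypothetical server workload

years = (tbw * 1000) / daily_writes_gb / 365
print(f"~{tbw:.0f} TBW rated endurance, roughly {years:.1f} years at {daily_writes_gb} GB/day")

An HDD's MTBF, by contrast, is a statistical failure rate across a large population of drives rather than a per-drive write budget, which is part of why the two figures are hard to compare directly.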
Dave
September 29, 2016 at 11:16 am
djackson 22568 (9/29/2016)
Interesting comment about how SSDs might not be susceptible to this.
SSDs have a limited number of writes. Hard drive life is measured differently, by MTBF (mean time between failures). I would be interested in hearing how others have found SSDs to perform in terms of reliability, either in this thread or another one. I think enough organizations have been using them for long enough now to draw some conclusions about reliability.
I know there are cost issues, and that unusually loud sounds or vibration are not common, and there are a lot of other factors to consider, but long term I do see most, if not all, storage going that route.
Not sure SSDs are susceptible to this. I was trying to point out they likely will be susceptible to something.
Reliability is dramatically improved. I use almost all SSDs in 2 laptops and a desktop as well as externals. Have found them to be very reliable, lasting years. However, when they go, they usually just go without much warning.
People initially had issues in servers, as early SSDs kept rewriting the same spots. In the last 4-5 years, most use wear leveling and have MTBFs measured in years. I have no issue using them.
3D SSDs, a newer technology, will change our world: very low latencies and high capacities. No idea on reliability yet, but they're coming with good warranties.
http://promotions.newegg.com/Samsung/14-4393/samsung-highend.html
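For anyone who hasn't seen the idea spelled out, here is a minimal toy sketch of what wear leveling buys you, assuming a made-up flash translation layer (no real controller works exactly like this): instead of rewriting the same physical block over and over, each write is remapped to the least-worn block.

NUM_BLOCKS = 8
wear_naive = [0] * NUM_BLOCKS     # early behaviour: a logical block always maps to the same physical block
wear_leveled = [0] * NUM_BLOCKS   # wear leveling: each write goes to the least-worn physical block

for _ in range(10000):            # workload that keeps rewriting the same logical block
    wear_naive[0] += 1            # naive mapping hammers one physical block
    target = min(range(NUM_BLOCKS), key=lambda b: wear_leveled[b])
    wear_leveled[target] += 1     # leveled mapping spreads the same writes across all blocks

print("max wear without leveling:", max(wear_naive))    # 10000 writes on a single block
print("max wear with leveling:   ", max(wear_leveled))  # roughly 10000 / 8 per block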
September 29, 2016 at 11:27 am
Steve Jones - SSC Editor (9/29/2016)
djackson 22568 (9/29/2016)
Interesting comment about how SSDs might not be susceptible to this.
SSDs have a limited number of writes. Hard drive life is measured differently, by MTBF (mean time between failures). I would be interested in hearing how others have found SSDs to perform in terms of reliability, either in this thread or another one. I think enough organizations have been using them for long enough now to draw some conclusions about reliability.
I know there are cost issues, and that unusually loud sounds or vibration are not common, and there are a lot of other factors to consider, but long term I do see most, if not all, storage going that route.
Not sure SSDs are susceptible to this. I was trying to point out they likely will be susceptible to something.
Reliability is dramatically improved. I use almost all SSDs in 2 laptops and a desktop as well as externals. Have found them to be very reliable, lasting years. However, when they go, they usually just go without much warning.
People initially had issues in servers, as early SSDs kept rewriting the same spots. In the last 4-5 years, most use wear leveling and have MTBFs measured in years. I have no issue using them.
3D SSDs, a newer technology, will change our world: very low latencies and high capacities. No idea on reliability yet, but they're coming with good warranties.
http://promotions.newegg.com/Samsung/14-4393/samsung-highend.html
Thanks, Steve. I didn't communicate properly - I didn't mean to sound like I thought SSDs are susceptible to this. I agree they are likely at risk for other things. To me the big one is the way they are written to, and how they fail.
I have had a couple fail in laptops, but manufacturers aren't buying the best quality SSDs for laptops and desktops. When I replaced them I went with a brand I trust and have had no issues since. In fact, I ordered my current laptop with a hard drive and bought a replacement SSD before the machine was even built. The only issue I have is with the crap our Windows team pushes out to me every patch Tuesday... but that isn't the drive.
Our servers are where I am concerned. I am not a SAN expert, but from what I know about RAID 5, if I had a server with three SSDs I think they would be likely to all fail within hours of each other, which scares me just a wee bit. I am hoping that as we all see failures, we see them happen more randomly. Failures that predictably happen at the same time, but at an unknown time, are far worse than failures at random, unpredictable times. I have never seen a RAID array have all of its drives fail at once - in my own experience, never more than one at a time. Mathematically speaking, though, an array of SSDs should fail together, assuming I understand them correctly.
Dave
September 29, 2016 at 12:05 pm
I remember interviewing at a large chemical plant and noticing that the drives for their brand new HP 3000 twins were down as I was given a tour of the data center. It was news to the manager giving me the tour, but I may have scored a few points, as I got the job. Months later, the company was taking down an obsolete building less than a quarter mile away. We would watch from our office windows as the wrecking ball knocked down walls; we heard the sound after we saw the contact. And yes, our platters chattered from the vibration. It was a rough 2 weeks!
September 29, 2016 at 12:30 pm
David,
No idea if SSD MTBF is linked to age or batch. That's interesting. I have seen the same thing with HDDs in servers, and in fact, in large enterprises we've always had 2-3 vendors for HDDs and made sure to mix drives from different batches/brands within a RAID set to limit multiple failures.
SSDs used to be bad in servers, but they've gotten better. Lots of new SAN and server systems go with flash, and I hear no issues from that, so I'm good with using them.
I haven't had issues in laptops, but I'm lucky that I tend to get more expensive ones, so maybe they're using better quality? I've had a few externals fail, but I've learned to carry at least two.
September 29, 2016 at 12:57 pm
The thing to remember about MTBF numbers is that it's a mean, not an absolute. So you could have (using Dave's RAID-5 example) 3 SSDs in a RAID-5 array:
Drive A might last 9 years until the server is put in the trash
Drive B might fail after 3 months
Drive C might last 3 years before failing
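A quick way to see how much spread a single "mean" figure allows, assuming (purely for illustration) exponentially distributed lifetimes around the same average:

import random

random.seed(1)
MEAN_LIFE_YEARS = 5.0   # purely illustrative mean lifetime, not a real spec figure

# Same mean, three independent draws from an exponential distribution:
lifetimes = sorted(random.expovariate(1 / MEAN_LIFE_YEARS) for _ in range(3))
for label, years in zip("ABC", lifetimes):
    print(f"Drive {label}: fails after ~{years:.1f} years")

Same mean, very different individual outcomes - which is all an MTBF promises.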
I know some time back people were concerned that drive (HDD) capacities were getting to the point where, even with a RAID-6 array and hot spares, one drive could fail and then, during the rebuild, another drive could fail, leaving the array without enough drives to rebuild. The concern was that in the time it took to rebuild one drive, another would fail (and as the drives were often from the same batch, in theory they might all fail around the same time).
I'm sure it's out there somewhere, but it would be interesting to see a comparison of rebuild times for two otherwise identical drive arrays (same controller, same capacity), one using HDDs and one using SSDs.
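The rebuild-window risk can be sketched with a simple calculation. Everything below is an assumption chosen only to show the shape of the math, and it treats failures as independent, which understates the danger for same-batch drives:

import math

# Assumed inputs -- illustrative only, not measured values.
n_remaining = 7           # drives left in the array after the first failure
mtbf_hours = 1_000_000    # per-drive MTBF
rebuild_hours = {"HDD": 24.0, "SSD": 4.0}   # guessed rebuild windows

def p_second_failure(window_hours: float) -> float:
    """Chance that at least one surviving drive fails during the rebuild window,
    treating failures as independent and exponentially distributed."""
    combined_rate = n_remaining / mtbf_hours
    return 1 - math.exp(-combined_rate * window_hours)

for drive_type, hours in rebuild_hours.items():
    print(f"{drive_type}: ~{p_second_failure(hours):.4%} chance of a second failure mid-rebuild")

Longer rebuilds and bigger arrays push that number up, and correlated same-batch failures make the real-world odds worse than this independent-failure estimate.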
As for the silence of a server room, I've had that happen at a previous job when we lost power. The UPS units were barely able to hang on long enough for me to go the couple hundred feet from my desk to the server room to start shutting things down (yes, I didn't have a UPS on my desktop, neither did anyone else, cheap boss) Even more fun was doing the shutdowns in the dark, no emergency lights in the server room (which was part of a waiting area, with a glass wall between the racks and the waiting room seats.)
September 29, 2016 at 1:25 pm
jasona.work (9/29/2016)
The thing to remember about MTBF numbers is that it's a mean, not an absolute. So you could have (using Dave's RAID-5 example) 3 SSDs in a RAID-5 array:
Drive A might last 9 years until the server is put in the trash
Drive B might fail after 3 months
Drive C might last 3 years before failing
I know some time back people were concerned that drive (HDD) capacities were getting to the point where, even with a RAID-6 array and hot spares, one drive could fail and then, during the rebuild, another drive could fail, leaving the array without enough drives to rebuild. The concern was that in the time it took to rebuild one drive, another would fail (and as the drives were often from the same batch, in theory they might all fail around the same time).
I'm sure it's out there somewhere, but it would be interesting to see a comparison of rebuild times for two otherwise identical drive arrays (same controller, same capacity), one using HDDs and one using SSDs.
As for the silence of a server room, I've had that happen at a previous job when we lost power. The UPS units were barely able to hang on long enough for me to go the couple hundred feet from my desk to the server room to start shutting things down (yes, I didn't have a UPS on my desktop, neither did anyone else, cheap boss) Even more fun was doing the shutdowns in the dark, no emergency lights in the server room (which was part of a waiting area, with a glass wall between the racks and the waiting room seats.)
What I am trying to say is this: Spinning hard drives (IMO) tend to fail at random times. So if I purchase 20 drives and install them all at once, in the same array of disks, my expectation is that each drive would fail at a different time. Now Steve pointed out that it is possible there is a correlation between batches. Maybe a batch of drives would all fail at the same time. However, when we initially build a server the drives might be from a single batch, but as we replace them they probably are not. Since I have never heard of an entire RAID array failing at once, this has never concerned me. Maybe it should.
Now when it comes to SSDs, each cell can only take a fixed number of writes. When SSDs first came out they were designed to be written to sequentially. So if there were 100 cells, the drive would start at cell 1, move forward to cell 100, and then start over at the first available cell (deleted data). If you think about how RAID works, then I think it is correct to assume that most, if not all, of the drives in the array would write to the same cell each time - e.g., as each drive gets to cell 100, the others would as well. Deleted data is likely to be present on the same cells. I am ignoring parity - and since I am not a RAID expert, I might be wrong about how this works. If my assumptions are correct (big if!) then I am forced to conclude that there is a higher chance of multiple failures at once.
Hence my question - we have had what I believe to be sufficient time for servers to have had this occur. Since I am not seeing any news articles about it, and since I haven't seen anyone respond to my question saying it happened to them, I am starting to believe that there is something I am unaware of. Maybe manufacturers changed their write algorithms, maybe it works differently than I believe, maybe there are additional factors I don't know about.
So going back to what you posted - I believe your example is valid for spinning disks. I am under the impression that SSDs are likely to fail in much closer proximity to each other. I would love to hear that I am wrong. Backups are great, but nobody wants to rely on those. We all prefer to not have failures, especially something as bad as this.
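To make that concern concrete, here is a hedged toy simulation (every parameter invented for illustration) comparing how tightly failures cluster when drives wear out from an identical write stream versus failing independently at random:

import random

random.seed(7)
N_DRIVES = 3
TRIALS = 1000

def gap_identical_wear():
    """SSD-style wear-out: all drives see the same writes, so lifetimes differ only by
    a small assumed manufacturing variation (~3% here)."""
    base_years = random.uniform(4, 6)   # years for this workload to exhaust the rated endurance
    lives = [base_years * random.gauss(1.0, 0.03) for _ in range(N_DRIVES)]
    return max(lives) - min(lives)

def gap_independent():
    """HDD-style random failures: independent exponential lifetimes with the same mean."""
    lives = [random.expovariate(1 / 5.0) for _ in range(N_DRIVES)]
    return max(lives) - min(lives)

def average(fn):
    return sum(fn() for _ in range(TRIALS)) / TRIALS

print(f"average first-to-last failure gap, identical-wear model: {average(gap_identical_wear):.2f} years")
print(f"average first-to-last failure gap, independent model:    {average(gap_independent):.2f} years")

Real controllers add over-provisioning, differing spare-block pools, and mixed firmware, all of which add more variation than this toy model assumes - which may be part of why whole-array SSD failures haven't become a common story.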
Dave
September 29, 2016 at 1:48 pm
Still find it shocking how many companies don't do a proper DR test. Criminal.
Gaz
-- Stop your grinnin' and drop your linen...they're everywhere!!!
September 29, 2016 at 1:51 pm
djackson 22568 (9/29/2016)
jasona.work (9/29/2016)
The thing to remember about MTBF numbers is that it's a mean, not an absolute. So you could have (using Dave's RAID-5 example) 3 SSDs in a RAID-5 array:
Drive A might last 9 years until the server is put in the trash
Drive B might fail after 3 months
Drive C might last 3 years before failing
I know some time back people were concerned that drive (HDD) capacities were getting to the point where, even with a RAID-6 array and hot spares, one drive could fail and then, during the rebuild, another drive could fail, leaving the array without enough drives to rebuild. The concern was that in the time it took to rebuild one drive, another would fail (and as the drives were often from the same batch, in theory they might all fail around the same time).
I'm sure it's out there somewhere, but it would be interesting to see a comparison of rebuild times for two otherwise identical drive arrays (same controller, same capacity), one using HDDs and one using SSDs.
As for the silence of a server room, I've had that happen at a previous job when we lost power. The UPS units were barely able to hang on long enough for me to go the couple hundred feet from my desk to the server room to start shutting things down (yes, I didn't have a UPS on my desktop, neither did anyone else, cheap boss) Even more fun was doing the shutdowns in the dark, no emergency lights in the server room (which was part of a waiting area, with a glass wall between the racks and the waiting room seats.)
What I am trying to say is this: Spinning hard drives (IMO) tend to fail at random times. So if I purchase 20 drives and install them all at once, in the same array of disks, my expectation is that each drive would fail at a different time. Now Steve pointed out that it is possible there is a correlation between batches. Maybe a batch of drives would all fail at the same time. However, when we initially build a server the drives might be from a single batch, but as we replace them they probably are not. Since I have never heard of an entire RAID array failing at once, this has never concerned me. Maybe it should.
Now when it comes to SSDs, each cell can only take a fixed number of writes. When SSDs first came out they were designed to be written to sequentially. So if there were 100 cells, the drive would start at cell 1, move forward to cell 100, and then start over at the first available cell (deleted data). If you think about how RAID works, then I think it is correct to assume that most, if not all, of the drives in the array would write to the same cell each time - e.g., as each drive gets to cell 100, the others would as well. Deleted data is likely to be present on the same cells. I am ignoring parity - and since I am not a RAID expert, I might be wrong about how this works. If my assumptions are correct (big if!) then I am forced to conclude that there is a higher chance of multiple failures at once.
Hence my question - we have had what I believe to be sufficient time for servers to have had this occur. Since I am not seeing any news articles about it, and since I haven't seen anyone respond to my question saying it happened to them, I am starting to believe that there is something I am unaware of. Maybe manufacturers changed their write algorithms, maybe it works differently than I believe, maybe there are additional factors I don't know about.
So going back to what you posted - I believe your example is valid for spinning disks. I am under the impression that SSDs are likely to fail in much closer proximity to each other. I would love to hear that I am wrong. Backups are great, but nobody wants to rely on those. We all prefer to not have failures, especially something as bad as this.
Now I see where you were going.
I don't think the drives would all write to the same cells at the same time. But, like you, I'm not a RAID expert. If I remember correctly from when I did work on servers, if you've got a small file (say, something that fits in one allocation unit) then that file only exists on one of the drives (while the parity information exists on another). So now your drives aren't "aligned" the way you're picturing, and thus are less likely to fail together in a short span of time.
Which, of course, isn't to say that they *can't* all fail in a short span of time.
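A toy illustration of that point, assuming a textbook RAID 5 layout with rotating parity rather than any particular controller's implementation: a write that fits in a single chunk touches one data disk plus that stripe's parity disk, so over time the per-disk write patterns drift apart instead of staying in lockstep.

N_DISKS = 3   # the three-disk RAID 5 example from earlier in the thread

def raid5_placement(chunk_index: int):
    """Map a logical chunk to (data_disk, parity_disk) with parity rotating per stripe."""
    stripe = chunk_index // (N_DISKS - 1)            # each stripe holds N-1 data chunks
    parity_disk = (N_DISKS - 1 - stripe) % N_DISKS   # parity moves to a different disk each stripe
    data_disks = [d for d in range(N_DISKS) if d != parity_disk]
    return data_disks[chunk_index % (N_DISKS - 1)], parity_disk

for chunk in range(6):   # a handful of small, single-chunk writes
    data_disk, parity_disk = raid5_placement(chunk)
    print(f"chunk {chunk}: data on disk {data_disk}, parity on disk {parity_disk}")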
September 29, 2016 at 2:10 pm
jasona.work (9/29/2016)
djackson 22568 (9/29/2016)
jasona.work (9/29/2016)
The thing to remember about MTBF numbers is that it's a mean, not an absolute. So you could have (using Dave's RAID-5 example) 3 SSDs in a RAID-5 array:
Drive A might last 9 years until the server is put in the trash
Drive B might fail after 3 months
Drive C might last 3 years before failing
I know some time back people were concerned that drive (HDD) capacities were getting to the point where, even with a RAID-6 array and hot spares, one drive could fail and then, during the rebuild, another drive could fail, leaving the array without enough drives to rebuild. The concern was that in the time it took to rebuild one drive, another would fail (and as the drives were often from the same batch, in theory they might all fail around the same time).
I'm sure it's out there somewhere, but it would be interesting to see a comparison of rebuild times for two otherwise identical drive arrays (same controller, same capacity), one using HDDs and one using SSDs.
As for the silence of a server room, I've had that happen at a previous job when we lost power. The UPS units were barely able to hang on long enough for me to go the couple hundred feet from my desk to the server room to start shutting things down (yes, I didn't have a UPS on my desktop, neither did anyone else, cheap boss) Even more fun was doing the shutdowns in the dark, no emergency lights in the server room (which was part of a waiting area, with a glass wall between the racks and the waiting room seats.)
What I am trying to say is this: Spinning hard drives (IMO) tend to fail at random times. So if I purchase 20 drives and install them all at once, in the same array of disks, my expectation is that each drive would fail at a different time. Now Steve pointed out that it is possible there is a correlation between batches. Maybe a batch of drives would all fail at the same time. However, when we initially build a server the drives might be from a single batch, but as we replace them they probably are not. Since I have never heard of an entire RAID array failing at once, this has never concerned me. Maybe it should.
Now when it comes to SSDs, each cell can only take a fixed number of writes. When SSDs first came out they were designed to be written to sequentially. So if there were 100 cells, the drive would start at cell 1, move forward to cell 100, and then start over at the first available cell (deleted data). If you think about how RAID works, then I think it is correct to assume that most, if not all, of the drives in the array would write to the same cell each time - e.g., as each drive gets to cell 100, the others would as well. Deleted data is likely to be present on the same cells. I am ignoring parity - and since I am not a RAID expert, I might be wrong about how this works. If my assumptions are correct (big if!) then I am forced to conclude that there is a higher chance of multiple failures at once.
Hence my question - we have had what I believe to be sufficient time for servers to have had this occur. Since I am not seeing any news articles about it, and since I haven't seen anyone respond to my question saying it happened to them, I am starting to believe that there is something I am unaware of. Maybe manufacturers changed their write algorithms, maybe it works differently than I believe, maybe there are additional factors I don't know about.
So going back to what you posted - I believe your example is valid for spinning disks. I am under the impression that SSDs are likely to fail in much closer proximity to each other. I would love to hear that I am wrong. Backups are great, but nobody wants to rely on those. We all prefer to not have failures, especially something as bad as this.
Now I see where you were going.
I don't think the drives would all write to the same cells at the same time. But, like you, I'm not a RAID expert. If I remember correctly from when I did work on servers, if you've got a small file (say, something that fits in one allocation unit) then that file only exists on one of the drives (while the parity information exists on another). So now your drives aren't "aligned" the way you're picturing, and thus are less likely to fail together in a short span of time.
Which, of course, isn't to say that they *can't* all fail in a short span of time.
It depends on which RAID level. RAID 1 is mirroring; RAID 5 is striping with parity. So for RAID 1, everything is duplicated. For RAID 5, you have N disks, and I believe the data is striped evenly across N-1 of them, with the last disk's worth of space used for parity. I am not sure whether parity always sits on one drive, though; I think it moves around with each write or something similar. So N-1 disks get an equal portion of the data written to them, but I'm not sure about the parity...
I could of course read up on it again, but the point is that I am looking for real world experiences rather than theory. I don't know that reading a RAID article is going to give me that.
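As an aside on the parity question, the standard textbook distinction is that RAID 4 keeps parity on one dedicated disk while RAID 5 rotates it from stripe to stripe, so no single disk absorbs every parity write. A toy count (layout simplified purely for illustration):

from collections import Counter

N_DISKS = 4
N_STRIPES = 12

# RAID 4: parity always lives on one dedicated disk (the last one here).
raid4_parity_writes = Counter(N_DISKS - 1 for _ in range(N_STRIPES))

# RAID 5: parity rotates to a different disk on each stripe.
raid5_parity_writes = Counter(stripe % N_DISKS for stripe in range(N_STRIPES))

print("RAID 4 parity writes per disk:", dict(raid4_parity_writes))   # one disk takes them all
print("RAID 5 parity writes per disk:", dict(raid5_parity_writes))   # spread evenly across disks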
Dave