November 30, 2016 at 3:05 pm
RAMDisk used to exist in Windows NT. I used it and it worked really well. Then, it was eliminated in Windows Server 2003, as I recall.
You would block off a certain amount of RAM, proclaim it to be a disk drive, then access it like it was a disk drive with folders and files. It worked well.
Context:
My company's main server has a bottleneck associated with TempDB. It's a perfect storm: lots of little tables with small amounts of data being created and dropped frequently. We've already implemented trace flag 1118 (T1118), to no avail. If we can take TempDB off the disk drive and create it in RAM, running at RAM speeds, it might go a long way toward eliminating this server bottleneck.
Does anyone know of a reliable RAMDisk implementation that could be invoked from a stored procedure on SQL Server startup?
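One wrinkle on the startup question: tempdb is recreated every time the instance starts, so the RAM disk has to exist before the SQL Server service comes up, and the file relocation is a one-time ALTER rather than something a startup procedure can do in time. A startup procedure can only verify the drive after the fact. A hedged sketch, assuming a hypothetical RAM-disk drive letter R: and the default tempdb logical file names:

```sql
-- One-time: repoint tempdb at the RAM disk.  Takes effect at the next
-- SQL Server restart; R:\ must exist before the service starts.
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'R:\tempdb.mdf');
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'R:\templog.ldf');
GO

-- Optional safety net: a startup procedure (must live in master, and the
-- 'scan for startup procs' option must be enabled) that logs a warning if
-- tempdb did not land on the expected drive.
USE master;
GO
CREATE PROCEDURE dbo.usp_CheckRamDisk   -- hypothetical name
AS
BEGIN
    IF NOT EXISTS (SELECT 1
                   FROM sys.master_files
                   WHERE database_id = DB_ID('tempdb')
                     AND physical_name LIKE 'R:\%')
        RAISERROR('tempdb is not on the expected RAM disk.', 16, 1) WITH LOG;
END;
GO
EXEC sp_procoption @ProcName   = N'dbo.usp_CheckRamDisk',
                   @OptionName = 'startup',
                   @OptionValue = 'on';
```

In practice the RAM-disk driver is usually made a Windows service that the SQL Server service depends on, so the drive is guaranteed to exist first.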
November 30, 2016 at 3:59 pm
Gail Wanabee (11/30/2016)
RAMDisk used to exist in Windows NT. I used it and it worked really well. Then, it was eliminated in Windows Server 2003, as I recall. You would block off a certain amount of RAM, proclaim it to be a disk drive, then access it like it was a disk drive with folders and files. It worked well.
Context:
My company's main server has a bottleneck issue associated with the TempDB. Perfect storm, lots of little tables with small amounts of data, being created and dropped frequently. Already implemented T1118, to no avail. If we can eliminate the disk drive for the TempDB and create it in RAM to run at RAM speeds, it might go a long way to eliminating this server bottleneck.
Does anyone know of a reliable RAMDisk implementation, that could be invoked from a stored procedure on SQL Server startup?
You have to understand that TempDB is already in RAM unless something gets too large and spills to disk. To prevent the spill you'd need a lot of RAM, and the number of spillovers won't be that great for a lot of little tables. It would be better to get an SSD if you want TempDB to be "in memory" for everything, so that you're not stealing from main memory, but I've never seen huge improvements there. It would probably be better to put the TRN files on SSD.
YMMV depending on the code and how many spools and hash joins occur.
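The point about spools and hash joins can be checked directly: tempdb accounts for internal-object pages (sort/hash spills, spools, worktables) separately from user-created #temp tables. A quick sketch to see which kind of traffic dominates:

```sql
-- Break down current tempdb reservations by object class.
-- Pages are 8 KB each, hence the * 8 to express the result in KB.
SELECT
    SUM(user_object_reserved_page_count)     * 8 AS user_object_kb,     -- #temp tables, table variables
    SUM(internal_object_reserved_page_count) * 8 AS internal_object_kb, -- spills, spools, worktables
    SUM(version_store_reserved_page_count)   * 8 AS version_store_kb    -- row versioning
FROM tempdb.sys.dm_db_file_space_usage;
```

If internal_object_kb dwarfs user_object_kb under load, spills and spools are the real consumers, not the little tables.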
--Jeff Moden
Change is inevitable... Change for the better is not.
November 30, 2016 at 5:49 pm
Thanks for the response, Jeff. I already know everything you wrote and you're correct.
The problem is a little more complicated than I've indicated and I wanted to stick to basics to keep the subject simple and not bog people down in the details of a technology set that is not well documented.
Some of the details:
Our server is RAM rich right now. We're not using all the RAM we have and the 8 TempDB data files are near empty most of the time.
We use a system named Simplivity. It's a hybrid disk/SAN system that does some unusual things under the hood, especially involving disk writes, and therein lies the problem. The I/O stalls we're seeing in Spotlight and verifying in PERFMON are focused primarily on I/O to and from the TempDB log file. These are serious delays (80 to 170+ ms), far slower than RAM, so I'm pretty sure the Simplivity disk I/O system is involved in some way.
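Those Spotlight/PERFMON stall numbers can be cross-checked from inside SQL Server, which accumulates per-file stall totals since the last restart. A sketch:

```sql
-- Average I/O stall per read and per write, per database file,
-- since the last instance restart.  NULLIF guards against divide-by-zero
-- on files that have seen no I/O yet.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_stall_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY avg_write_stall_ms DESC;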
For the immediate future, we are going to be using this product, and to its credit, it does some things extremely well. But we can't get at this one problem any other way that I can see, and the manufacturer insists that the system is working as advertised. And for more money, any solution can be purchased.
That's unacceptable so I am working on ways to exhaust every possibility. If we can find a way to remove the TempDB files from the Simplivity system, we can eliminate the biggest choke point in the entire production environment.
A RAM-resident TempDB may not solve every problem, but it will eliminate the TempDB files as a variable associated with the Simplivity system.
November 30, 2016 at 10:21 pm
If it has a problem with disk writes, then TempDB is probably the least of your worries. Have you looked into what that system is doing to log file writes?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 1, 2016 at 4:46 am
Options/thoughts:
1) RTFM (Read the "Fine" Manual) for Simplivity and find out how to force certain files to stay resident in flash storage, assuming your system has such.
2) Deduplication has NO BUSINESS in a SQL Server file storage system. Every SQL Server client I have had on deduplicated storage has seen wins by moving off of it.
3) Buy a local SSD and put tempdb on that. Depending on the scenario, this is honestly probably the best option.
4) http://www.superspeed.com/servers/ramdisk.php. Also https://en.wikipedia.org/wiki/List_of_RAM_drive_software
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
December 1, 2016 at 10:13 am
Jeff, we've looked as deeply as we can into what is a completely black box implementation and there's not much to see. The I/O delays on TempDB files are real but the root causes are hidden from us. The manufacturer insists that everything is fine. We know better. I won't even talk to them anymore.
Guru, you're correct but replacing the Simplivity system (we inherited it) is not an option right now. Every idea you've suggested, I have pursued with the manufacturer. Without going into all the details, I'll just say that we were told that a lot more money spent on their solution would solve all our problems. Our upper management has become fully aware of the system's limitations and as the company grows, it will most likely not be in our plans.
Right now, we have to live with it and since I can't fix it or replace it, I am going to find a way to work around its limitations.
Thank you both for your responses.
December 1, 2016 at 2:31 pm
Gail Wanabee (12/1/2016)
Jeff, we've looked as deeply as we can into what is a completely black box implementation.
Ah... got it. Gotta "love" hyper-convergence appliances. Thanks for the feedback.
Considering it is an appliance and you are flush with memory, a RAM disk might be the only way to get around this for now. I guess my question would be: do you think something like a RAM disk even stands a chance of working as expected, given that the appliance appears to have put the screws to SQL Server's TempDB?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 1, 2016 at 11:42 pm
Can you explain why you cannot put an SSD in the box and drop tempdb on that?
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
December 2, 2016 at 10:40 am
You can't just put a disk in this. It's not like a "server". And it's likely that the disks are already SSDs, although the hosts can be purchased with spinning disks.
I'm not sure how your server is configured, but we have been evaluating hyper-converged systems for 4 months.
I can assure you that I have pounded the evaluation boxes provided to us from two primary hyper-converged vendors, and both have performed beyond very well.
1 TB database, 1000 users, with only 64 GB RAM allocated to the VM, and I am seeing 8-15 ms response times on all the disks in every test.
I have to say that this has to be a configuration issue, at either the VM level or the hyper-converged configuration.
How were the disks provisioned? They need to be thick provisioned, eager-zeroed.
I think you need to look into the configuration of the VM. David Klee, Michael Corey, and Jonathan Kehayias have all written some very detailed articles on configuring VMWare for SQL Server.
Start with VMWare support.
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
December 2, 2016 at 11:09 am
Guru:
Yes. I suggested that exact solution (TempDB on SSDs) to the system manufacturer's reps. (That solution has produced outstanding results on other servers in previous jobs.)
The reps replied that SSD can't be added to the system to be devoted to just one server or just one logical disk drive. The only solution the vendor offered was a forklift replacement of the entire system and SSD replacements for ALL disk space.
That would cost a considerable sum and I won't even propose it to my management because it's not a surgical solution, it's more of an atom bomb approach to solving what is a singular problem in the vendor's system.
December 2, 2016 at 11:22 am
We don't know, Jeff. Right now, we're down to fault isolation strategies.
I intend to completely remove the TempDB database from the Simplivity disk system by placing it in RAM, then evaluate the impact of that.
The TempDB log file is our main chokepoint on this problem but it's not the only problem with I/O stalls. Our 2 large application databases' data files are the other 2 chokepoints.
If removing TempDB from the Simplivity system solves that log file I/O stall problem, then we'll see what's left. Will it help to go back to the vendor? I doubt it, no matter what the results. Their attitude is "It's working correctly; if you want better performance, you need to upgrade." We're not buying their message nor their upgrade.
We've done an exhaustive review of all the settings for the system with their tech people. I've literally spent hours going page by page, paragraph by paragraph, through VMWare's most recent Best Practices document, which is over 50 pages long. I found a few things that were set incorrectly, but nothing that produced any measurable change in performance.
So short of spending a lot more money or changing vendors, what I'm doing is our best option right now.
December 2, 2016 at 11:37 am
Michael, I responded to part of your suggestions in another post, want to complete it here.
The vendor won't tell us a lot about their hyper-convergence solution; it's patented, and I don't think their support people actually know the under-the-hood details, other than it's a black box with a really smart processor for deduping.
Their focus is on write efficiency, and that they have achieved. Watching PERFMON with a 100-second interval, I rarely see the "Average Disk Seconds/Write" metric exceed 3 ms, even under the highest I/O loads, using primarily 7200 rpm disk drives with a small amount of SSD and some very high-speed RAM thrown in for caching.
Reads are the only problem, and primarily on 3 files (2 application database data files, and the TempDB log file).
They claim their system is configured correctly. After personally reviewing the VMWare settings, with our admin staff and their techs involved, I'm convinced those settings are optimized.
There is something else going on, something unique to their particular implementation, that is the problem.
When our company grows, and outgrows the current system, we are almost certainly going to look for a different vendor with a different solution. Right now, we're not there. So I'm doing what my management has asked me to do: squeeze every ounce of performance out of the current system until we need to replace it.
--------
Thanks to all of you for taking the time and energy to respond.
December 2, 2016 at 12:17 pm
Good info. Thanks!
One of the things we did was to reach out to current users of the product. The reviews and recommendations from them couldn't have been stronger. They praised the support, and went into way more detail about their systems, and the performance, than we expected.
So your experience is giving us some pause about implementing this, but it also makes me think that something's just not configured correctly.
That being said, were the boxes you put in place undersized for the load? The quotes we received, and did testing on, were not the bottom tier of hardware.
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
December 2, 2016 at 3:04 pm
Michael L John (12/2/2016)
Good info. Thanks!One of the things we did was to reach out to current users of the product. The reviews and recommendations from them couldn't have been stronger. They praised the support, and went into way more detail about their systems, and the performance, than we expected.
So your experience is giving us some pause about implementing this, but it also makes me think that something's just not configured correctly.
That being said, were the boxes you put in place undersized for the load? The quotes we received, and did testing on, were not the bottom tier of hardware.
I would think that if I/O stalls are a problem, then the hardware must be pretty good - otherwise the I/O demand from the CPUs wouldn't be very high. I see that too often in poorly virtualized environments. That does presume that there isn't an I/O-path bottleneck, which is possible.
To the OP: have you actually tuned the application(s) that are running on this Simplivity system? If not that alone could be responsible for IO getting crushed. Things like mismatched datatypes, functions around columns in WHERE clauses, etc. can literally blow away almost any hardware you throw at it. I have seen such over and over in the last 20+ years of consulting on SQL Server.
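A minimal illustration of the non-sargable patterns mentioned above, using a hypothetical Orders table and OrderDate column: wrapping the column in a function (or forcing an implicit datatype conversion) prevents an index seek, while an equivalent range predicate on the bare column allows one.

```sql
-- Non-sargable: YEAR() on the column forces every row to be evaluated,
-- so an index on OrderDate cannot be seeked.  (dbo.Orders is hypothetical.)
SELECT OrderID
FROM dbo.Orders
WHERE YEAR(OrderDate) = 2016;

-- Sargable rewrite: the same logic as a half-open range on the bare
-- column, which can use an index seek on OrderDate.
SELECT OrderID
FROM dbo.Orders
WHERE OrderDate >= '20160101'
  AND OrderDate <  '20170101';
```

Mismatched datatypes (e.g., comparing an indexed varchar column to an nvarchar parameter) cause the same scan behavior via implicit conversion of the column side.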
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
December 2, 2016 at 5:47 pm
Actually, the I/O demands from the OLTP server are not extremely high. The CPU context switches metric is relatively low; there's not a lot of thread thrashing. The SQL Server waits profile is one of the best I've seen in my entire career, almost textbook perfect.
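For anyone wanting to reproduce that kind of waits review, the cumulative picture comes from sys.dm_os_wait_stats; a sketch, with the caveat that the benign-wait exclusion list below is illustrative, not exhaustive:

```sql
-- Top waits by total wait time since the last restart, excluding a few
-- waits that are normal idle/background behavior rather than bottlenecks.
SELECT TOP (10)
       wait_type,
       wait_time_ms / 1000.0 AS wait_time_s,
       waiting_tasks_count,
       wait_time_ms * 1.0 / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                        N'BROKER_TASK_STOP',
                        N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP')
ORDER BY wait_time_ms DESC;
```

On a system with the I/O stalls described in this thread, PAGEIOLATCH_* and WRITELOG near the top would corroborate the PERFMON numbers.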
I work closely with the Systems Admin group and we know that there isn't an I/O path bottleneck anywhere, at least nothing we can see. The NIC cards are way underutilized and the Simplivity system in general handles the workload very well. Since we moved reporting and data analysis off the OLTP server to a reporting server, our CPU cores run at about 20%-60% utilization; they're not even close to being maxed out.
In answer to your question, I can't tune the application software, but I can recommend changes and they are happening. The current admin and development staff inherited some less than optimally implemented technologies and we're in the process of fixing, refining, and re-architecting.
I have exceptionally detailed knowledge of the load and stress points on, and generated by, the application software and the queries it executes, as well as the databases' architectures and the design problems they have. I've considered all your suggestions and pursued many more points of investigation, but thanks for the suggestions; I'm open to all of them.
Thanks for bringing up the I/O path bottleneck. It is one of the last remaining possibilities, as the people that set this up did some odd things involving 2 NIC cards that seem inexplicable, according to our admin group. They and I will eventually solve the mysteries about that, and it may be part of the problem; I don't know yet. I was told that the last attempt at reimplementing it failed.
My situation has every problem you listed, and many more, on multiple servers and databases. Right now I've got the fires out and I'm dealing with the smoke and embers. My first audit of the main server, the OLTP server, produced a 26-page report detailing the use, misuse, and problems of the OLTP environment and the databases on that server. I still haven't been able to resolve at least 80% of the problems I identified in that audit.
Resolving them is a process, not an event, so I'm still in the middle of the process.
I started with the fundamentals and worked up: the hardware on the physical machine, the VMWare configuration settings, the Windows Server configuration settings, the SQL Server configuration settings, the database level settings, ... I'm still working my way up to the more detailed problems like hundreds of missing indexes, excessive page splits, inadequate index maintenance, and more. But I'm getting there.
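The "hundreds of missing indexes" mentioned above can be enumerated from the missing-index DMVs. A sketch; the ranking expression is a common community heuristic rather than an official metric, and the DMV's suggestions should be reviewed, consolidated, and tested rather than applied blindly:

```sql
-- Missing-index suggestions accumulated since the last restart,
-- ranked by a rough estimate of potential benefit.
SELECT TOP (20)
       mid.statement AS table_name,
       mid.equality_columns,
       mid.inequality_columns,
       mid.included_columns,
       migs.user_seeks,
       migs.avg_total_user_cost,
       migs.avg_user_impact
FROM sys.dm_db_missing_index_details     AS mid
JOIN sys.dm_db_missing_index_groups      AS mig
  ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats AS migs
  ON migs.group_handle = mig.index_group_handle
ORDER BY migs.user_seeks * migs.avg_total_user_cost * migs.avg_user_impact DESC;
```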
Like a fine wine, no problem solved before its time. :^)
But that's why the company hired me, to straighten out their servers, their applications, and to educate the software development staff who are quite talented but not so well educated about writing efficient SQL. We're getting there.
Having started with the fundamentals, the platforms, is why I created this post. I've taken care of all the fundamentals except for I/O stalls. I'm stuck on the I/O stalls on the hardware platform and can't seem to get beyond them, even though they are creating no user-visible problems, so I'm considering non-conventional alternatives, since the manufacturer cannot offer any additional assistance.
I genuinely appreciate all your posts in response to my original one.