September 23, 2011 at 5:30 pm
Mike C (9/23/2011)
I'm surprised you're mixing and matching varchar and nvarchar data types. Normally you'd see one or the other in a table. That's a nitpicky thing, but can cause you some pain in development later.
Heheh, Mike, if we were going to nitpick...
[age] [nvarchar](50) NULL,
What is the expected value here, Paleolithic?
[sex] [nvarchar](50) NULL,
Are we describing the method?
[pic] [nvarchar](100) NULL,
There are already two LOBs in here; why suddenly switch to a pathing description?
[searchterm] [nvarchar](400) NOT NULL,
Should most likely be normalized but I'd like to see the contained data.
[pubDate] [varchar](50) NOT NULL,
Why is this not DateTime?
....
I've decided I'm going to use this table build as a training mechanism for my teams here at the office. It should be a good method for me to determine what-all still needs to be trained for my local staff and Junior Devs.
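Just to make the point concrete, here's roughly what I'd expect those columns to look like. This is only a sketch -- the exact types and lengths depend on the data actually being stored, and the column names are simply carried over from the posted DDL:
-- Rough sketch; adjust to the real domain of the data.
[age] [tinyint] NULL, -- 0-255 covers any plausible human age
[sex] [char](1) NULL, -- 'M'/'F'/'U', or a tinyint key into a lookup table
[pic] [varchar](260) NULL, -- a file path rarely needs Unicode
[searchterm] [nvarchar](400) NOT NULL, -- candidate for its own lookup table
[pubDate] [datetime] NOT NULL, -- store dates as dates, not strings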
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
September 23, 2011 at 5:36 pm
Mike C (9/23/2011)
Ahhh, my point is that the tail isn't always the best drop point. In the application I tuned not long ago the tail was the drop point (identity column CI). It was actually a Java app that pushed 100 simultaneous threads all trying to INSERT to the table at the same time. Essentially 1 INSERT ran and 99 waited while the INSERT blocked the last page. When that one finished the next one ran and another thread on the Java side sent another INSERT to wait in line.
Yeah, it's the phenomenon also known as the 'hotspot'. I agree they can cause some issues depending on usage.
I noticed the data being sent was in no particular order, and had a nice random distribution to it, so I changed the CI to the business key of the inbound data. This caused a lot of page splits on the initial inserts, ... Basically blocking was all but eliminated pretty quickly due to the random distribution of the data and the new CI.
Similar solution but maintainable: use a fill factor of ~60-65%, depending on row length. Leave enough room for 3-10 rows per page between each maintenance window.
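Something along these lines, for example -- a sketch only, with the index and table names made up for illustration:
-- Rebuild the clustered index leaving roughly 35% free space per page so that
-- each page can absorb a handful of new rows between maintenance windows.
ALTER INDEX [CIX_entries] ON [dbo].[entries]
REBUILD WITH (FILLFACTOR = 65);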
... You lose the major advantage of partitions during reads since the modulo of the surrogate key is not related to how people will query your data in the real world, but it should provide better performance on inserts.
Yep, but that's an extreme case. Another tactic is to break the partitions on a reasonable key and then tail-insert into each partition, which can give you a better scenario. Both of those methods can break badly, though, if you do multi-row inserts, as you still end up hotspotting.
Your points, though, are well made. As usual, it always 'depends'. Database construction is still half art form.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
September 23, 2011 at 6:08 pm
Evil Kraig F (9/23/2011)
Addendum on a second look:
[hashField] AS ([dbo].[getMd5Hash]([entryId])),
As a calculated column, non-persisted?
This table is just RIPE for pain.
Good catch on that. Both the fact that HASHBYTES isn't used directly and that the column isn't persisted. I agree that for peak performance calculating the hash and sending it with the insert would be best, but for most cases calculating it and persisting it is fast enough.
CEWII
September 23, 2011 at 7:41 pm
Evil Kraig F (9/23/2011)
Mike C (9/23/2011)
I'm surprised you're mixing and matching varchar and nvarchar data types. Normally you'd see one or the other in a table. That's a nitpicky thing, but can cause you some pain in development later.
Heheh, Mike, if we were going to nitpick...
[age] [nvarchar](50) NULL,
What is the expected value here, Paleolithic?
[sex] [nvarchar](50) NULL,
Are we describing the method?
[pic] [nvarchar](100) NULL,
There are already two LOBs in here; why suddenly switch to a pathing description?
[searchterm] [nvarchar](400) NOT NULL,
Should most likely be normalized but I'd like to see the contained data.
[pubDate] [varchar](50) NOT NULL,
Why is this not DateTime?
....
I've decided I'm going to use this table build as a training mechanism for my teams here at the office. It should be a good method for me to determine what-all still needs to be trained for my local staff and Junior Devs.
For "sex" I was thinking "only on Fridays after midnight when it's not raining."
September 23, 2011 at 8:22 pm
Elliott Whitlow (9/23/2011)
Evil Kraig F (9/23/2011)
Addendum on a second look:
[hashField] AS ([dbo].[getMd5Hash]([entryId])),
As a calculated column, non-persisted?
This table is just RIPE for pain.
Good catch on that. Both the fact that HASHBYTES isn't used directly and that the column isn't persisted. I agree that for peak performance calculating the hash and sending it with the insert would be best, but for most cases calculating it and persisting it is fast enough.
CEWII
To be honest I'm not sure what the point of the hash is. You're only hashing one field (entryId), which you're also storing in the table as a varchar(100). There's extra cost associated with HASHBYTES - under the covers it does an implicit conversion of your data to VARBINARY, hash algorithms are inherently costly, and it's implemented as a bunch of calls to the CryptoAPI. One call to HASHBYTES invokes several CryptoAPI calls under the hood:
CryptAcquireContext()
CryptCreateHash()
CryptHashData()
CryptGetHashParam()
CryptGetHashParam() <-- the second call is the expensive one
CryptDestroyHash()
Hashing algorithms involve a lot of looping and memory copies (expensive) internally in addition to the bit-level manipulations (cheap). You'd probably get more efficiency from your app by precalculating on the .NET side where you can create a static MD5 object and avoid multiple calls under the hood to most of the functions above.
Usually I use hashes for change detection, but that generally involves a lot more columns than one. That's why I'm not sure what the point of even generating a hash in your example is. Also MD5 is not the best algorithm. It cycles in short character spans (something like 64 characters) so it's relatively easy to generate hash collisions with it. SHA is a much better family of hash functions.
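If a hash is kept at all (say, for change detection across several columns), here's the general shape I'd rather see -- a sketch only, with the table and column names assumed from the DDL posted earlier:
-- HASHBYTES is deterministic, so the computed column can be PERSISTED and
-- calculated once at INSERT/UPDATE time instead of on every read.
-- Wrap any nullable columns in ISNULL() before concatenating.
ALTER TABLE [dbo].[entries]
ADD [rowHash] AS HASHBYTES('SHA1', [entryId] + N'|' + [searchterm]) PERSISTED;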
September 23, 2011 at 8:42 pm
Mike,
I generally agree; I'm not sure what the purpose is for the hash here. I usually use it for change detection as well, or in some cases where I need to be able to search on encrypted data; I'll hash, say, the first 4 characters of a name. I also prefer the SHA family to MD5, since the collision domain is smaller. But MD5 uses a BINARY(16) whereas SHA1 uses a BINARY(20).
CEWII
September 24, 2011 at 10:36 am
Wow, you guys are awesome in response times.
So here goes -- I have less than 2 years' experience, so forgive the odd errors in design.
- The MD5 hash key was implemented as a suggestion from another forum that stated it would help with dupe checking and lookups since my combination keys were large strings (entryId and searchterm). I did not know that it would be problematic and useless anyway so I will remove that.
- So from what I gather, is the identity field OK, or is it unnecessary as well?
- The purpose of this DB, which contains 3 other similar tables that are accessed from other applications, is to house the data for 2 days, after which it is removed; it only serves as temporary storage for dupe-checking the data I write to the XML files, which are then delivered to our API index. The tables are also used to deliver reports to a web application that shows daily counts per table and different stats on the search terms, like counts per search term and a daily list, so the queries run whenever someone runs the SQL Server 2008 report tool.
- The DB maintenance involves a daily job that only includes cleaning history, rebuilding indexes, and a full DB backup. So I will look into the defragmentation.
- I am not sure about non-clustered indexes and whether I need them here.
- The .NET app is not single-threaded; there are around 50+ threads running, each creating up to 100 tasks, and each task on its own handles about 10,000 potential new records for the table.
I assume (searchterm) is your "business key" that you're using to locate the existing records. Also, based on your further info it sounds like you're single-threading the INSERTs. This being the case, I would change your indexing strategy to make (id) the clustered index and put a nonclustered index on (searchterm). The id column will automatically be appended to the end of the (searchterm) nonclustered index, which should reflect better how you query for the existence of records.
- I use a combination key of entryId and searchterm since I can have posts with the same entryId but different searchterm. How do I make id the clustered index when it is set up to auto-increment? I'm not sure I understand the logic behind it. How would I make it a clustered index and searchterm a nonclustered index?
- I will correct my table design, and thanks for picking up on these things.
- I will also look at compression.
- The SQL Server instance on this server houses another DB, which has a table that is accessed more than my application's: each task I create (which has +-10000 records to work with) accesses the table in this DB, which is linked through a web application, to validate whether the username in each record exists and to receive a unique ID for that username. So when I look at Task Manager, the CPU usage is at 98-100%.
- I will work on testing a bulk insert with NOT EXISTS and see how this pans out.
- I took a look at sp_who2 to check the transaction stats, and most of my associated transactions are sleeping, AWAITING COMMAND.
September 24, 2011 at 11:31 am
You say these tables are also used for reporting? If THIS particular table is used for reporting, there's a fairly good chance it has some indexes on it that are more valuable for SELECTs than anything else. Such indexes can really work against high-speed INSERTs. If there are one or more indexes whose first column has very few unique values (typical flag columns like "Sex", for example), then timeouts during inserts can become quite the problem due to "extent splits" of the non-clustered indexes, even if the FILLFACTOR on those indexes is quite low. This can happen even if you have a Clustered Index on something like an IDENTITY column to reduce "page splits" during INSERTs.
Check and see if you have any such indexes on the table being inserted into. If such indexes exist, consider changing the columnar order of the indexes or, if they are single-column indexes, consider consolidating some of them into composite indexes with columns that have a lot of unique values as the first column of the index.
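If you're not sure what's on the table, something like this will list each index and its key columns in order so you can spot any low-selectivity leading columns (just a starting point -- the table name is a placeholder):
-- Key columns for every index on the target table; look for nonclustered
-- indexes whose key_ordinal = 1 column has very few distinct values.
SELECT i.name AS index_name,
       ic.key_ordinal,
       c.name AS column_name
FROM sys.indexes i
JOIN sys.index_columns ic
  ON ic.object_id = i.object_id AND ic.index_id = i.index_id
JOIN sys.columns c
  ON c.object_id = ic.object_id AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID(N'dbo.entries') -- placeholder table name
  AND ic.is_included_column = 0
ORDER BY i.name, ic.key_ordinal;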
--Jeff Moden
Change is inevitable... Change for the better is not.
September 24, 2011 at 11:59 am
j_depp_99 (9/24/2011)
Wow, you guys are awesome in response times. So here goes -- I have less than 2 years' experience, so forgive the odd errors in design.
- The MD5 hash key was implemented as a suggestion from another forum that stated it would help with dupe checking and lookups since my combination keys were large strings (entryId and searchterm). I did not know that it would be problematic and useless anyway so I will remove that.
- So from what I gather, is the identity field OK, or is it unnecessary as well?
- The purpose of this DB, which contains 3 other similar tables that are accessed from other applications, is to house the data for 2 days, after which it is removed; it only serves as temporary storage for dupe-checking the data I write to the XML files, which are then delivered to our API index. The tables are also used to deliver reports to a web application that shows daily counts per table and different stats on the search terms, like counts per search term and a daily list, so the queries run whenever someone runs the SQL Server 2008 report tool.
- The DB maintenance involves a daily job that only includes cleaning history, rebuilding indexes, and a full DB backup. So I will look into the defragmentation.
- I am not sure about non-clustered indexes and whether I need them here.
- The .NET app is not single-threaded; there are around 50+ threads running, each creating up to 100 tasks, and each task on its own handles about 10,000 potential new records for the table.
I assume (searchterm) is your "business key" that you're using to locate the existing records. Also, based on your further info it sounds like you're single-threading the INSERTs. This being the case, I would change your indexing strategy to make (id) the clustered index and put a nonclustered index on (searchterm). The id column will automatically be appended to the end of the (searchterm) nonclustered index, which should reflect better how you query for the existence of records.
- I use a combination key of entryId and searchterm since I can have posts with the same entryId but different searchterm. How do I make id the clustered index when it is set up to auto-increment? I'm not sure I understand the logic behind it. How would I make it a clustered index and searchterm a nonclustered index?
- I will correct my table design, and thanks for picking up on these things.
- I will also look at compression.
- The SQL Server instance on this server houses another DB, which has a table that is accessed more than my application's: each task I create (which has +-10000 records to work with) accesses the table in this DB, which is linked through a web application, to validate whether the username in each record exists and to receive a unique ID for that username. So when I look at Task Manager, the CPU usage is at 98-100%.
- I will work on testing a bulk insert with NOT EXISTS and see how this pans out.
- I took a look at sp_who2 to check the transaction stats, and most of my associated transactions are sleeping, AWAITING COMMAND.
Agreed, the MD5 as implemented isn't going to help; it's only hashing one column. Even with 2 columns it's not worth the extra processing power to generate the hash in this instance.
Based on your description above, here are my immediate suggestions:
* Check and fix your data types. As was pointed out, SEX (if it means "GENDER") does not require 50 characters to identify/code. After all there are only 2 genders + Lady Gaga. Other data types you have might be better as DATETIME, INTEGER, etc. Look at the domain (the possible values that can be held in any given field) and range (the min and max) for every field and use the smallest adequate data type possible to cover it.
* Get rid of the MD5 hash.
Now this one depends on a couple of factors:
* If you are going to stay with the current methodology (50 threads pounding the same table with SELECT ... INSERT) simultaneously then you're going to want to do what we were talking about earlier in the thread and change your indexing strategy, partitioning, or something to remove the hotspot. My simple suggestion is to change the Clustered Index to (entryId, searchterm) and make the PK (id - the identity column) nonclustered. If your (entryId, searchterm) combinations are being sent to the server in a random fashion with a good distribution, then this will resolve the hotspot issue.
However, since you don't have a "real-time" delivery SLA, I would recommend going a slightly different route:
* Create a staging table and populate it with Service Broker calls, individual inserts, or whatever method. You'll need to optimize the staging table as we described previously. Schedule a regular push of this data to the final table with a single INSERT ... WHERE NOT EXISTS ... statement (a rough sketch of this is at the end of this post). You can also avoid the initial SELECT to check for existence against the staging table, and just grab the most recent (or oldest) row if you get dupes (duplicate business key values) in there.
This is slightly more complex than what you're doing now, but not significantly so. It should be a lot more efficient, though.
* If possible, you might queue up the changes at the client in a flatfile or some other storage and then bulk insert the data into the staging table. You can use the .NET SqlBulkCopy class to bulk load data this way.
* On the client side you might also consider playing around with the number of parallel threads running. Do some simple performance tests with the 50 you're running now, then try again -- but this time limit it to the number of physical processors on your server. You might play around with it a bit to find the optimal number of threads, but I doubt 50+ is optimal.
* As mentioned, since you have so much stuff running on your SQL Server box, make sure SQL Server itself is not starved for resources.
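To put the indexing and staging suggestions above into something you can adapt, here's a rough sketch. The table, column, and constraint names are assumptions based on the DDL posted earlier, so adjust everything to your actual schema:
-- 1) Nonclustered PK on the identity column, clustered index on the business key.
ALTER TABLE [dbo].[entries] DROP CONSTRAINT [PK_entries]; -- assumed constraint name
ALTER TABLE [dbo].[entries] ADD CONSTRAINT [PK_entries] PRIMARY KEY NONCLUSTERED ([id]);
CREATE CLUSTERED INDEX [CIX_entries_business] ON [dbo].[entries] ([entryId], [searchterm]);

-- 2) Staging table that the threads insert into, drained on a schedule.
CREATE TABLE [dbo].[entries_stage]
(
    [id] int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [entryId] varchar(100) NOT NULL,
    [searchterm] nvarchar(400) NOT NULL,
    [pubDate] datetime NOT NULL
    -- plus whatever other columns the real table needs
);

-- 3) Scheduled push: insert only rows whose business key isn't already present,
--    then clear the staging table. In practice wrap both statements in a
--    transaction (or delete only the rows you actually copied), and dedupe
--    within the staging table first if the same business key can arrive twice.
INSERT INTO [dbo].[entries] ([entryId], [searchterm], [pubDate])
SELECT s.[entryId], s.[searchterm], s.[pubDate]
FROM [dbo].[entries_stage] s
WHERE NOT EXISTS
(
    SELECT 1
    FROM [dbo].[entries] e
    WHERE e.[entryId] = s.[entryId]
      AND e.[searchterm] = s.[searchterm]
);

TRUNCATE TABLE [dbo].[entries_stage];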
September 27, 2011 at 10:13 am
I made the changes as far as the clustered and non-clustered indexes and the data field types. I just restarted everything, and it wasn't too long before I got the timeout errors. I am now going to look into the batch inserts, and hopefully this will be fine. I thought about the point that maybe SQL Server is being starved for resources, and I checked to make sure that my code closes all connections after each opening. Just to cover everything, does anyone have any final comment or suggestion on how to check SQL Server resources and whether it is being drained by a combination of things or just my application?
Thanks in advance
September 27, 2011 at 11:34 am
Mike C (9/24/2011)
* If you are going to stay with the current methodology (50 threads pounding the same table with SELECT ... INSERT) simultaneously then you're going to want to do what we were talking about earlier in the thread and change your indexing strategy, partitioning, or something to remove the hotspot. My simple suggestion is to change the Clustered Index to (entryId, searchterm) and make the PK (id - the identity column) nonclustered. If your (entryId, searchterm) combinations are being sent to the server in a random fashion with a good distribution, then this will resolve the hotspot issue.
I'm not 100% sold on this. You are trading a hot spot for likely page splits. He is going to need to set the fill factor on the clustered index to something like 50% (pure guess) to minimize this. While last pages can be a hot spot, page splits are terribly expensive. I did a Google search for "SQL 2008 last page contention" (without the quotes) and got some interesting reading.
You can look at these links, which cover MS's research when they were testing a MASSIVE box:
CEWII
September 27, 2011 at 12:04 pm
Elliott Whitlow (9/27/2011)
Mike C (9/24/2011)
* If you are going to stay with the current methodology (50 threads pounding the same table with SELECT ... INSERT) simultaneously then you're going to want to do what we were talking about earlier in the thread and change your indexing strategy, partitioning, or something to remove the hotspot. My simple suggestion is to change the Clustered Index to (entryId, searchterm) and make the PK (id - the identity column) nonclustered. If your (entryId, searchterm) combinations are being sent to the server in a random fashion with a good distribution, then this will resolve the hotspot issue.
I'm not 100% sold on this. You are trading a hot spot for likely page splits. He is going to need to set the fill factor on the clustered index to something like 50% (pure guess) to minimize this. While last pages can be a hot spot, page splits are terribly expensive. I did a Google search for "SQL 2008 last page contention" (without the quotes) and got some interesting reading.
You can look at these links, which cover MS's research when they were testing a MASSIVE box:
CEWII
You are trading upfront page splits for continuous hotspotting. If you can fit 20 rows per page you'll get a lot of page splits up front, or you can have 49 threads queued up behind the currently running thread forever. I tested with the fill factor set to 80% and saw a huge performance improvement. Try it out sometime: create a .NET app that launches dozens of threads, all inserting simultaneously, and see which way works best for you.
September 27, 2011 at 2:29 pm
Mike C (9/27/2011)
Elliott Whitlow (9/27/2011)
Mike C (9/24/2011)
* If you are going to stay with the current methodology (50 threads pounding the same table with SELECT ... INSERT) simultaneously then you're going to want to do what we were talking about earlier in the thread and change your indexing strategy, partitioning, or something to remove the hotspot. My simple suggestion is to change the Clustered Index to (entryId, searchterm) and make the PK (id - the identity column) nonclustered. If your (entryId, searchterm) combinations are being sent to the server in a random fashion with a good distribution, then this will resolve the hotspot issue.
I'm not 100% sold on this. You are trading a hot spot for likely page splits. He is going to need to set the fill factor on the clustered index to something like 50% (pure guess) to minimize this. While last pages can be a hot spot, page splits are terribly expensive. I did a Google search for "SQL 2008 last page contention" (without the quotes) and got some interesting reading.
You can look at these links, which cover MS's research when they were testing a MASSIVE box:
CEWII
You are trading upfront page splits for continuous hotspotting. If you can fit 20 rows per page you'll get a lot of page splits up front, or you can have 49 threads queued up behind the currently running thread forever. I tested with the fill factor set to 80% and saw a huge performance improvement. Try it out sometime: create a .NET app that launches dozens of threads, all inserting simultaneously, and see which way works best for you.
I have built applications that do significant inserts, and while last page contention CAN be an issue, I would suggest looking at that first article to find out whether it really is one in this case. Continuous hotspotting sounds bad, but without looking at it empirically you can't know whether it really IS a problem. To reduce page splitting you have to reduce the fill factor, and how often you re-index has a lot to do with whether that number needs to be 80%, 50%, or 30%.
Like almost everything with SQL, it depends.
CEWII
September 27, 2011 at 2:58 pm
It does depend, and on a lot more factors than we've covered here so far. I'm not familiar with your apps, but note there is a difference between significant inserts and significant parallel inserts into the same table. Wait stats can easily be used to prove that blocking is or is not a significant issue. I will read those docs when I have a chance; however, I've been through nearly this exact same discussion with the SQLCAT team.
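If it helps, this is the sort of check I mean -- a rough sketch, nothing more:
-- Cumulative wait stats since the last restart (or since the counters were cleared).
-- Heavy PAGELATCH_EX, LCK_M_*, or WRITELOG waits point toward blocking or
-- last-page contention; small numbers mean the bottleneck is elsewhere.
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGELATCH%'
   OR wait_type LIKE 'LCK_M%'
   OR wait_type = 'WRITELOG'
ORDER BY wait_time_ms DESC;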
September 27, 2011 at 4:17 pm
Mike C (9/27/2011)
It does depend, and on a lot more factors than we've covered here so far. I'm not familiar with your apps but note there is a difference between significant inserts and significant parallel inserts into the same table. Wait stats can be easily used to prove that blocking is or is not a significant issue. I will read those docs when I have a chance; however, I've been through nearly this exact same discussion with the SQLCAT team.
My big point here is just that, it depends. There are a lot of factors and parallel inserts causing last page contention can certainly be one of them.
I had a process running 50-100 threads, each doing an insert every 100-250 ms; it could pound the machine HARD, and the process became a benchmark for us to test new hardware because it could saturate an entire large machine for HOURS. I re-architected the process to batch up to 2000 records and then dump them into the table. Run times dropped slightly. Basically I can say that the parallel inserts had very little effect on performance.
CEWII