July 27, 2012 at 9:41 am
Gianluca Sartori (7/27/2012)
Jeff Moden (7/27/2012)
SQL Kiwi (7/27/2012)
Jeff Moden (7/27/2012)
I had an ephipany that just hit me like a ton of bricks....I see! Well that explains it
...
Man! I wonder how many forum battles have been fought over such misunderstandings.
* Oh
* Heh
* Uhmmm
* ...
Basically, you're saying that if people didn't mumble on the forums nobody would find a reason to argue, right? 😛
Actually, all we need to do is eliminate the use of language completely from forum posts. That'd do it.
I once had someone tell me that, "You can't write sarcasm or irony. They don't work in writing." My response was, "I'll be sure to inform Mark Twain and Voltaire that they need to change career-paths." Judging by the responses to that, either the level of irony was deeper than I think, or nobody on that forum had ever heard of either of those two.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
July 27, 2012 at 9:43 am
SQL Kiwi (7/26/2012)
Re: Aaron's article.If I were the author of the SSC article concerned, these are the main points I would take away:
- The in-line article code should be updated to ensure people always use the latest/best version.
- It should be clearly stated that performance is poor above 8000 bytes, and the code should not be modified to do that.
- A T-SQL splitter that can handle > 8000 bytes is required.
It is too harsh to say Aaron's article is "poorly written" (he clearly put considerable time and effort into it). One might not agree with the change to MAX data types (though it is a natural change many people might make, absent any explicit warnings to the contrary in the original article). One might also not agree with the testing methodology (Aaron does explain his reasoning, though).
From the CLR side, I might complain that the generic CLR splitter used (Adam's) is not the optimal one to compare since it handles multi-character delimiters. If it were important, I would point Aaron to the splitter code I wrote for Jeff, which is optimized for single-character delimiters. I probably won't, since the difference is relatively small, and he comes to the same correct conclusion that Jeff does anyway: "The current best way to split delimited strings in SQL Server is to use a CLR splitter.".
That said, it ought to be possible to make those sort of technical responses without causing Aaron's Snark-O-Meter™ to explode.
+1
+1
+1
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events
July 27, 2012 at 10:13 am
SQL Kiwi (7/27/2012)
And people ask me why I don't post as much on SSC as I used to 🙁
Really unfortunate - I would love to see more from you here on SSC.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events
July 27, 2012 at 10:16 am
GSquared (7/27/2012)
L' Eomot Inversé (7/27/2012)
GSquared (7/27/2012)
Brandie Tarvin (7/27/2012)
Back to the semi-technical discussion...If joining a VARCHAR(8000) to a VARCHAR(MAX) value degrades performance, how does one get around this?
I'm not speaking in the context of the splitter function. I'm pretending I have two tables with joinable columns that were poorly designed (or different systems) and I need to join them.
Do I convert the 8k column to a MAX? Or is there some other way to resolve the issue?
Generate and join on Checksum/Hashbytes is probably your best bet.
You could start with a check on lengths (if the lengths are unequal so are the strings), then for the remaining candidates cast the varchar(max) strings to varchar(8000) (which is guaranteed to be a valid cast as the length doesn't exceed 8000). Of course if that's written as a single query the optimiser may do something very silly with it (leaving you with casts that would truncate, which is far from helpful) so some care is needed.
You certainly can't use hashbytes for this case unless the varchar(max) strings are known to be less than 8000 bytes, as its input is limited to 8000 bytes. Using checksum is possible (I think - at least it doesn't complain if you feed it a lot of bytes and I remember any documentation saying it's restricted to 8000), but it won't be effecient unless the optimiser is clever about it - the optimiser deals just fine with the case where the checksum is part of the data, but not neccessarily with the case where neither checksums being compared is a pre-computed checksum held in the data; and anyway, checksumming the long objects is quite slow, has to read and process all the data, whereas comparing lengths shouldn't involve reading and processing all the data - if a lot of the varchar(MAX) strings are very long the cheapness of length may offset the finer (but still far from perfect) filtering provided by checksum.
In long-string joins or Where clauses, using a checksum column, pre-computed and indexed, can greatly speed up query results if used correctly.
It's not the final test, because of collision and the possibility of input-length exceeding the capacity of the functions, but it can make a final query much more efficient if you can narrow down the rows that have to be tested for text-equality to those that have matching checksums.
I've tested this on tables with names in them, where the strings aren't even all that large, and seen huge performance increases. You'd have to do your own tests, of course, but comparing two integers can be much more efficient than comparing large strings. Larger strings get an even better hit than simple names, in my tests.
Say you have to compare two strings that are 10,000 characters each. If you generate a checksum on the first 100 characters, the chance of collision (false-positive) is real, but very small.
Or, let's say you need to find a row in a million-row table, where the text in it matches Moby Dick. You have the desired text in an input parameter. Two varchar(max) values being compared for equality, and no way to index varchar(max). But if you have an indexed checksum column of the first 100 or 1,000 characters, on the table you want to query, and you do a checksum of the same number of characters in your input parameter, you'll limit your text comparison to probably 1 row, maybe 2 or 3 false positives. Narrowed down that way, you then do the real comparison (and hope the input parameter doesn't have a typo on page 400 that the stored value doesn't have) to the resulting 2 or 3 rows. MUCH faster than a single query comparing the whole values.
That's an extreme example, but very long text strings for things like molecular formulae are very possible, and finding if one has already been tested against in a drug company lab is a very real possibility of a use for comparing multi-thousand character strings for exact matches.
Using length instead of a checksum is possible, but less likely to be useful in some cases. Afer all, H2O and NaS have the same length, but they are completely different molecules, and they generate completely different checksums/hashes.
Don't disagree on any of that, as I think was clear from my post (I did specifically refer to the case where at least one of the checksums was precomputed).
I hadn't though of using hashbytes on an initial substring, it looks like a reasonable option, again it needs to be precomputed, but it does mean that hashbytes can be used (as well as checksum) if you hash a substring so I was clearly wrong to say that hashbytes couldn't be used because of its input length restriction. :blush:
On second thoughts, I might disagree that NaS and H2O are the same length - not on the grounds that the SG ration is 2.165 while the Molecular weight ratio is 3.244 (about 3.6 if the H2o is solid)so obviously the molecules have very different sizes (unless this is a case where 2.165 is equal to 2.244 [or 3.6] for practical purposes) but on the gounds that the length of water and the length of salt are both meaningless/inapplicable/unknown/dependant on all sorts of things and I won't accept that NULL = NULL is true (or, for that matter, false). :hehe::hehe::hehe:
Tom
July 27, 2012 at 10:24 am
Lynn Pettis (7/27/2012)
Okay, I may be missing something, but how do you accidently mirror a database?
By asking the sysadmin "who's the fairest of them all?"
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events
July 27, 2012 at 10:27 am
SQL Kiwi (7/27/2012)
ChrisM@Work (7/27/2012)
I'm still uncomfortable with "in order to handle our longest string (500,000 characters)". In all the time I've been working, and lurking here, I've never seen a requirement for splitting a string longer than one or two hundred characters, and more than thirty or forty is rare.Oh, I have, most recently as a list of security ids for a financial application, which would typically be thousands or tens or thousands of items long. No doubt very many real-world uses could happily use any splitting mechanism, but is that the point?
I have seen the same sort of requirement. We opted for the TVP method that was mentioned at the tail of the article since it was an application that was bringing the data in delimited form into the database. Several times faster that way.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events
July 27, 2012 at 10:27 am
Koen Verbeeck (7/27/2012)
Jeff, this is all your fault.http://www.sqlservercentral.com/Forums/Topic1336367-147-1.aspx#bm1336413
ps: irony
:w00t::cool::hehe:
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events
July 27, 2012 at 10:32 am
Stefan Krzywicki (7/27/2012)
Ugh, junior coder here is using SELECT * INTO in production code and won't listen to me when I tell him to change it, even with non-CELKO-like explanations.
I have been dealing with a lot of that of late as well (from vendors and other sources specifically). It takes a lot of restraint 😉
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events
July 27, 2012 at 10:47 am
jcrawf02 (7/27/2012)
Gianluca Sartori (7/27/2012)
Jeff Moden (7/27/2012)
SQL Kiwi (7/27/2012)
Jeff Moden (7/27/2012)
I had an ephipany that just hit me like a ton of bricks....I see! Well that explains it
...
Man! I wonder how many forum battles have been fought over such misunderstandings.
* Oh
* Heh
* Uhmmm
* ...
Basically, you're saying that if people didn't mumble on the forums nobody would find a reason to argue, right? 😛
Amusing, but probably not that far from the truth. Wishy-washy language almost never works for good. I am guilty of this in real life all the time, not just on forums.
ROTFL!
My mother used to describe tea that had not been on the pot long enough with those words! :hehe:
I don't know if it's a Scottish thing, though.
-- Gianluca Sartori
July 27, 2012 at 10:58 am
Reading all these posts today, I feel the need to say something stupid.
"something stupid."
Okay, all better now. @=)
July 27, 2012 at 11:04 am
Jeff Moden (7/27/2012)
Koen Verbeeck (7/27/2012)
Jeff, this is all your fault.http://www.sqlservercentral.com/Forums/Topic1336367-147-1.aspx#bm1336413
ps: irony
BWAAA_HAAA!!!! Too funny, Koen! I think that even without your "ps", I'd have picked up on the irony of that one! 😀
I thought I'll just add it because we're all a bit sensitive today 😉 😀
On topic: bcp is in my opinion a better solution than SSIS in that given scenario, so there's no shame admitting it.
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
July 27, 2012 at 11:11 am
Jeff Moden (7/27/2012)
SQL Kiwi (7/27/2012)
Jeff Moden (7/27/2012)
I had an ephipany that just hit me like a ton of bricks....I see! Well that explains it
Wow... Another eye opener that you might not realize that you've just provided me. Here's another word that fits the category of things that look like ironic sarcasm because it's used as sarcasm so often in IT. The word "well". I've seen hundreds of posts that folks, including me, have taken as a bit of ironic sarcasm or "ring-knocker-ish" simply because of the word "well". For example, someone says something like "Well, if you had looked at the code that... etc, etc", it frequently comes across as "Well, you ignorant slob, if you had looked at the code that..." instead of it being correctly taken as word that is usually meant to show a bit of thoughtfull understanding.
The same goes for things like "Ah..." and "Ummmm..." the first being taken as a sarcastic pause instead of the "I see!" that is was meant to be and the "Ummmm..." being taken as a sign of sarcastic restraint instead of an attempt to show thoughtfulness. Even "Heh...." can come across as incredibly sarcastic, at times.
Man! I wonder how many forum battles have been fought over such misunderstandings. Thanks again for the feedback, Paul. This has been a real eye opener.
And that's why they invented this: ?
(there's also some sort of lightning symbol that denotes irony)
http://en.wikipedia.org/wiki/Irony_punctuation
And everybody uses it ?
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
July 27, 2012 at 11:13 am
L' Eomot Inversé (7/27/2012)
GSquared (7/27/2012)
L' Eomot Inversé (7/27/2012)
GSquared (7/27/2012)
Brandie Tarvin (7/27/2012)
Back to the semi-technical discussion...If joining a VARCHAR(8000) to a VARCHAR(MAX) value degrades performance, how does one get around this?
I'm not speaking in the context of the splitter function. I'm pretending I have two tables with joinable columns that were poorly designed (or different systems) and I need to join them.
Do I convert the 8k column to a MAX? Or is there some other way to resolve the issue?
Generate and join on Checksum/Hashbytes is probably your best bet.
You could start with a check on lengths (if the lengths are unequal so are the strings), then for the remaining candidates cast the varchar(max) strings to varchar(8000) (which is guaranteed to be a valid cast as the length doesn't exceed 8000). Of course if that's written as a single query the optimiser may do something very silly with it (leaving you with casts that would truncate, which is far from helpful) so some care is needed.
You certainly can't use hashbytes for this case unless the varchar(max) strings are known to be less than 8000 bytes, as its input is limited to 8000 bytes. Using checksum is possible (I think - at least it doesn't complain if you feed it a lot of bytes and I remember any documentation saying it's restricted to 8000), but it won't be effecient unless the optimiser is clever about it - the optimiser deals just fine with the case where the checksum is part of the data, but not neccessarily with the case where neither checksums being compared is a pre-computed checksum held in the data; and anyway, checksumming the long objects is quite slow, has to read and process all the data, whereas comparing lengths shouldn't involve reading and processing all the data - if a lot of the varchar(MAX) strings are very long the cheapness of length may offset the finer (but still far from perfect) filtering provided by checksum.
In long-string joins or Where clauses, using a checksum column, pre-computed and indexed, can greatly speed up query results if used correctly.
It's not the final test, because of collision and the possibility of input-length exceeding the capacity of the functions, but it can make a final query much more efficient if you can narrow down the rows that have to be tested for text-equality to those that have matching checksums.
I've tested this on tables with names in them, where the strings aren't even all that large, and seen huge performance increases. You'd have to do your own tests, of course, but comparing two integers can be much more efficient than comparing large strings. Larger strings get an even better hit than simple names, in my tests.
Say you have to compare two strings that are 10,000 characters each. If you generate a checksum on the first 100 characters, the chance of collision (false-positive) is real, but very small.
Or, let's say you need to find a row in a million-row table, where the text in it matches Moby Dick. You have the desired text in an input parameter. Two varchar(max) values being compared for equality, and no way to index varchar(max). But if you have an indexed checksum column of the first 100 or 1,000 characters, on the table you want to query, and you do a checksum of the same number of characters in your input parameter, you'll limit your text comparison to probably 1 row, maybe 2 or 3 false positives. Narrowed down that way, you then do the real comparison (and hope the input parameter doesn't have a typo on page 400 that the stored value doesn't have) to the resulting 2 or 3 rows. MUCH faster than a single query comparing the whole values.
That's an extreme example, but very long text strings for things like molecular formulae are very possible, and finding if one has already been tested against in a drug company lab is a very real possibility of a use for comparing multi-thousand character strings for exact matches.
Using length instead of a checksum is possible, but less likely to be useful in some cases. Afer all, H2O and NaS have the same length, but they are completely different molecules, and they generate completely different checksums/hashes.
Don't disagree on any of that, as I think was clear from my post (I did specifically refer to the case where at least one of the checksums was precomputed).
I hadn't though of using hashbytes on an initial substring, it looks like a reasonable option, again it needs to be precomputed, but it does mean that hashbytes can be used (as well as checksum) if you hash a substring so I was clearly wrong to say that hashbytes couldn't be used because of its input length restriction. :blush:
On second thoughts, I might disagree that NaS and H2O are the same length - not on the grounds that the SG ration is 2.165 while the Molecular weight ratio is 3.244 (about 3.6 if the H2o is solid)so obviously the molecules have very different sizes (unless this is a case where 2.165 is equal to 2.244 [or 3.6] for practical purposes) but on the gounds that the length of water and the length of salt are both meaningless/inapplicable/unknown/dependant on all sorts of things and I won't accept that NULL = NULL is true (or, for that matter, false). :hehe::hehe::hehe:
Yep. I was just clarifying what I meant so that I could break the rules for The Thread getting technical. More fun that way.
Ignoring that NaS isn't "salt" (it is a salt, but not salt, to be more precise and ambiguous, both at the same time), gotta agree there, too. Length is meaningless at that scale.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
July 27, 2012 at 12:13 pm
With these language problems maybe we should ask Steve to create a new IFCode <sarcasm></sarcasm> to highlight sarcasm including a caveat
'If you take offence with this sarcasm it is your fault and complaining will only result in more sarcasm'.
Also add new emoticons of a tap with different size drips to add an endearing quality to the sarcasm.
Far away is close at hand in the images of elsewhere.
Anon.
July 27, 2012 at 12:30 pm
David Burrows (7/27/2012)
With these language problems maybe we should ask Steve to create a new IFCode <sarcasm></sarcasm> to highlight sarcasm including a caveat'If you take offence with this sarcasm it is your fault and complaining will only result in more sarcasm'.
Also add new emoticons of a tap with different size drips to add an endearing quality to the sarcasm.
Also we need [joke]...[/joke] for clowning.
And would the tap icon mean "the readers of this are drips" or "the author is a drip" or neither of these?
"a tap with different size drips" is ambiguous - is that one icon with a tap and several drips of different sizes, or several items with a tap and a drip each, the different icons using different drip sizes?
Tom
Viewing 15 posts - 37,156 through 37,170 (of 66,712 total)
You must be logged in to reply to this topic. Login to reply