November 22, 2009 at 2:56 pm
Michael,
Just a quick reply for now, because I have something else to add to this thread in a moment.
Collation determines the way character data is mapped to bit patterns, the rules for ordering, and the rules for making comparisons. No data is ever changed - it's just the rules for ordering and comparison that differ, depending on how the various collations involved resolve. The best references for this sometimes confusing topic are the relevant BOL entries on collation.
If you work through some examples alongside the BOL material, it should become clearer.
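A minimal illustration of the comparison rules (the collation names here are just examples):

```sql
-- The same two strings compare as equal or different depending
-- purely on the collation used for the comparison:
SELECT
    CaseInsensitive = CASE WHEN 'abc' = 'ABC' COLLATE Latin1_General_CI_AS
                           THEN 'equal' ELSE 'different' END,
    CaseSensitive   = CASE WHEN 'abc' = 'ABC' COLLATE Latin1_General_CS_AS
                           THEN 'equal' ELSE 'different' END
-- CaseInsensitive: 'equal'; CaseSensitive: 'different'.
-- The stored bytes never change - only the comparison rules do.
```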
Paul
November 22, 2009 at 3:50 pm
Jeff & Michael,
:w00t: I now understand why Jeff's method broke! :w00t:
Consider these very simple statements:
SELECT REPLACE(' ', ' ', 'X') -- Example 1
SELECT REPLACE(' ' COLLATE LATIN1_GENERAL_BIN, ' ', 'X') -- Example 2
SELECT REPLACE(SPACE(2), SPACE(2), 'X') -- Example 3
SELECT REPLACE(SPACE(2) COLLATE LATIN1_GENERAL_BIN, SPACE(2), 'X') -- Example 4
There are two space characters in Example 1 and Example 2.
Examples 3 and 4 are exactly the same, except that they use SPACE(2) in place of the two literal space characters.
In SQL 2000 (8.0.2171):
Example 1 produces X
Example 2 produces XX <== This is what breaks Jeff's method
Example 3 produces X
Example 4 produces X
In SQL 2005 and 2008, all four examples give the expected output of a single 'X'.
So...what is example 2 all about in 2000? Well, a full discussion would be very long and dull (it involves data type resolution, implicit conversions, trailing blank handling with VARCHAR, expression constant folding, and the order in which all these rules are applied).
The bottom line is that literals ending with a space can be problematic in 2000 where an explicit collation is specified. It exposes what is probably a bug: trailing blanks are trimmed from one implicit VARCHAR literal but not the other, due to a subtle difference in the order in which the implicit CONVERTs are applied relative to the concatenations.
Adding COLLATE clauses to everything certainly works, but now that I know why it is happening, it turns out we only need two - wherever there are trailing blanks:
SELECT
CleanString =
REPLACE (
REPLACE (
REPLACE (
--Changes 2 spaces to the OX model
OriginalString COLLATE LATIN1_GENERAL_BIN, SPACE(2), SPACE(1) + CHAR(7)),
--Changes the XO model to nothing
CHAR(7) + SPACE(1) COLLATE LATIN1_GENERAL_BIN, SPACE(0)),
--Changes the remaining X's to nothing
CHAR(7), SPACE(0))
FROM @demo
WHERE CHARINDEX(' ',OriginalString) > 0
The above code works correctly on 2000, 2005, and 2008, regardless of server or database collations. I changed the space and empty string literals to SPACE(n) functions for clarity - it works just fine with literals too.
For anyone interested, an example difference in the 2000 query plan where this bug is exposed is shown below:
--[Expr1002]=replace(replace(replace(Convert(Convert(@Demo.[OriginalString])), Convert(space(2)), Convert(space(1)+char(7))), Convert(char(7))+Convert(space(1)), Convert(space(0))), Convert(char(7)), Convert(space(0)))
--[Expr1002]=replace(replace(replace(Convert(Convert(@Demo.[OriginalString])), Convert(space(2)), Convert(space(1)+char(7))), Convert(char(7)+space(1)), Convert(space(0))), Convert(char(7)), Convert(space(0)))
Paul
November 22, 2009 at 4:10 pm
Dear All
A fresh approach :smooooth:
If the problem were slightly different from Jeff's article, and we had to remove any bad character or pattern of characters, then I thought holding those characters or patterns in a table would be cool - and then of course you can just add new patterns to the table as you like! This would allow an end user to have control over 'string cleaning'.
For starters I have added the double-space possibilities only, just to see if I could get things to work.
Here is a small script that you can copy and have a go with:
-- Create a table of all the rotten stuff we want to remove
DECLARE @table_StringsToRemove TABLE (RemovePattern VARCHAR(50), ReplaceWith VARCHAR(30))
INSERT INTO @table_StringsToRemove(RemovePattern, ReplaceWith)
SELECT SPACE(2), ''
UNION ALL SELECT SPACE(3), ''
UNION ALL SELECT SPACE(5), ''
UNION ALL SELECT SPACE(9), ''
UNION ALL SELECT SPACE(17), ''
UNION ALL SELECT SPACE(33), ''
-- Create a test table of dirty strings:
DECLARE @table_StringsToClean TABLE (MyString VARCHAR(max))
INSERT INTO @table_StringsToClean(MyString)
SELECT 'a b c d e f g h i j k'
UNION ALL SELECT ' x b c d e f g h i j k'
-- Clean the strings
UPDATE @table_StringsToClean
SET MyString = REPLACE(MyString, RemovePattern, ReplaceWith)
FROM @table_StringsToRemove, @table_StringsToClean
WHERE CHARINDEX(RemovePattern, MyString) != -1
-- Display result
SELECT * FROM @table_StringsToClean
-- output of above SQL is
a bc de f g h i jk
xb cd ef gh ij k
Pretty cool?
[Edit: no it's not - the output is wrong]
This you might like because:
A/ it's inline, set-based SQL (I might try it as a UDF tomorrow if time permits)
B/ you can add new patterns without recoding the SQL
C/ you can replace with stars (***) maybe to indicate text removed: perhaps obscene words
If you wanted to remove other stuff you could try this:
DECLARE @table_StringsToRemove TABLE (RemovePattern VARCHAR(50), ReplaceWith VARCHAR(30))
INSERT INTO @table_StringsToRemove(RemovePattern, ReplaceWith)
SELECT SPACE(2), ''
UNION ALL SELECT SPACE(3), ''
UNION ALL SELECT SPACE(5), ''
UNION ALL SELECT SPACE(9), ''
UNION ALL SELECT SPACE(17), ''
UNION ALL SELECT SPACE(33), ''
UNION ALL SELECT REPLICATE(CHAR(9),2), '' -- Duplicate TABS
UNION ALL SELECT REPLICATE(CHAR(9),5), ''
UNION ALL SELECT REPLICATE(CHAR(9),9), ''
UNION ALL SELECT REPLICATE(CHAR(9),17), ''
UNION ALL SELECT REPLICATE(CHAR(9),33), ''
UNION ALL SELECT 'boobs', 'knockers' --- Obscene stuff!!
UNION ALL SELECT 'RBAR', 'R***' --- Another rude word!
-- we can add anything to the above list
DECLARE @table_StringsToClean TABLE (MyString VARCHAR(max))
INSERT INTO @table_StringsToClean(MyString)
SELECT 'a b c d e f g h i j k'
UNION ALL SELECT ' x b c d e f g h i j k'
UPDATE @table_StringsToClean
SET MyString = REPLACE(MyString, RemovePattern, ReplaceWith)
FROM @table_StringsToRemove, @table_StringsToClean
WHERE CHARINDEX(RemovePattern, MyString) != -1
SELECT * FROM @table_StringsToClean
One last idea!!
Wrap the above in an AFTER UPDATE trigger to keep the data clean as it changes; then massive batch updates are not required, and performance is less of an issue.
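A rough sketch of that trigger idea (table, key, and trigger names are hypothetical; the cleaning expression is just the three-REPLACE method from earlier in the thread):

```sql
CREATE TRIGGER trg_CleanStrings
ON dbo.SomeTable
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Clean only the rows touched by this statement
    UPDATE t
    SET    SomeColumn = REPLACE(REPLACE(REPLACE(
               t.SomeColumn, SPACE(2), SPACE(1) + CHAR(7)), -- 2 spaces -> OX
               CHAR(7) + SPACE(1), SPACE(0)),               -- XO -> nothing
               CHAR(7), SPACE(0))                           -- stray X -> nothing
    FROM   dbo.SomeTable AS t
    JOIN   inserted AS i ON i.KeyColumn = t.KeyColumn
    WHERE  CHARINDEX(SPACE(2), t.SomeColumn) > 0;
END;
```

Note that if the database option RECURSIVE_TRIGGERS is ON (it is OFF by default), a trigger that updates its own table will fire itself; the CHARINDEX predicate stops that once the string is clean.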
P.S. As regards prior performance tests:
What I am taking from the advice/comments is: don't take ANY results for granted - test all the different solutions in your own unique environment, where things may be quite different, for many different reasons!
Mind you - if you run your entire enterprise on a crabby pattie laptop like mine - then expect similar results to those I have posted 🙂
Hope you all had a fine weekend:-):-)
C# Gnu
____________________________________________________
November 22, 2009 at 4:14 pm
Oops I just noticed the output is wrong:
a bc de f g h i jk
Any ideas?
C# Gnu
____________________________________________________
November 22, 2009 at 4:31 pm
Here is the corrected script
-- Create a table of all the rotten stuff we want to remove
DECLARE @table_StringsToRemove TABLE (RemovePattern VARCHAR(50), ReplaceWith VARCHAR(30))
INSERT INTO @table_StringsToRemove(RemovePattern, ReplaceWith)
SELECT SPACE(2), ' '
UNION ALL SELECT SPACE(3), ' '
UNION ALL SELECT SPACE(5), ' '
UNION ALL SELECT SPACE(9), ' '
UNION ALL SELECT SPACE(17), ' '
UNION ALL SELECT SPACE(33), ' '
-- Create a test table of dirty strings:
DECLARE @table_StringsToClean TABLE (MyString VARCHAR(max))
INSERT INTO @table_StringsToClean(MyString)
SELECT 'a b c d e f g h i j k'
UNION ALL SELECT ' x b c d e f g h i j k'
-- Clean the strings
UPDATE @table_StringsToClean
SET MyString = REPLACE(REPLACE(REPLACE(REPLACE(MyString, RemovePattern, ReplaceWith), RemovePattern, ReplaceWith) ,RemovePattern, ReplaceWith),RemovePattern, ReplaceWith)
FROM @table_StringsToRemove, @table_StringsToClean
WHERE CHARINDEX(RemovePattern, MyString) != -1
-- Display result
SELECT * FROM @table_StringsToClean
C# Gnu
____________________________________________________
November 22, 2009 at 5:03 pm
C# Screw (11/22/2009)
Oops I just noticed the output is wrong:
a bc de f g h i jk
Any ideas?
It's a nice idea - and very creative - but it won't work like that I'm afraid.
In order to work, this method would need to be able to apply more than one REPLACE operation to each row, do it in a guaranteed order, and operate on the result of the previous operation.
Sadly, none of these things is possible using UPDATE with a CROSS JOIN. If they were, we might not need recursive CTEs...
The engine will do the join, but the stream aggregate includes an ANY internal aggregate:
[RemovePattern] = Scalar Operator(ANY([RemovePattern])), [ReplaceWith] = Scalar Operator(ANY([ReplaceWith])), [MyString] = Scalar Operator(ANY([MyString]))
...so each row is only operated on once, by one REPLACE. The 'corrected' version simply applies the exact same REPLACE expression four times. In effect, it applies the SPACE(2) -> SPACE(1) replacement four times. Note that this is just a quirk of the current plan - it is not guaranteed to choose the SPACE(2) row over SPACE(9), SPACE(3), SPACE(17), or any other...!
Sorry!
November 22, 2009 at 5:08 pm
Yes thanks Paul, I noticed myself that it was not replacing the 'rbar' text in the 2nd test (after I posted, unfortunately :blush:) - bother - I was so excited by the idea too! Back to the drawing board! 🙂
C# Gnu
____________________________________________________
November 22, 2009 at 5:43 pm
Hi
Not quite as exciting - but at least the replace strings are in a table, as I wanted:
😎
CREATE FUNCTION fn_CleanString(@MyString VARCHAR(max))
RETURNS VarChar(MAX)
AS
BEGIN
-- Remove unwanted strings
SELECT @MyString = REPLACE(@MyString, RemovePattern, ReplaceWith)
FROM tPatterns
WHERE CHARINDEX(RemovePattern, @MyString) != -1
--Now Remove duplicate spaces & tabs(Jeff) - or other method
SELECT @MyString = REPLACE(
REPLACE(
REPLACE(
LTRIM(RTRIM(@MyString)) ,' ',' '+CHAR(7)) --Changes 2 spaces to the OX model
,CHAR(7)+' ','') --Changes the XO model to nothing
,CHAR(7),'') --Changes the remaining X's to nothing
RETURN @MyString
END
go
-- Create a table of all the rotten stuff we want to remove
SELECT 'tablescan' RemovePattern, 'seek' ReplaceWith
INTO tPatterns
UNION ALL SELECT 'RBAR', 'R***'
UNION ALL SELECT 'anyword', 'R*******'
-- Create some dirty test data
SELECT 'a b c d e f g h rbar i tablescan j k' myString
INTO tMyStrings
UNION ALL
SELECT ' tablescan rbar x b c d e f g h i j k'
-- Clean & output our test data
SELECT dbo.fn_CleanString(myString) FROM tMyStrings
go
-- Cleanup
DROP TABLE tPatterns
DROP TABLE tMyStrings
DROP FUNCTION fn_CleanString
C# Gnu
____________________________________________________
November 22, 2009 at 6:05 pm
:satisfied:!!!!Wait a minute - I thought I was onto something!!!!!!:satisfied:
This works nicely without the Nested replace:-):-):
CREATE FUNCTION fn_CleanString(@MyString VARCHAR(max))
RETURNS VarChar(MAX)
AS
BEGIN
-- Remove unwanted strings
SELECT @MyString = REPLACE(@MyString, RemovePattern, ReplaceWith)
FROM tPatterns
WHERE CHARINDEX(RemovePattern, @MyString) != -1
RETURN @MyString
END
go
-- Create a table of all the rotten stuff we want to remove
SELECT 'tablescan' RemovePattern, 'seek' ReplaceWith
INTO tPatterns
UNION ALL SELECT 'RBAR', 'R***'
UNION ALL SELECT 'anyword', 'R*******'
UNION ALL SELECT SPACE(2), ' '
UNION ALL SELECT SPACE(2), ' '
UNION ALL SELECT SPACE(2), ' '
UNION ALL SELECT SPACE(2), ' '
UNION ALL SELECT SPACE(2), ' '
-- Create some dirty test data
SELECT 'a b c d e f g h rbar i tablescan j k' myString
INTO tMyStrings
UNION ALL
SELECT ' tablescan rbar x b c d e f g h i j k'
-- Clean & output our test data
SELECT dbo.fn_CleanString(myString) FROM tMyStrings
go
-- Cleanup
DROP TABLE tPatterns
DROP TABLE tMyStrings
DROP FUNCTION fn_CleanString
C# Gnu
____________________________________________________
November 22, 2009 at 6:13 pm
Paul White (11/22/2009)
Jeff & Michael,
:w00t: I now understand why Jeff's method broke! :w00t:
Consider these very simple statements:
Heh... Yep... that's what I got out of it, too, but I didn't say any such thing because of the handrails that a lot of folks have been trying to install lately... It was easier for me to just say that the fix is to put the COLLATE statement on all string constants. You, of course, are doing it the right way. Another "right" way would be to use two "unlikely" characters but that would also require a 4th replace to restore the first unlikely character to a space, perhaps negating the speed gain.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 22, 2009 at 6:20 pm
C# Screw (11/22/2009)
:satisfied:!!!!Wait a minute - I thought I was onto something!!!!!!:satisfied:
This works nicely without the Nested replace:-):-):
Very cool revelation but I'd suggest using the same spacing scheme that Michael used for two reasons... it's faster and it will handle up to 65536 spaces.
As a side bar, it's interesting that some folks in my group at work recently received some 3rd party files with (you won't believe this) a freakin' million spaces between some words. That's not an exaggeration... there were over 1 million contiguous spaces in several spots in the file. I'm thinking "road trip" to find out in person if those kids like pork chops. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
November 22, 2009 at 6:23 pm
C# Screw (11/22/2009)
:satisfied:!!!!Wait a minute - I thought I was onto something!!!!!!:satisfied:
:ermm: That works, yes, but you know it'll suck performance-wise, right? Look at the RBAR!!!
I'm trying to be helpful, but I know I sound like the Grinch:
1. Try to avoid MAX data types as parameters and variables where possible
2. Schema-bind your functions so the engine checks it for determinism
3. Schema-qualify your objects
4. Don't do row-by-agonizing-row data access in functions!
5. Actually, don't do data access at all in functions!!
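A sketch of what points 1-3 look like in practice, using the nested-REPLACE method from earlier in the thread (the function name and VARCHAR(8000) limit are illustrative choices):

```sql
-- Schema-qualified, schema-bound, fixed-width parameter, no data access:
CREATE FUNCTION dbo.fn_CollapseSpaces (@MyString VARCHAR(8000))
RETURNS VARCHAR(8000)
WITH SCHEMABINDING
AS
BEGIN
    RETURN REPLACE(REPLACE(REPLACE(
               LTRIM(RTRIM(@MyString)),
               SPACE(2), SPACE(1) + CHAR(7)),  -- changes 2 spaces to the OX model
               CHAR(7) + SPACE(1), SPACE(0)),  -- changes the XO model to nothing
               CHAR(7), SPACE(0));             -- changes the remaining X's to nothing
END;
```

Because it is schema-bound and touches no tables, the engine can verify it as deterministic, which matters if it is ever used in a computed column or indexed view.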
Paul
November 22, 2009 at 6:25 pm
WHERE CHARINDEX(RemovePattern, @MyString) != -1
Also, can you tell me when CHARINDEX will ever = -1? Wouldn't it be more effective to just have ...
WHERE CHARINDEX(RemovePattern, @MyString) > 0
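For reference, CHARINDEX returns 0 (not -1) when the pattern is not found, so the != -1 predicate is always true and filters nothing:

```sql
SELECT CHARINDEX('xyz', 'abcdef');  -- not found: returns 0
SELECT CHARINDEX('cde', 'abcdef');  -- found: returns 3 (1-based position)
```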
--Jeff Moden
Change is inevitable... Change for the better is not.
November 22, 2009 at 6:33 pm
Jeff Moden (11/22/2009)
Heh... Yep... that's what I got out of it, too, but I didn't say any such thing because of the handrails that a lot of folks have been trying to install lately... It was easier for me to just say that the fix is to put the COLLATE statement on all string constants. You, of course, are doing it the right way. Another "right" way would be to use two "unlikely" characters but that would also require a 4th replace to restore the first unlikely character to a space, perhaps negating the speed gain.
Appreciate the comments Jeff - thanks! I think introducing another replace would unfairly penalise your method (which I still admire BTW) simply due to a sneaky 2000 bug. I'm just glad it is fixed in 2005 and 2008.
November 22, 2009 at 6:34 pm
Paul White (11/22/2009)
C# Screw (11/22/2009)
:satisfied:!!!!Wait a minute - I thought I was onto something!!!!!!:satisfied:
:ermm: That works, yes, but you know it'll suck performance-wise, right? Look at the RBAR!!!
I'm trying to be helpful, but I know I sound like the Grinch:
1. Try to avoid MAX data types as parameters and variables where possible
2. Schema-bind your functions so the engine checks it for determinism
3. Schema-qualify your objects
4. Don't do row-by-agonizing-row data access in functions!
5. Actually, don't do data access at all in functions!!
Paul
I mostly agree with #5... but then there's things like Tally tables and lookup tables (both can be cached) that make it worth it especially when Cross Apply is used. I will agree that it certainly deserves a performance test.
I'll also agree that nested replaces will still beat the RBAR of doing single replaces, even if they are table-driven in a single query... we've already seen that in testing.
Just in case someone is confused (not you, Paul), not all seemingly "loopless" code is set-based... you also have to look at how many times each row or item is "touched". For example, a "Triangular Join" to do a running total on a lousy 10,000 rows will produce more than 50 MILLION rows of internal RBAR, which is why it's so very slow compared to even a Cursor.
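The arithmetic behind that figure: a triangular join matches each row with every row up to and including itself, so n rows generate n*(n+1)/2 internal rows. The shape of such a query (table and column names are hypothetical):

```sql
SELECT   t1.ID,
         RunningTotal = SUM(t2.Amount)
FROM     dbo.SomeTable AS t1
JOIN     dbo.SomeTable AS t2
         ON t2.ID <= t1.ID          -- the triangular condition
GROUP BY t1.ID;
-- Internal rows touched: 1 + 2 + ... + n = n*(n+1)/2
-- For n = 10,000: 10,000 * 10,001 / 2 = 50,005,000 rows
```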
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 241 through 255 (of 425 total)