December 6, 2010 at 3:41 pm
telastruct (12/6/2010)
Here's another approach: http://iso30-sql.blogspot.com/2010/10/remove-non-printable-unicode-characters.html
Thanks for the link and welcome aboard! Shifting gears, please be sure to test the code from that link for performance before you think of using it. It's likely to be a whole lot slower than the code already posted on this thread because of the scalar function with a WHILE loop in it. The "nested REPLACE" code posted on this thread will absolutely fly.
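In case you haven't seen the nested-REPLACE idea yet, the general shape of it is just this sketch (the table and column names are placeholders, and the full version posted earlier in the thread covers the whole CHAR(0) through CHAR(31) range):
-- Sketch only; the real code nests a REPLACE for each non-printable character.
UPDATE dbo.SomeTable
   SET SomeColumn = REPLACE(REPLACE(REPLACE(SomeColumn,
                        CHAR(9),  ''),  -- tab
                        CHAR(10), ''),  -- line feed
                        CHAR(13), '');  -- carriage return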
--Jeff Moden
Change is inevitable... Change for the better is not.
December 30, 2010 at 9:16 am
Actually, I did test it and it works great. I tested it on a table with almost 4 million rows and it only took a minute or two.
December 30, 2010 at 12:44 pm
telastruct (12/30/2010)
Actually, I did test it and it works great. I tested it on a table with almost 4 million rows and it only took a minute or two.
Now try the 'nested REPLACE' code Jeff referred to.
You should find it's much faster (and more efficient).
Actually.
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
December 30, 2010 at 1:21 pm
telastruct (12/30/2010)
Actually, I did test it and it works great. I tested it on a table with almost 4 million rows and it only took a minute or two.
As Paul said, test the nested-REPLACE method... you'll find that 4 million rows are processed in just several seconds... 😉 That's provided that you're not outputting the 4 million rows to the display, which is the "great equalizer". Lots of people think their code runs well when they compare it to other code only by how long it takes to render on the screen. Bad code and good code will frequently take the same time on-screen simply because they have the same number of rows to display.
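If you want to take the display out of the equation when you test, dump the results into a throw-away variable instead of returning them (the table and column names below are just placeholders):
-- The variable soaks up the output so you time the code, not the screen.
DECLARE @Bitbucket VARCHAR(8000);
SELECT @Bitbucket = clean.Cleaned
  FROM dbo.SomeTable t
 CROSS APPLY dbo.IF_CleanWithReplace(t.SomeColumn) clean;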
--Jeff Moden
Change is inevitable... Change for the better is not.
December 30, 2010 at 8:30 pm
Thanks, guys. I apologize if my "actually" seemed snitty - I see how it could be taken that way after re-reading it. I guess I should have followed it up with the question: if the REPLACE method "can clean the 13 million rows in about 25 minutes" (about 2 minutes for each million rows), how is that faster than 2 minutes for about 4 million rows? I'm not a DB expert, and I'm sure there's some other factor I'm missing...?
Even if it's not as efficient, I do like the way the other solution is packaged - it seems like the REPLACE method would be pretty easy to incorporate.
Thank you again for your thoughts.
December 31, 2010 at 1:39 pm
telastruct (12/30/2010)
Thanks, guys. I apologize if my "actually" seemed snitty - I see how it could be taken that way after re-reading it. I guess I should have followed it up with the question: if the REPLACE method "can clean the 13 million rows in about 25 minutes" (about 2 minutes for each million rows), how is that faster than 2 minutes for about 4 million rows? I'm not a DB expert, and I'm sure there's some other factor I'm missing...?
Even if it's not as efficient, I do like the way the other solution is packaged - it seems like the REPLACE method would be pretty easy to incorporate.
Thank you again for your thoughts.
Understood and thanks for the feedback.
The real key is that we have two different rowcounts on two different tables on two different machines by two different people using two different methods that haven't been fully posted. For example, are they doing an UPDATE, piping to a new table, or displaying on the screen? What's the state of fragmentation for both tables? How many indexes are being updated? Are the systems using the same version? Are the systems using the same capacity of pipe and hard drives? What is the wait-state of the two systems? Are they both "dead quiet", or is one supporting a worldwide, heavily-hit Web site with lots of reporting going on?
There's something else that folks need to understand... each system has a "tipping point" where UPDATEs are concerned. For example... on one system it may take only 10 seconds to update a million rows and, likewise, only 20 seconds to update 2 million rows and only 30 seconds to update 3 million rows. But, on that same system with the same table and index structures using the exact same query, it may suddenly take 4 hours to update only 4 million rows. Everything else being the same, another system may be able to handle 6 million rows without reaching the "tipping point".
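A common way to stay under that "tipping point", by the way, is to do the UPDATE in batches. A rough sketch (the batch size, table, and column names are illustrative only):
-- Each pass updates up to 100,000 rows so the log and locks stay small.
-- Substitute the full non-printable character list from earlier in the thread.
WHILE 1 = 1
BEGIN
    UPDATE TOP (100000) t
       SET SomeColumn = REPLACE(REPLACE(REPLACE(t.SomeColumn,
                            CHAR(9), ''), CHAR(10), ''), CHAR(13), '')
      FROM dbo.SomeTable t
     WHERE t.SomeColumn LIKE '%[' + CHAR(9) + CHAR(10) + CHAR(13) + ']%';
    IF @@ROWCOUNT = 0 BREAK; -- no more dirty rows
END;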
If you want to do a comparison, it has to be on the same table and the same data, or at least data with the same level of randomization. You just can't tell what's what with all the other variables currently in play.
I'll also admit that I've seen certain memory-only scalar functions beat other methods even when they have a WHILE loop in them. But we won't actually know until at least one person tests both methods on the same data on the same machine. I was hoping you'd be that person so I didn't actually have to be "the one" yet again. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
July 25, 2012 at 11:17 am
This is exactly what I was looking for. My data had ASCII(160) appended to it. I just added that to the list and it worked very well.
-MK
May 18, 2013 at 10:01 am
This is what I have been looking for, and forgive me if this is a simple question (I am trying to learn some of these concepts, like in-line table-valued functions), but how do you easily apply this to an entire table? How do you even use this? It returns a table (not permanent), so I selected the result set into a new table, but I don't know if that defeats the point of using the TVF in the first place.
This:
select * into dbo.testtable from
(SELECT t1.id as id, t1.title as title, clean.cleaned as cleaned from t1
cross apply dbo.if_cleanwithreplace(t1.title) clean) a
fixes one column
May 25, 2013 at 2:14 pm
ryan.blosser (5/18/2013)
This is what I have been looking for, and forgive me if this is a simple question (I am trying to learn some of these concepts, like in-line table-valued functions), but how do you easily apply this to an entire table? How do you even use this? It returns a table (not permanent), so I selected the result set into a new table, but I don't know if that defeats the point of using the TVF in the first place.
This:
select * into dbo.testtable from
(SELECT t1.id as id, t1.title as title, clean.cleaned as cleaned from t1
cross apply dbo.if_cleanwithreplace(t1.title) clean) a
fixes one column
You can simplify that query quite a bit. No need for the nesting.
SELECT t1.ID, t1.Title, clean.Cleaned
INTO dbo.TestTable
FROM t1
CROSS APPLY dbo.if_cleanwithreplace(t1.Title) clean
;
The purpose of the iTVF, in this case, is to function more like an iSF (Inline Scalar Function), which is typically about 7 times faster than a run-of-the-mill scalar function. See the following article for more on that.
http://www.sqlservercentral.com/articles/T-SQL/91724/
Putting the results in a table doesn't negate the value of such functions. It depends on what you want to do. The results could just as easily be returned directly.
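In skeleton form, the pattern looks about like this (the function name here is made up and its body is deliberately trivial):
-- A single-statement iTVF, so the optimizer can inline it like a macro.
CREATE FUNCTION dbo.ISF_Sample (@pString VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN SELECT Cleaned = REPLACE(@pString, CHAR(9), '');
GO
-- Used through CROSS APPLY, it behaves like an inline scalar function.
SELECT t1.ID, t1.Title, clean.Cleaned
  FROM t1
 CROSS APPLY dbo.ISF_Sample(t1.Title) clean;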
--Jeff Moden
Change is inevitable... Change for the better is not.
May 21, 2014 at 8:25 am
Interesting solution.
How would you handle multiple columns to clean? Doing multiple CROSS APPLYs seems like it would kill performance, since you could theoretically be joining back to the same 4-million-row table two, three, or more times.
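Something like this is what I mean (table and column names made up for the example):
SELECT t.ID, c1.Cleaned AS Title, c2.Cleaned AS Description, c3.Cleaned AS Notes
  FROM dbo.SomeBigTable t
 CROSS APPLY dbo.IF_CleanWithReplace(t.Title) c1
 CROSS APPLY dbo.IF_CleanWithReplace(t.Description) c2
 CROSS APPLY dbo.IF_CleanWithReplace(t.Notes) c3;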
Thanks.
October 28, 2015 at 11:54 am
Wow, adding the bin collation really does affect performance greatly!
I just tried this on one of my tables that has 12M rows, and it went from 28 seconds to scrub all the rows down to only 8 seconds with NO OTHER CHANGES.
A side note: I changed this function to use NVARCHAR(4000) instead of VARCHAR(8000), as that was my need, and then also changed all the CHAR() references to NCHAR() and the '' empty strings to N''. I found no noticeable difference in speed; it was actually about 100ms faster on my few tries, perhaps because with the binary collation all the replacements can do their thing without any implicit conversions between single-byte and double-byte.
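In miniature, the pattern of the change was just this sort of thing (the exact BIN collation name is whatever the function already uses):
DECLARE @s NVARCHAR(4000) = N'abc' + NCHAR(9) + N'def';
-- NCHAR() and N'' literals replace CHAR() and '', still under the BIN collation.
SELECT REPLACE(@s COLLATE Latin1_General_BIN, NCHAR(9), N''); -- returns abcdef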
Love this forum.
October 28, 2015 at 12:08 pm
And another side note: that 8 seconds for 12M rows using NVARCHAR(4000) and the BIN collation turns into 36 seconds when using NVARCHAR(MAX) instead of NVARCHAR(4000). I know everyone said it hurts performance to use MAX; I'm just including some numbers here so people can grasp the exact scale of the performance hit... it's about a 4.5x slowdown.
October 29, 2015 at 1:17 pm
Paul White (4/6/2010)
carl.anderson-1037280 (4/6/2010)
I guess I have to read up on collations now!
Worth doing, but the reason it is important here is that the sorting and comparison rules are very complex for Windows collations, much less so for SQL collations, and pretty simplistic for binary collations. See https://connect.microsoft.com/SQLServer/feedback/details/512459/replace-function-extremely-slow-with-non-binary-windows-collation
PS> I replaced the SPACE(0) with '' - it didn't seem to change performance. What's the idea behind SPACE(0)??
I just find SPACE(0) more explicit and easier to read than ''.
The difference between '' and ' ' is not always easy to spot in scripts, whereas SPACE(0) is clearly not the same as SPACE(1) 🙂
OK, one question I have (and although I've been doing this for like 15 years, mucking with collations is something new to me this year): how does one switch the collation back to the database default before the value gets output from the TVF? When I UNION the results of this with another table that hasn't been cleaned/stripped like this, it gives an error about not being able to resolve the differing collations. But if you put COLLATE DATABASE_DEFAULT at the end of the function code, it gives an error about putting COLLATE on an expression that already has a COLLATE... so the only way I can think of doing it is to put a COLLATE in the SQL statement that calls the TVF in the first place (I'll show what I mean below). That feels a little clunky, though, having to essentially cast everything outside the function when it makes more sense to do everything that needs doing inside the function.
In other words, how can we use the BIN collation for the speed of all the REPLACE() calls, but switch back to the database default collation when we return the cleaned string to the calling statement?
What am I missing here?
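For reference, here's the "clunky" version I mentioned, with the COLLATE tacked on in the calling statement (table and column names are placeholders):
SELECT t.ID, clean.Cleaned COLLATE DATABASE_DEFAULT AS Cleaned
  FROM dbo.SomeTable t
 CROSS APPLY dbo.IF_CleanWithReplace(t.SomeColumn) clean;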
November 3, 2015 at 12:04 pm
One last thing...BIN vs BIN2 collation? Does it matter?
November 4, 2015 at 5:44 am
Here's the updated code for anyone interested:
CREATE FUNCTION dbo.IF_CleanWithReplace
[...]
WITH SCHEMABINDING
[...]
In addition to allnelsons' questions above, I'm curious as to the use of SCHEMABINDING here.
In BOL, Schemabinding "Specifies that the function is bound to the database objects that it references. When SCHEMABINDING is specified, the base objects cannot be modified in a way that would affect the function definition. [...]"
But there are no database objects referenced by the function!
Many thanks (Jeff, Paul and Carl) for a fantastic article and a great script. 😀
Mark