October 18, 2011 at 11:13 am
Hi guys,
This is an interesting problem that I'm trying to solve that we normally do in C#, but for my specific project here for this coming weekend, I need to do directly in SQL.
I have a bunch of strings (names, addresses, cities, etc) that I need to replace any nonstandard alphanumeric characters with their HTML-friendly names, i.e. replace é with é
The only issue here is there are around 100 possible characters that I need to replace. I have a lookup table that I use in C#, but I can't think of any way to implement the same thing in SQL without having 98 nested replaces or a cursor. This will be processing around 10 fields per record on around 400k records, so I can't have completely awful performance. Slow is OK...if selecting 30,000 row batches goes from 1 second to 20 seconds, that's fine for my purposes...1 second to 30 minutes is not.
Here is some sample code with the symbol table and data. Again, I'm trying to find a way to do this without huge nested replace statements or a cursor/etc because there are around 100 possible characters to replace:
DECLARE @Symbols TABLE (
Symbol NCHAR(1),
HTML NVARCHAR(20)
)
INSERT INTO @Symbols
SELECT '¡','¡' UNION ALL
SELECT '¢','¢' UNION ALL
SELECT '£','£' UNION ALL
SELECT '¤','¤' UNION ALL
SELECT '¥','¥' UNION ALL
SELECT '¦','¦' UNION ALL
SELECT '§','§' UNION ALL
SELECT '¨','¨'
DECLARE @Text NVARCHAR(50)
SET @Text = 'Replace ¢ and £ and § with HTML-friendly strings'
SELECT @Text
Any help would be appreciated. Thanks in advance 😀
October 18, 2011 at 12:34 pm
This is actually much simpler than it seems, because SELECT is implemented behind the scenes as a loop. Jeff Moden refers to this as a pseudo-cursor. You can do this in one SELECT statement.
DECLARE @Symbols TABLE (
Symbol NCHAR(1),
HTML NVARCHAR(20)
)
INSERT INTO @Symbols
SELECT '¡','¡' UNION ALL
SELECT '¢','¢' UNION ALL
SELECT '£','£' UNION ALL
SELECT '¤','¤' UNION ALL
SELECT '¥','¥' UNION ALL
SELECT '¦','¦' UNION ALL
SELECT '§','§' UNION ALL
SELECT '¨','¨'
DECLARE @Text NVARCHAR(50)
SET @Text = 'Replace ¢ and £ and § with HTML-friendly strings'
SELECT @Text = Replace(@Text, Symbol, HTML)
FROM @Symbols
WHERE PatIndex('%' + Symbol + '%', @Text) > 0
SELECT @Text
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
October 18, 2011 at 2:04 pm
Thanks Drew - and Jeff if he reads this.
Only issue I'm running into now is that it's replacing characters that shouldn't be, i.e.
DECLARE @1 NVARCHAR(200),
@2 NVARCHAR(200)
SELECT @1 = 'athfawthfawf', @2 = 'þ'
SELECT PATINDEX('%' + @2 + '%',@1)
Any occurrence of 'th' will be considered the same as 'þ' and be replaced, along with quite a few other symbols.
I've tried different collations, even binary, but no luck in stopping it from doing this. Any ideas here?
October 18, 2011 at 2:44 pm
I was able to get it to work with the following:
SELECT @Text = Replace(@Text COLLATE LATIN1_GENERAL_BIN, Symbol, HTML)
FROM @Symbols
WHERE PatIndex('%' + Symbol + '%', @Text) > 0
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
October 18, 2011 at 2:56 pm
here's anotehr version i just got working as a function;
it's using a Tally table to swap char by char, based on the ascii() code of the character in question: i added a list of html entities i had on hand, so my list is pretty large, and those i did not have a default of i used the & # 255 ; format if i didn't have an explicit name.
SELECT dbo.HTMLizeHighAscii( 'Replace ¢ and £ and § with HTML-friendly strings')
i had a ton of problems getting the function to paste cleanly into the forum, because it contains html codes, so here's a link to the text file of it instead.
Lowell
October 18, 2011 at 3:07 pm
Thanks again. Drew, your suggestion worked...I was collating when I set the variable inside the function but not inside the replace statement...once I moved it there, everything seems honky dory now.
Lowell I'll give that function a quick look too to see how it does it. I have a feeling they'll have more unusual requests like this going forward and might need a more robust solution.
You guys are awesome, as usual. Now to go play around with pseudo-cursors and see what other stuff I can rewrite..
October 18, 2011 at 3:19 pm
Lowell (10/18/2011)
i had a ton of problems getting the function to paste cleanly into the forum, because it contains html codes, so here's a link to the text file of it instead.
I had to entitize the ampersand twice. So in order to get ¥ I had to put in ¥
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply