June 27, 2017 at 11:16 pm
Comments posted to this topic are about the item Replace Bad Characters in Table
June 28, 2017 at 2:58 am
I doubt, that this will be fast. It does not use an explicit cursor, but practically does the same (or something worser) with its while loop.
If someone really needs this stuff, I'd suggest to wait for the SQL 2017 release and use the new TRANSLATE() function.
God is real, unless declared integer.
June 28, 2017 at 7:00 am
I agree with franz, this will be slow. I´m not sure about the performance of the TRANSLATE function (MS has done terrible things with FORMAT), but here's a much faster version that will actually be set based instead of using RBAR. The code is also much shorter.
WITH cteBook AS(
SELECT *,
STUFF(( SELECT SUBSTRING( TEXT, n, 1)
FROM Tally
WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin
AND n <= LEN(TEXT)
FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)'), 1, 1, '') CleanText
FROM #BOOK
)
UPDATE cteBook
SET TEXT = CleanText;
With 10,000 rows the method posted in the article takes several minutes. This code completes in less than one second. It can also be converted into an iTVF to ease its use.
June 28, 2017 at 8:15 am
Regardless of anything else, when doing string operations of this nature, there are two important best practices to follow:
Take care, Solomon..
SQL# — https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR
June 28, 2017 at 9:59 am
@Luis Cazares
Your code is faulty. That STUFF is not needed and it causes at least the first letter omission from the text. This is how it should be:
WITH cteBook AS(
SELECT
*,
(
SELECT SUBSTRING( TEXT, n, 1)
FROM Tally
WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin
AND n <= LEN(TEXT)
FOR XML PATH(''), TYPE
).value('./text()[1]', 'varchar(max)') CleanText
FROM #BOOK
)
UPDATE cteBook
SET TEXT = CleanText;
The WHERE clause in the CleanText subquery already filters out the characters that don't match the regular expression. However this only eliminates "bad" characters but does not replace them with something equivalent in the case of encoding mismatch. For instance, how do you solve a text like this:
Sir George Étienne Cartier Parkway
If you eliminate É the result will be an incomplete street name.
Don't just give the hungry man a fish, teach him how to catch it as well.
the sqlist
June 28, 2017 at 10:24 am
t.franz - Wednesday, June 28, 2017 2:58 AMI doubt, that this will be fast. It does not use an explicit cursor, but practically does the same (or something worser) with its while loop.If someone really needs this stuff, I'd suggest to wait for the SQL 2017 release and use the new TRANSLATE() function.
For people that don't have SQL Server 2017 you can use this Translate function for SQL Server. It's scalar, which I normally don't recommend, but it was faster than any inline table valued function that I could come up with.
-- Itzik Ben-Gan 2001
June 28, 2017 at 10:33 am
the sqlist - Wednesday, June 28, 2017 9:59 AM@Luis Cazares
Your code is faulty. That STUFF is not needed and it causes at least the first letter omission from the text. This is how it should be:
WITH cteBook AS(
SELECT
*,
(
SELECT SUBSTRING( TEXT, n, 1)
FROM Tally
WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin
AND n <= LEN(TEXT)
FOR XML PATH(''), TYPE
).value('./text()[1]', 'varchar(max)') CleanText
FROM #BOOK
)
UPDATE cteBook
SET TEXT = CleanText;
The WHERE clause in the CleanText subquery already filters out the characters that don't match the regular expression. However this only eliminates "bad" characters but does not replace them with something equivalent in the case of encoding mismatch. For instance, how do you solve a text like this:Sir George Étienne Cartier Parkway
If you eliminate É the result will be an incomplete street name.
You're right, I forgot to remove that when I included the template. The code becomes even simpler. If a replacement is needed, we would just need to use a CASE expression and remove the WHERE clause. I coded it to only remove characters because that's what the code from the article is doing. I'm usually against changing data like this because it's easy to get into trouble and change data that shouldn't be changed.
June 28, 2017 at 11:20 am
Luis Cazares - Wednesday, June 28, 2017 7:00 AMI agree with franz, this will be slow. I´m not sure about the performance of the TRANSLATE function (MS has done terrible things with FORMAT), but here's a much faster version that will actually be set based instead of using RBAR. The code is also much shorter.
WITH cteBook AS(
SELECT *,
STUFF(( SELECT SUBSTRING( TEXT, n, 1)
FROM Tally
WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin
AND n <= LEN(TEXT)
FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)'), 1, 1, '')CleanText
FROM #BOOK
)
UPDATE cteBook
SET TEXT = CleanText;
With 10,000 rows the method posted in the article takes several minutes. This code completes in less than one second. It can also be converted into an iTVF to ease its use.
It has been converted into a very well-tested, high-performing iSF: PatExclude8K. The solution would look like this:UPDATE #Book
SET [Text] = NewString
FROM #Book
CROSS APPLY dbo.patexclude8K([Text], '[^0-9a-zA-Z ,.:;'']');
It's also worth noting that a permanent tally table (if that's what the tally table referenced in your code is) will perform poorly compared to an inline (I'm aware that you know this Luis - I mention for other people's benefit).
I put together a test to compare the SELECT statement performance of Luis' solution against patexclude8K with a serial and parallel plan using a perm tally table, then again using an inline tally table. Here's the tally DDL:CREATE TABLE dbo.tally
(
N bigint NOT NULL,
CONSTRAINT pk_cl_dbo_tally PRIMARY KEY CLUSTERED (N ASC),
CONSTRAINT uq_nc_dbo_tally UNIQUE NONCLUSTERED (N ASC)
);
INSERT dbo.tally
SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM sys.all_columns a, sys.all_columns b;
USE tempdb
GO
IF OBJECT_ID('tempdb..#Book') IS NOT NULL DROP TABLE #Book;
CREATE TABLE #Book (iid int, TEXT varchar(8000));
INSERT #Book
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), REPLACE(newid(),'9',' ')
FROM sys.all_columns a, sys.all_columns b;
Test #1 - PatExclude8K vs solution with perm tally
PRINT 'Luis - serial'+char(10)+replicate('-', 50);
GO
DECLARE @st datetime = getdate(), @x varchar(8000);
SELECT @x =
( SELECT SUBSTRING( TEXT, n, 1)
FROM dbo.Tally
WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin
AND n <= LEN(TEXT)
FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)') --CleanText
FROM #Book
OPTION (MAXDOP 1);
PRINT DATEDIFF(MS, @st, getdate());
GO 5
PRINT char(10)+'PatExclude8K - serial'+char(10)+replicate('-', 50);
GO
DECLARE @st datetime = getdate(), @x varchar(8000);
SELECT @x = NewString
FROM #Book
CROSS APPLY dbo.patexclude8K([Text], '[^0-9a-zA-Z ,.:;'']')
OPTION (MAXDOP 1);
PRINT DATEDIFF(MS, @st, getdate());
GO 5
PRINT 'Luis - parallel'+char(10)+replicate('-', 50);
GO
DECLARE @st datetime = getdate(), @x varchar(8000);
SELECT @x =
( SELECT SUBSTRING( TEXT, n, 1)
FROM dbo.Tally
WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin
AND n <= LEN(TEXT)
FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(8000)') --CleanText
FROM #Book
CROSS APPLY dbo.make_parallel();
PRINT DATEDIFF(MS, @st, getdate());
GO 5
PRINT char(10)+'PatExclude8K - parallel'+char(10)+replicate('-', 50);
GO
DECLARE @st datetime = getdate(), @x varchar(8000);
SELECT @x = NewString
FROM #Book
CROSS APPLY dbo.patexclude8K([Text], '[^0-9a-zA-Z ,.:;'']')
CROSS APPLY dbo.make_parallel();
PRINT DATEDIFF(MS, @st, getdate());
GO 5
Results:
Luis - serial
--------------------------------------------------
Beginning execution loop
4387
3830
4030
4250
3917
Batch execution completed 5 times.
PatExclude8K - serial
--------------------------------------------------
Beginning execution loop
2314
2166
2300
2553
2224
Batch execution completed 5 times.
Luis - parallel
--------------------------------------------------
Beginning execution loop
1120
1153
1160
1387
1223
Batch execution completed 5 times.
PatExclude8K - parallel
--------------------------------------------------
Beginning execution loop
677
660
700
683
716
Batch execution completed 5 times.
Test #2 - Same test but with an inline tally for Luis' solution
PRINT 'Luis - serial(inline tally)'+char(10)+replicate('-', 50);
GO
DECLARE @st datetime = getdate(), @x varchar(8000);
SELECT @x =
( SELECT SUBSTRING( TEXT, n, 1)
FROM
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) a(x),
(VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) b(x),
(VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) c(x),
(VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) d(x)
) iTally(N)
WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin
AND n <= LEN(TEXT)
FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)') --CleanText
FROM #Book
OPTION (MAXDOP 1);
PRINT DATEDIFF(MS, @st, getdate());
GO 5
PRINT char(10)+'PatExclude8K - serial'+char(10)+replicate('-', 50);
GO
DECLARE @st datetime = getdate(), @x varchar(8000);
SELECT @x = NewString
FROM #Book
CROSS APPLY dbo.patexclude8K([Text], '[^0-9a-zA-Z ,.:;'']')
OPTION (MAXDOP 1);
PRINT DATEDIFF(MS, @st, getdate());
GO 5
PRINT 'Luis - parallel (inline tally)'+char(10)+replicate('-', 50);
GO
DECLARE @st datetime = getdate(), @x varchar(8000);
SELECT @x =
( SELECT SUBSTRING( TEXT, n, 1)
FROM
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) a(x),
(VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) b(x),
(VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) c(x),
(VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) d(x)
) iTally(N)
WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin
AND n <= LEN(TEXT)
FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)') --CleanText
FROM #Book
CROSS APPLY dbo.make_parallel();
PRINT DATEDIFF(MS, @st, getdate());
GO 5
PRINT char(10)+'PatExclude8K - parallel'+char(10)+replicate('-', 50);
GO
DECLARE @st datetime = getdate(), @x varchar(8000);
SELECT @x = NewString
FROM #Book
CROSS APPLY dbo.patexclude8K([Text], '[^0-9a-zA-Z ,.:;'']')
CROSS APPLY dbo.make_parallel();
PRINT DATEDIFF(MS, @st, getdate());
GO 5
Results
Luis - serial (inline tally)
--------------------------------------------------
Beginning execution loop
2376
2384
2323
2334
2387
Batch execution completed 5 times.
PatExclude8K - serial
--------------------------------------------------
Beginning execution loop
2217
2273
2307
2170
2370
Batch execution completed 5 times.
Luis - parallel (inline tally)
--------------------------------------------------
Beginning execution loop
684
663
694
716
664
Batch execution completed 5 times.
PatExclude8K - parallel
--------------------------------------------------
Beginning execution loop
677
746
827
730
700
Batch execution completed 5 times.
-- Itzik Ben-Gan 2001
December 11, 2019 at 2:25 pm
that's some awesome coding there! Thanks for the improvements 🙂
ST
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply