If you have a table of data that possibly contains characters that you don't want this script will remove the invalid characters.
You've probably seen this question asked a thousand times but most of the answers that I found usually asked, "how can I remove specific characters from a string?". In my case I didn't know the characters that needed to be replaced. The following script is as close as I could come to a full-on set-based or table-based solution. I've tested this against large sets of data and it remains pretty efficient.
Thanks to many great articles over the years by Jeff Moden which have always helped me to remember to stay away from any row by row processing.
For my test data I used a table of Aristotle quotes that was pulled in from OCR. I figured this would give me a fair amount of junk that I would want to remove. The statistics:
- 9,500 rows of data
- 282,000 characters of bad data
- 6,466,111 total characters
- 4.3% of the data is bad
- Total processing time 76 seconds
Setup
First create a tally table, basically a table of numbers. These tables are great to have in your DB. I suggest creating a permanent table in your DB but for this example we'll use all temporary tables. The numbers allow you to isolate individual characters in a string.
SET NOCOUNT ON CREATE TABLE #tally (idx int) -- Populate with values 1 through 8000 DECLARE @c int = 1 WHILE (@c <= 8000) BEGIN INSERT #tally (idx) values(@c) SET @c = @c + 1 END
Next, create the list of allowed characters including basic alpha numeric values and some punctuation. This list can be modified to contain any set of data. I used the tally table here as a way of converting my string into a table of allowed characters.
DECLARE @good varchar(50) = 'abcdefghijklmnopqrstuvwxyz 0123456789,.:;''' create TABLE #approved (letter VARCHAR(1)) INSERT #approved (letter) SELECT SUBSTRING(W.WORD, T.idx, 1) FROM (SELECT @good as WORD) W, #tally T WHERE T.idx <= LEN(@good)
For my example I'm using a book of Aristotle works that was imported from an OCR into a temp table so that the original data is untouched. I really only need the text data and an identifying key value. I've added an index to improve performance. As a side note, in testing without an index the entire script took easily 10 times longer to run. Always use indexes.
-- Drop temp tables DROP TABLE #BOOK GO DROP TABLE #BAD GO CREATE TABLE #BOOK (iid int, TEXT varchar(8000)) CREATE CLUSTERED INDEX TMP_Book ON #BOOK(iid) INSERT #BOOK (iid, TEXT) SELECT iid, TEXT from BOOK
Create a temporary table to hold your bad character information. Create an index which will improve the performance when joining this table back to the #BOOK table. Using the tally table, identify the bad characters in the data by left-joining each character to the list of good characters. I only need to capture the first instance of a specific character and its position in the string. Identifying and storing this data took about 8 seconds.
CREATE TABLE #BAD (iid int, idx int, processed int) CREATE CLUSTERED INDEX TMP_Bad ON #Bad(iid) INSERT #BAD (iid, idx, processed) SELECT A.iid, MIN(T.idx) as idx, 0 as processed FROM #BOOK A INNER JOIN #tally T ON LEN(A.TEXT) >= T.idx LEFT JOIN #approved B ON SUBSTRING(A.TEXT, T.idx, 1) = B.letter WHERE B.letter IS NULL GROUP BY A.iid, SUBSTRING(A.TEXT, T.idx, 1)
Now that I have a list of the positions of the bad characters in each of the rows of data, I can loop over the list of bad characters and replace them with an empty string. In an earlier version of this script I captured the character itself and called the REPLACE method using that character. I found that there was an odd brace character (probably a windows brace) that when captured was not found in the source data causing the REPLACE to fail to replace the character. This was probably due to some conversion issue.
The variable @SWAP holds the changing value of the string. The variable @VC holds the original value of the string. This allows us to capture the bad character based on its position in the original string.
DECLARE @VC varchar(8000), @SWAP varchar(8000), @iid INT WHILE EXISTS (SELECT processed from #BAD WHERE processed = 0) BEGIN -- Capture the text and the key SELECT TOP 1 @iid = B.iid, @VC = A.TEXT, @SWAP = A.TEXT FROM #BAD B INNER JOIN #BOOK A ON B.iid = A.iid WHERE B.processed = 0 -- Replace the bad character SELECT @SWAP = REPLACE(@SWAP, SUBSTRING(@VC, idx, 1), '') FROM #BAD WHERE iid = @iid -- Flag all BAD rows as processed for this iid UPDATE #BAD SET processed = 1 WHERE iid = @iid -- Update the book table with the new modified value UPDATE #BOOK SET TEXT = @SWAP WHERE iid = @iid END
Finally, run the original query to find bad characters and see that they've all been replaced.
SELECT A.iid, A.TEXT, SUBSTRING(A.TEXT, T.idx, 1) AS BAD, T.idx FROM #BOOK A INNER JOIN #tally T ON LEN(A.TEXT) >= T.idx LEFT JOIN #approved B ON SUBSTRING(A.TEXT, T.idx, 1) = B.letter WHERE B.letter IS NULL