June 21, 2010 at 8:16 am
I know this should be easy, but I can't seem to get it. I have a database with a table that has 800 million rows of data that contains any number of extra spaces and non-ASCII characters which are causing my front-end application to have issues. I need an SP that I can run that goes through replaces non-ASCII characters with spaces and reduces multi-blank spaces with single blank spaces. It needs to run search through each column of the table.
On the non-ASCII characters: I have no idea how to search and replace for them. I would think that somewhere someone has a functions/SP that will do it, but not that I can find.
On the blank spaces: This functions works great.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER FUNCTION [dbo].[cleanString] (@s varchar(50))
RETURNS varchar(50)
AS
BEGIN
declare @i int, @l int
declare @t varchar(50), @d varchar(50)
select @s-2=LTRIM(RTRIM(@s))
select @i=CHARINDEX(' ',@s) -1
if @i<1 return @s-2
set @t=''
while @i>0
begin
set @d=LEFT(@s,@i)
set @t = @t + ' ' + @d
set @l=LEN(@s)
set @s-2=LTRIM(right(@s,@l-@i-2))
set @i=CHARINDEX(' ',@s)-1
end
RETURN ( ltrim(@t + ' ' + @s-2) )
END
However, It will only work on one column at a time. I need it to run across the entire row before moving on.
In short, I need an SP that will take a row of data, replace non-ASCII characters with spaces, and then remove extra blank spaces.
Thanks
June 21, 2010 at 8:50 am
--First, lets build some test data
DECLARE @table TABLE(
data VARCHAR(20));
DECLARE @cnt INT
--2,500,000 rows
SET @cnt=2500000
WHILE @cnt > 0
BEGIN
SET @cnt=@cnt - 1
INSERT INTO @table
SELECT Upper(LEFT(Newid(), 3)) + ' ' + '$' + Upper(LEFT(Newid(), 1)) +
Lower
(
LEFT(Newid(), 12))
END
--Now the query
;WITH num1 (n)
AS (SELECT 1
UNION ALL
SELECT 1),
num2 (n)
AS (SELECT 1
FROM num1 AS data,
num1 AS data2),
num3 (n)
AS (SELECT 1
FROM num2 AS data,
num2 AS data2),
num4 (n)
AS (SELECT 1
FROM num3 AS data,
num3 AS data2),
nums (n)
AS (SELECT Row_number() OVER(ORDER BY n)
FROM num4),
cleaner
AS (SELECT data,
(SELECT CASE
WHEN Substring(data, n, 1) NOT LIKE '%[^a-zA-Z0-9_]%'
THEN
Substring(data, n, 1)
ELSE ''
END + ''
FROM nums
WHERE n <= Len(data)
FOR XML PATH('')) AS new_data
FROM @table)
SELECT data,
new_data
FROM cleaner;
How's that?
/*Output
data new_data
-------------------- --------------------
3B6 $063761ccd-1ae 3B6063761ccd1ae
649 $99aa55943-14c 64999aa5594314c
BFA $36426bb2d-0dd BFA36426bb2d0dd
D45 $40f3b40c7-5cf D4540f3b40c75cf
812 $0f8b0e0e7-cef 8120f8b0e0e7cef
B7A $06006ba34-52f B7A06006ba3452f
BCD $F6f27b65a-703 BCDF6f27b65a703
824 $B0ea2c21a-021 824B0ea2c21a021
A3A $Fdcac2ed5-306 A3AFdcac2ed5306
9A4 $9e812cb39-c26 9A49e812cb39c26
3CF $9cb7feeeb-18f 3CF9cb7feeeb18f
221 $8c78a1fca-d0c 2218c78a1fcad0c
D24 $Bdbac8ffa-8f3 D24Bdbac8ffa8f3
064 $F7c34b3c0-dd8 064F7c34b3c0dd8
154 $91a737a45-118 15491a737a45118
*/
*EDIT*
Sorry, I misread the question as "can I remove non-alphanumberic character". I'll have a look either later on or tomorrow if someone else hasn't already given you a query.
June 22, 2010 at 2:23 am
Right, modified. Not too happy with it, but it works.
--First, lets build some test data
DECLARE @table TABLE(
data VARCHAR(20));
DECLARE @cnt INT
--2,500,000 rows
SET @cnt=2500000
WHILE @cnt > 0
BEGIN
SET @cnt=@cnt - 1
INSERT INTO @table
SELECT Upper(LEFT(Newid(), 3)) + ' ' + '$' + Upper(LEFT(Newid(), 1)) +
Lower (LEFT(Newid(), 12))
END
--Now the query
;WITH num1 (n)
AS (SELECT 1
UNION ALL
SELECT 1),
num2 (n)
AS (SELECT 1
FROM num1 AS data,
num1 AS data2),
num3 (n)
AS (SELECT 1
FROM num2 AS data,
num2 AS data2),
num4 (n)
AS (SELECT 1
FROM num3 AS data,
num3 AS data2),
nums (n)
AS (SELECT Row_number() OVER(ORDER BY n)
FROM num4),
cleaner
AS (SELECT data,
(SELECT CASE
WHEN Substring(data, n, 1) LIKE '%[a-zA-Z0-9]%' THEN
Substring(data, n, 1)
ELSE ' '
END + '~FAKE~'
FROM nums
WHERE Len(data) >= n
FOR XML PATH('')) AS new_data
FROM @table)
SELECT data,
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(new_data, '%~FAKE~ ', ''),
'~FAKE~', '')
, ' ',
'<>'), '><', ''), '<>', ' ') AS new_data
FROM cleaner;
/*OUTPUT
data new_data
-------------------- -----------------------
CC1 $66b8c36dd-152 CC1 66b8c36dd 152
752 $54fa19f95-227 752 54fa19f95 227
743 $6b2d27a15-da6 743 6b2d27a15 da6
79D $7206ec154-c00 79D 7206ec154 c00
C3A $E03edea79-f7b C3A E03edea79 f7b
5CD $9f354f659-177 5CD 9f354f659 177
FE8 $22407f934-ef2 FE8 22407f934 ef2
A5C $Dda09ace8-8e2 A5C Dda09ace8 8e2
E50 $42bc1c44d-23c E50 42bc1c44d 23c
5C3 $1ec89932d-d0f 5C3 1ec89932d d0f
605 $F6afded0e-ecd 605 F6afded0e ecd
68D $F3a5e446f-ef7 68D F3a5e446f ef7
020 $E12e651bc-fbf 020 E12e651bc fbf
7B3 $3b3f8db13-4ab 7B3 3b3f8db13 4ab
562 $46108cf5e-785 562 46108cf5e 785
*/
*EDIT*
Spotted a bug with it 10mins after posting. Fixed now, should be working full.
Let me know if that's right for you! :hehe:
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply