March 3, 2005 at 6:14 am
We have a SQL 2000 stored procedure which will remove spaces, punctuation characters etc from a specified field (i.e. to cleanup/standardize business names temporarily before matching) but the performance is TERRIBLE. It uses a cursor and steps through one record at a time then looks at one character at a time using PATINDEX and removing characters and updating the final result before going to the next record. I have not been able to figure out a way to convert it to a batch process to speed it up. I have 950,000 rows to process and using the stored procedure only completed 40,000 rows in 15 hours! Does anyone have or know of a more efficient compress function or script. The only option I can think of at this time is to export to text and use the SAS compress function on it then upload back to SQL Server. Thanks, Larry Hilyard
larry.hilyard@bcbsfl.com
March 3, 2005 at 6:50 am
Have you searched the script section here?
Anyway, it would also help if you could post the procedure along with sample data and the desired output.
--
Frank Kalis
Microsoft SQL Server MVP
Webmaster: http://www.insidesql.org/blogs
My blog: http://www.insidesql.org/blogs/frankkalis/[/url]
March 3, 2005 at 8:28 am
I did search the posts in the last year in the programming forum for any related items with no applicable result. I am not familiar with the site layout so have not yet found a separate section with scripts (I only see a link to recent scripts on the opening site page). Here is the desired input and output:
BusNm (Input) (Desired Output)
BRINK'S CO BRINKSCO
SONNY'S REAL PIT BAR-B-Q SONNYSREALPITBARBQ
R J GATOR'S RJGATORS
Here is a SAS statement that will yield the desired output:
/* Update work tables in PCSAS */
data temp;
set Work.GdbDump;
BusNm = compress(BusNm,' ."<>!#&,'); /* remove spaces and punctuation except sngl quote */
BusNm = compress(BusNm,''''); /* remove single quotes */
run;
Here is the SQL Server stored procedure:
CREATE PROCEDURE sp_compress2 @StringToCompress varchar(255), @ReturnVal varchar(255) OUTPUT
AS
/*********************************************************************
sp_compress
This stored procedure removes certain characters from a string and
returns the result.
Usage:
DECLARE @FullName
DECLARE @CompressedFullName
EXECUTE sp_compress @FullName, @CompressedFullName OUTPUT
If @FullName="DOE, JOHN M", then @CompressedFullName="DOEJOHNM"
MODIFICATIONS HISTORY
---------------------
2000-03-13 PC Creation
2005-03-02 LH Modify to remove single quote also
*********************************************************************/
SET NOCOUNT ON
SET QUOTED IDENTIFIER OFF
declare @ResultTx varchar(255)
declare @CompressPosNb tinyint
select @ResultTx = @StringToCompress
/* Compress ' ' (spaces) */
select @CompressPosNb = PATINDEX ('% %', @ResultTx)
while @CompressPosNb <> 0
begin
select @ResultTx = substring(@ResultTx, 1, @CompressPosNb - 1) + substring(@ResultTx, @CompressPosNb + 1, len(@ResultTx))
select @CompressPosNb = PATINDEX ('% %', @ResultTx)
end
/* Compress '-' */
select @CompressPosNb = PATINDEX ('%-%', @ResultTx)
while @CompressPosNb <> 0 /* and @CompressPosNb <> len(@ResultTx) */
begin
select @ResultTx = substring(@ResultTx, 1, @CompressPosNb - 1) + substring(@ResultTx, @CompressPosNb + 1, len(@ResultTx))
select @CompressPosNb = PATINDEX ('%-%', @ResultTx)
end
/* Compress "'" */
select @CompressPosNb = PATINDEX ("%'%", @ResultTx)
while @CompressPosNb <> 0 /* and @CompressPosNb <> len(@ResultTx) */
begin
select @ResultTx = substring(@ResultTx, 1, @CompressPosNb - 1) + substring(@ResultTx, @CompressPosNb + 1, len(@ResultTx))
select @CompressPosNb = PATINDEX ("%'%", @ResultTx)
end
/* Compress '.' (period) */
select @CompressPosNb = PATINDEX ('%.%', @ResultTx)
while @CompressPosNb <> 0
begin
select @ResultTx = substring(@ResultTx, 1, @CompressPosNb - 1) + substring(@ResultTx, @CompressPosNb + 1, len(@ResultTx))
select @CompressPosNb = PATINDEX ('%.%', @ResultTx)
end
select @ReturnVal = @ResultTx
set nocount off
GO
And finally, here is the SQL script which calls the stored procedure within a cursor:
/* SQL Script to process table using stored procedure */
SET ANSI_NULLS OFF
SET NOCOUNT ON
DECLARE @UnqId int
DECLARE @FullName varchar(60)
DECLARE @CompressedFullName varchar(60)
DECLARE @RowCt int
DECLARE @RowTotalCt int
DECLARE GetUnqId CURSOR FOR
SELECT distinct UnqId from test
OPEN GetUnqId
FETCH GetUnqId INTO @UnqId
select @RowCt = 0
select @RowTotalCt = @@CURSOR_ROWS
WHILE @@fetch_status = 0
BEGIN
select @FullName = null
select @CompressedFullName = null
select @FullName =
(select BusNm from test
where UnqId = @UnqId)
EXECUTE sp_compress2 @FullName, @CompressedFullName OUTPUT
IF COALESCE(@CompressedFullName,'Y') <> 'Y'
begin
UPDATE test
SET BusNm = @CompressedFullName
WHERE UnqId=@UnqId AND BusNm = @FullName
end
/* Next UnqId */
select @FullName = null
select @CompressedFullName = null
CHECKPOINT
FETCH GetUnqId INTO @UnqId
select @RowCt = @RowCt + 1
if (@RowCt % 10000 = 0 OR @RowCt = @RowTotalCt)
select convert(char(25), getdate()) + 'Processed ' + convert(varchar(6), @RowCt) + ' rows of ' + convert(varchar(6), @RowTotalCt)
END /* (WHILE) */
DEALLOCATE GetUnqId
SET NOCOUNT OFF
GO
Thanks for looking at it.
larry.hilyard@bcbsfl.com
March 4, 2005 at 1:14 am
Well, if it's just removing certain characters, I would recommend using REPLACE - this will work without cursor. You can nest several REPLACEs, which makes it a lot easier.
SELECT REPLACE(BusNm, '-','') FROM YourTable
removes all occurences of the minus sign (-). Syntax is replace WHERE, WHAT, WITH WHAT. If WITH WHAT is empty string (two single quotes), the character is simply removed.
SELECT REPLACE(REPLACE(BusNm, '-',''),' ','') FROM YourTable
removes all occurrences of both minus sign and space at once - and so on. Very simple and efficient. Just make sure you use the correct syntax when removing single quote, like here:
SELECT REPLACE(BusNm, '''', '') FROM YourTable
... because single quote is text delimiter, so you have to double it inside the single quotes (i.e. put 2 single quotes there, not a double quote!) to tell SQL Server that this time it is meant as a character, not as a delimiter. Good luck!
March 4, 2005 at 1:34 am
Just a quick comment on this - removing data, especially characters like '-' and ' ', can leads to incorrect results. For example, 3 different companies:
'SuperTech'
'Super-Tech'
'Super Tech'
All become:
'SuperTech'
Just a thought....
March 4, 2005 at 1:47 am
Wanderer,
as I understood Larry, this is not a permanent removal - it is used just to create a search pattern or matchcode or whatever you want to call it. In such case it is always better to skip certain characters, because people tend to differ in writing the same company name or address - someone uses 356/15, someone 356-15 as a house number, someone adds one more empty space between two words, but it is still the same house or company... or at least it could be the same. Results of such search are rarely used directly, in most cases if several are found, they are displayed to the user and user selects the one that is correct.
Well, at least that's how we use the matchcode
March 4, 2005 at 1:52 am
Vladan,
True enough, but given 3 validly (is there even such a word ? ) different companies, they might incorrectly match...
As I said - just a thought...
A more important thougt is ... IT'S FRIDAY. enjoy the weekend..
March 4, 2005 at 4:23 am
Sorry for being late on this.
Actually, I think there is more than one way to solve this.
Given your sample data:
set nocount on
use tempdb
create table #cleanme
(
c1 varchar(30)
)
insert into #cleanme values('BRINK''S CO')
insert into #cleanme values('SONNY''S REAL PIT BAR-B-Q ')
insert into #cleanme values('R J GATOR''S')
set nocount off
As has been mentioned, you can use REPLACE. So, something like this maybe:
select replace(replace(replace(c1,'''',''),' ',''),'-','') from #cleanme
---------------------
BRINKSCO
SONNYSREALPITBARBQ
RJGATORS
(3 row(s) affected)
And, another way, which I would perhaps prefer as it cleans up all noncharacters, is to use a UDF like this
CREATE FUNCTION dbo.RemoveNonChars(@Input varchar(1000))
RETURNS VARCHAR(1000)
BEGIN
DECLARE @pos INT
SET @Pos = PATINDEX('%[^a-Z]%',@Input)
WHILE @Pos > 0
BEGIN
SET @Input = STUFF(@Input,@pos,1,'')
SET @Pos = PATINDEX('%[^a-Z]%',@Input)
END
RETURN @Input
END
GO
SELECT dbo.RemoveNonChars(c1) from #cleanme
---------------------
BRINKSCO
SONNYSREALPITBARBQ
RJGATORS
(3 row(s) affected)
--
Frank Kalis
Microsoft SQL Server MVP
Webmaster: http://www.insidesql.org/blogs
My blog: http://www.insidesql.org/blogs/frankkalis/[/url]
March 4, 2005 at 6:37 am
Just another thaught on this. If this search is occuring often and considering the pretty great number of rows, wouldn't it be better just to create a new column with all the replaced characters. It would only need to be maintaned through triggers when the data is inserted or modified so that you wouldn't have to rerun the replace code every time someones enters a search. The great thing about this method is that you could use full-text indexing on this column which would yield far greater performance than any version of the replace function you can write and would be far less costly on the cpu to maintain.
March 4, 2005 at 7:19 pm
Frank has a great point above removing all characters except for letters. However, if you have a predefined list you want to remove, the technique below works really well.
The big problem with the techniques above is that they are not set based. You'll never get the performance you need with these techniques. The one below, however, is, and it shows in the performance.
-------------------------------------------------------------------------------------------
use pubs
create table MatchCodeRemoveChar
(Name char(1) Primary Key
)
Insert MatchCodeRemoveChar
select '<'
union all
select '>'
union all
select '!'
union all
select '#'
union all
select '&'
union all
select ','
union all
select ''''
go
create function getMatchCode (@MatchValue varchar(255))
returns varchar(255)
as
BEGIN
select @MatchValue = replace(@MatchValue, Name, '')
From MatchCodeRemoveChar
return @MatchValue
END
go
select dbo.getMatchCode('M<A>T!C##H&C,O''D<>!#&,E')
Signature is NULL
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply