March 27, 2008 at 4:12 pm
Ok, I'm able to get a good translation of data that is all upper case to title case ("THE WORLD" to "The World"). Problem is, some of the upper case values have acronyms in them, and I need to set them back to all upper case after the title case conversion happens. The code I started to use to do this is:
create function titleCase(@Text as varchar(8000))
returns varchar(8000)
as
begin
declare @Reset bit;
declare @Ret varchar(8000);
declare @i int;
declare @C char(1);
select @Reset = 1, @i=1, @Ret = '';
while (@i <= len(@Text))
select @C= substring(@Text,@i,1),
@Ret = @Ret + case when @Reset=1 then UPPER(@c) else LOWER(@c) end,
@Ret = replace(@Ret, '_', ' '),
@Ret = replace(@Ret, 'Rsms', 'RSMS'),
@Ret = replace(@Ret, 'Aip', 'AIP'),
@Ret = replace(@Ret, 'Arp', 'ARP'),
@Ret = replace(@Ret, 'Av ', 'AV '),
@Ret = replace(@Ret, 'Cnrp', 'CNRP'),
@Ret = replace(@Ret, 'Dnl', 'DNL'),
@Ret = replace(@Ret, 'Gps', 'GPS'),
@Ret = replace(@Ret, 'Mc', 'MC'),
@Ret = replace(@Ret, 'Vars', 'VARS'),
@Ret = replace(@Ret, 'Nvlap', 'NVLAP'),
@Ret = replace(@Ret, 'Nro', 'NRO'),
@Ret = replace(@Ret, 'Oo', 'OO'),
@Ret = replace(@Ret, 'Wds', 'WDS'),
@Ret = replace(@Ret, 'Qc', 'QC'),
@Ret = replace(@Ret, 'Rfi', 'RFI'),
@Ret = replace(@Ret, ' Ca ', ' CA '),
@Reset = case when @C like '[a-zA-Z]' then 0 else 1 end,
@i = @i +1
return @Ret
end
go
However, this seems a little too static in case an acronym is added, etc. So what I did is create an acronym table ("METADATA_ACRONYM") and I want to read two values from it, the bad capitalization and the correct acronym capitalization (Gps to GPS, etc.) I'm trying to figure out how to replace the static code above with an iterative process that happens each time. Any suggestions? Was this a detailed enough request, or is more info needed. Thanks in advance for any suggestions.
Brad
March 31, 2008 at 12:40 pm
Since you're using the replace command (and I can't think of another way to do what you need), I think your best bet is a cursor to step through the accronyms table and run a replace on each one.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
March 31, 2008 at 1:11 pm
This is one of those cases where CLR would likely be substantially more efficient than the loop. First - you could leverage the "Proper Case" syntax, so that you don't have to do this "by hand". And then - apply the replacements in order. I'd recommend keeping the replacements in a table, and passing them to the CLR either as XML (not bad), or a delimited list (probably better).
the CLR function would then use the delimited list to apply the replacements in sequence.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
March 31, 2008 at 1:47 pm
Hmmm...ok, I understand what you are saying in concept, but my experience here is probably more than beginner but less than intermediate, so I need more guidance on how you would implement what you suggest. I understand the above cursor suggestion a little better, and have actually played with implementing it that way, but not completely successful yet. Either way I might see what other guidance you can provide. 😐
Brad
March 31, 2008 at 2:03 pm
I can't speak for the CLR implementation Matt is suggesting, but the cursor would look something like:
declare C1 cursor local fast_forward for
select Wrong, Right
from Metadata_Acronyms
open c1
declare @sql nvarchar(1000), @Wrong nvarchar(100), @Right nvarchar(100)
fetch next from c1
into @wrong, @right
while @@fetch_status > 0
begin
select @sql = n'update dbo.Table set Field = replace(field, @wrong, @right)'
exec sp_executesql @sql, n'@wrong nvarchar(100), @right nvarchar(100)', @wrong, @right
fetch next from c1
into @wrong, @right
end
(I haven't written a lot of dynamic SQL, but I think this will work.)
To write a CLR, you'll need to write it in on of the .NET languages (VB.NET, C#.NET, etc.). I'm in the process of teaching myself how to do that, but can't help you on that solution yet.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
March 31, 2008 at 2:15 pm
Thanks, I will give that a shot and let you know the results.
On a similar theme...is it possible to run spell check of some sort on the data in a column? I can look manually through all 3992+ records to determine where acronyms may have been mis-capitalized, but if I could spell check things it would immediately help me focus on the acronyms.
Thoughts?
Brad
March 31, 2008 at 2:34 pm
I'm sure it's possible, but I've never seen anything that would do it.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
March 31, 2008 at 2:57 pm
This would be the CLR user-defined function you'd need:
Imports System
Imports System.Data
Imports System.Data.SqlClient
Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server
Imports System.Text.RegularExpressions
Partial Public Class UserDefinedFunctions
<Microsoft.SqlServer.Server.SqlFunction(IsDeterministic:=True, IsPrecise:=True)> _
Public Shared Function RegexReplace(ByVal input As SqlChars, ByVal pattern As SqlString, ByVal rep As SqlString) As SqlString
' Add your code here
Dim rex As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex(pattern.Value)
If input.IsNull Then
Return New SqlString(SqlString.Null.Value)
Else
Return New SqlString(rex.Replace(New String(input.Value), rep.Value))
End If
End Function
<Microsoft.SqlServer.Server.SqlFunction(IsDeterministic:=True, IsPrecise:=True)> _
Public Shared Function MyCustomCase(ByVal input As SqlChars, ByVal Keeplist As SqlString, ByVal separator As SqlString) As SqlString
Dim tmp As String
Dim i As Integer
Dim rex_inner As System.Text.RegularExpressions.Regex
Dim keeps As String()
Dim keepone As String
Dim rex As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex(separator.Value)
keeps = rex.Split(Keeplist.Value)
tmp = StrConv(New String(input.Value), VbStrConv.ProperCase) & " "
For i = 0 To keeps.GetUpperBound(0)
If Len(keeps(i)) > 0 Then
keepone = "\s" & keeps(i) & "\s"
rex_inner = New System.Text.RegularExpressions.Regex(keepone, RegexOptions.IgnoreCase)
tmp = rex_inner.Replace(tmp, " " & StrConv(keeps(i), VbStrConv.Uppercase) & " ")
End If
Next
Return Left(tmp, tmp.Length - 1)
End Function
End Class
At that point, once you deploy that to your server, you could use it like this:
declare @teststring varchar(2000)
set @teststring=' this is the test of the emergency IBM broadcast system. gm gmbastor lotus';
declare @keeplist varchar(2000)
set @keeplist='IBM||GM||LOTUS||';
declare @separator varchar(5)
set @separator='\|\|';
select dbo.mycustomCase(@teststring,@keeplist,@separator)
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
March 31, 2008 at 3:00 pm
Spell-checking is not available i think. It involves something interactive on a .NET control (the CLR we're talking here is not interactive and not user-control-based.)
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
March 31, 2008 at 3:02 pm
As far as "building" the list you'd want for the "Acronyms to keep" list - use any one of the solutions identified here:
http://www.sqlservercentral.com/articles/Test+Data/61572/[/url]
Since I assume you want to run this against a table, the "Keep list" should be built once, and then used after that. Cycling on each character and replacement pattern for each row is going to be REALLY slow.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
March 31, 2008 at 5:02 pm
... and, no matter how you shape it, a cursor or While Loop is going to kill performance.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 31, 2008 at 9:46 pm
Jeff Moden (3/31/2008)
... and, no matter how you shape it, a cursor or While Loop is going to kill performance.
And just to prove that:
Here's the test data:
drop table matt1
create table matt1(
id int identity(1,1) primary key clustered,
strlong varchar(1000),
strlongproper varchar(1000))
go
insert matt1(strlong)
select top 10000
replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20) +' '+
replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+
replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+
replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+
replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+
replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+' apple '+
replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+'ibm '+
replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+'lotus '+
replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+
replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+
replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+
replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+
replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+
replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+'lotus'
from
sys.all_columns sc1,
sys.all_columns sc2
drop table matt2
select * --make a copy for the second test
into matt2
from matt1
Now - the test script:
set nocount on
--====Some Declarations
declare @g datetime;
declare @keeplist varchar(100);
declare @term varchar(6);
set @term='\|\|';
set @keeplist='IBM||Lotus||apple||gm||GE';
--==Set up the test header
print replicate('=',60)
print 'Proper+regex'
print replicate('=',60)
set @g=getdate();
update matt1
set strlongproper=dbo.MyCustomCase(strlong,@keeplist,@term);
--print the time
print datediff(ms,@g,getdate());
go
--====Test #2
--====Some Declarations
declare @Reset bit;
declare @text varchar(1000);
declare @Ret varchar(1000);
declare @id int;
declare @i int;
declare @C char(1);
declare @g datetime;
print replicate('=',60)
print 'Cursor loop'
print replicate('=',60)
set @g=getdate();
--===The Cursor
declare curse cursor local fast_forward for
select id, strlong from matt2
open curse
fetch next from curse into @id,@text;
--===The While Loop
While @@fetch_status=0
BEGIN
select @Reset = 1, @i=1, @Ret = '';
while (@i <= len(@Text))
BEGIN
select @C= substring(@Text,@i,1),
@Ret = @Ret + case when @Reset=1 then UPPER(@c) else LOWER(@c) end,
@Reset = case when @C like '[a-zA-Z]' then 0 else 1 end,
@i = @i +1
--EDIT - moved the second set of replaces out of the loop - only need to run one time
END
--DONE EDIT - moved the second set of replaces out of the loop - only need to run one time
select
@Ret = replace(@Ret, '_', ' '),
@Ret = replace(@Ret, ' Lotus ', ' LOTUS '),
@Ret = replace(@Ret, ' Ibm ', ' IBM '),
@Ret = replace(@Ret, ' Apple ', 'APPLE '),
@Ret = replace(@Ret, ' Gm ', ' GM '),
@Ret = replace(@Ret, ' Ge ', ' GE ')
--stored the value we just created
update matt2
set strlongproper=@ret
where ID=@id
fetch next from curse into @id,@text;
END
close curse
deallocate curse
print datediff(ms,@g,getdate());
set nocount off
And the results:
============================================================
Proper+regex
============================================================
4486
============================================================
Cursor loop
============================================================
215793
By the way - that 48 times faster, and that's only at 10,000 rows. The bigger the set, the slower the cursor seems to get.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 1, 2008 at 9:23 am
Thanks, guys. I can see that the cursor solution will be slower, so I will work on implementing it the CLR way. I'll give a shout if I have other questions regarding anything there. I really appreciate your taking time to respond!
Brad
April 1, 2008 at 11:37 am
Greetings all... I had an issue that came up some time ago in my SS 2000 box. I was able to find an extended function that allowed me to use a regular expression to find a text sequence in a string and remove it.
Too bad it can't be ported over to SS 2005. As someone else has pointed out it isn't the fastest phase of the job but it does work.
More specifically I have email address in the body of text that has to be removed. Process takes about 1 hr per 3000 rows of data.
Kurt W. Zimmerman
SR DBA
Lefrak Organization
New York, NY
http://www.linkedin.com/in/kurtwzimmerman
April 1, 2008 at 11:44 am
Kurt W. Zimmerman (4/1/2008)
Greetings all... I had an issue that came up some time ago in my SS 2000 box. I was able to find an extended function that allowed me to use a regular expression to find a text sequence in a string and remove it.Too bad it can't be ported over to SS 2005. As someone else has pointed out it isn't the fastest phase of the job but it does work.
More specifically I have email address in the body of text that has to be removed. Process takes about 1 hr per 3000 rows of data.
Kurt - that piece of code above DOES use built-in Regex functionality (it's in the CLR function). Regex used that way is truly fast in 2005 (at least 3-5 times faster than the XP version from 2000 from what I've seen). And it blows the doors off of the cursor. We ran a test like that previously and 3000 e-mail addresses would be done sub-second. Depending on how big your "body of text" is, I truly doubt it would take 30 seconds, let alone a minute.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
Viewing 15 posts - 1 through 14 (of 14 total)
You must be logged in to reply to this topic. Login to reply