March 20, 2008 at 10:45 am
Sergei Zarembo (3/20/2008)
If you go via a CLR regex function to render "235 dogs, 5 cats" to "2355" then [considering u r doing millions of recs] this a bit more advanced replace would probably give you better performance:regex.replace("235 dogs, 5 cats", @"\D+","")
--returns: "2355"
it allows the .NET regex engine to roll from Nth to (N+1)th occurrence of a digit [\d] without stopping for a replace, capture all the \D (non-digits) and replace them at once with an empty str.
Give the man a cigar! Very good thought, and borne out in testing. It shaves off another 10 seconds.
Same function, just better replacement criteria. Very nice.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
March 20, 2008 at 10:55 am
;)...for your regex needs, consider the best regex site:
regexadvice.com
We've got a regex construction forum there, including .NET, C# and VB versions.
everyone is welcome to stop by.
April 20, 2008 at 10:33 pm
Sorry guys, I lost track of this one... what ended up winning the race on the 4 million row test? RegEx or one of the functions?
--Jeff Moden
Change is inevitable... Change for the better is not.
April 21, 2008 at 8:04 am
The regex won if the pattern had text in between separate numbers, and you wanted that simply removed, the last function I proposed won if not.
The example I used was "234 dogs, 5 cats". The regex would get you "2345" as the number, while my last function would get "234 dogs, 5". Since neither of those results makes any sense in any real-world situation, I don't think that part matters.
On the other hand, complex numbers like "1.234e-10", I'm not sure what the regex would do, but my function would leave that intact. Mine would destroy "X^5", leaving only the "5". Again, not sure what the regex would do. Haven't tried it.
As always, it depends on what data you're trying to clean up. But in cases where it will work, the last version of the string function was definitely much faster.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 21, 2008 at 8:27 am
GSquared (4/21/2008)
The regex won if the pattern had text in between separate numbers, and you wanted that simply removed, the last function I proposed won if not.The example I used was "234 dogs, 5 cats". The regex would get you "2345" as the number, while my last function would get "234 dogs, 5". Since neither of those results makes any sense in any real-world situation, I don't think that part matters.
On the other hand, complex numbers like "1.234e-10", I'm not sure what the regex would do, but my function would leave that intact. Mine would destroy "X^5", leaving only the "5". Again, not sure what the regex would do. Haven't tried it.
As always, it depends on what data you're trying to clean up. But in cases where it will work, the last version of the string function was definitely much faster.
Correct - if you don't have advanced knowledge of any pattern - then regex takes over handily. Advanced knowledge that there is only one contiguous numeric range helps tilt that advantage to a T-SQL solution.
As to scientific notation, or formula detection - not sure we took that one on. The initial issue simply involved pulling out nothing but the digits (no decimal places, signs, etc...).
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 21, 2008 at 8:53 am
Yeah, telling a regex to keep an "e" or "E" if it's after a number and before another number or a "+" or "-" which is followed by another number, but to remove it otherwise, is a bit of a trick. I'm sure it can be done (I'm NOT a regex designer), but it wasn't part of this request. I segued from "just keep the digits" to more complex situations.
For the purpose of the original post, I think the regex function will do what's needed best, since it will deal with "(123) 456-7890" correctly. Though, for phone numbers, extension numbers will break the regex unless you make sure to truncate after the right number of digits. You'd have to also make sure it could deal with a variable for that, so that it handles "1 (234) 567-8901" if the preceding "1" is included in the field, or "(234) 567-8901" if it's not. Otherwise, you'll corrupt data like "1-234-567-8901 ext 2345", and end up with an undialable number, "1234567890" (left 10), or "123456789012345".
Either way, use the regex to turn it into just numbers, or expand the regex to deal with phone number formats, or after you turn it into just numbers, check for known phone number issues like leading "1" and more than 10 digits. (At least, that will work for US phones. Will be different in other countries.)
That part could easily be set-based.
update table set phone = right(phone, len(phone) - 1) where phone like '1%'
update table set extension= right(phone, len(phone) - 10) where len(phone) > 10
update table set phone = left(phone, 10)
Something like that, run after the regex, would get the desired end result of phone numbers in a standardized format, with extension stored in a separate column.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 22, 2008 at 7:43 pm
So, in the million and 4 million row tests, what does your table and data look like? Are you just making a table with a Primary Key and a VARCHAR column to contain "235 dogs, 5 cats"?
--Jeff Moden
Change is inevitable... Change for the better is not.
April 22, 2008 at 8:17 pm
Never mind... I got it from one of the posts above... just to make sure, you're using this from one of Gus' posts?
create table NumberClean (
Number bigint,
Clean varchar(100))
go
set nocount on
set statistics time OFF
insert into dbo.numberclean(number, clean)
select 100, '100 ' union all
select 5, 'number 5' union all
select 15, '643' union all
select 1234567890123456, '1234567890123456 pounds'
go 1000000
--Jeff Moden
Change is inevitable... Change for the better is not.
April 22, 2008 at 9:38 pm
Jeff Moden (4/22/2008)
Never mind... I got it from one of the posts above... just to make sure, you're using this from one of Gus' posts?
create table NumberClean (
Number bigint,
Clean varchar(100))
go
set nocount on
set statistics time OFF
insert into dbo.numberclean(number, clean)
select 100, '100 ' union all
select 5, 'number 5' union all
select 15, '643' union all
select 1234567890123456, '1234567890123456 pounds'
go 1000000
I can' t find what I used to build my table (started with the table def above, but rebuilt it once of twice) - right now, looks like 2 newID()'s cast as varchar's and smacked together....
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 22, 2008 at 10:34 pm
Thanks... I'll use Gus', then. I'll also throw in the dog/cat thing.
Just in some preliminary testing on 4Meg rows, I've found that the SET STATISTICS TIME ON command really makes things go haywire. One test I did without it took 3:24... ran the exact same code snippet with it on and stopped the query after 13 minutes. Guess I'm going to have to upgrade from sp2 to the latest cumulitive update...
Also, the following code makes the same test table as what Gus built, but a wee bit faster...
[font="Courier New"]--=======================================================================================
-- Create the 4 million row test table
-- 20 seconds to create, 31 seconds to create CLUSTERED PK
--=======================================================================================
--DROP TABLE dbo.NumberClean
--===== Create and populate the table on the fly
;WITH
cteDataBlock AS
(
SELECT CAST(100 AS BIGINT) AS Number, CAST('100 ' AS VARCHAR(100)) AS Clean UNION ALL
SELECT 5, 'Number 5' UNION ALL
SELECT 15, '643' UNION ALL
SELECT 1234567890123456, '1234567890123456 Pounds'
)
SELECT TOP 4000000 IDENTITY (INT,1,1) AS ID,db.Number,db.Clean
INTO dbo.NumberClean
FROM cteDataBlock db,
Master.sys.All_Objects ao1,
Master.sys.All_Objects ao2
--===== Add the clustered PK
ALTER TABLE dbo.NumberClean
ADD CONSTRAINT PK_NumberClean_ID
PRIMARY KEY CLUSTERED (ID)[/font]
--Jeff Moden
Change is inevitable... Change for the better is not.
April 23, 2008 at 5:39 pm
GSquared (4/21/2008)
Yeah, telling a regex to keep an "e" or "E" if it's after a number and before another number or a "+" or "-" which is followed by another number, but to remove it otherwise, is a bit of a trick. I'm sure it can be done (I'm NOT a regex designer), but it wasn't part of this request. I segued from "just keep the digits" to more complex situations.
the logic can be implemented by the following regex match pattern:
(?<!\d)[eE](?![+-]?\d)
when run vs this input text:
<
1.25e-10; 123break5
Yeah, telling a regex to keep
an "e" or "E" if it's after a number and before another number or a "+" or "-" which is
followed by another number, but to remove it otherwise, is a bit of a tric
>
..with subsequent replacement of the matches by an empty string will give this result
<
1.25e-10; 123brak5
Yah, tlling a rgx to kp
an "" or "" if it's aftr a numbr and bfor anothr numbr or a "+" or "-" which is
followd by anothr numbr, but to rmov it othrwis, is a bit of a trick.
>
..keeping the mantissa intact. Construct (?<!\d) , called a look-behind, is telling the regex engine NOT to match on the next char IF it's preceded by a digit. Logically {AND} combined with similarly-functioning look-ahead (?![+-]?\d) {DON"T match IF a signed or unsigned digit follows} allows to avoid matching on *E* in hopefully most of the relevant cases. Depends on the input of course.
April 24, 2008 at 2:05 pm
Now, can that be added into Matt's regex easily, to deal with removing all other non-numeric symbols, but leaving complex numbers intact?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 24, 2008 at 3:22 pm
GSquared (4/24/2008)
Now, can that be added into Matt's regex easily, to deal with removing all other non-numeric symbols, but leaving complex numbers intact?
GSquared,
please tell me which Matt's regex you have in mind? I scrolled down the posts but could not find it.
Thanks,
Sergei Z
April 25, 2008 at 6:49 am
I don't think he actually posted the code for this one.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 25, 2008 at 6:56 am
Sergei Zarembo (4/24/2008)
GSquared (4/24/2008)
Now, can that be added into Matt's regex easily, to deal with removing all other non-numeric symbols, but leaving complex numbers intact?GSquared,
please tell me which Matt's regex you have in mind? I scrolled down the posts but could not find it.
Thanks,
Sergei Z
I'm thinking GSquared is actually referring to the Regex expression you came up with
dbo.regexreplace('235 dogs, 5 cats', '(\D)+','')
the function is simply a wrapper to get to the CLR Regex functionality.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
Viewing 15 posts - 31 through 45 (of 81 total)
You must be logged in to reply to this topic. Login to reply