January 25, 2012 at 9:58 am
I have an import process where we receive US Zip and Canadian Postal Codes in the same field. We use the left most character to determine the region in which that record belongs.
The problem is that we receive files that contain erroneous values because the databases under which they were entered have no validation (probably AS400 systems) and the address information is corrected and handled external from the organization to which sends us the file. In other words there is no way for us to correct this problem at the source, we have to live with receiving garbage data. We import many disparate files into one nvarchar/varchar table, which is where this function would work.
I wrote a function to clean the postal code/zip please tell me if there is a better way to write it.
I found details on how to format postal codes and zip codes from these two Wikipedia articles:
http://en.wikipedia.org/wiki/Postal_codes_in_Canada
http://en.wikipedia.org/wiki/ZIP_code
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- =============================================
-- Author:W McCarty
-- Create date: 1/23/2012
-- Description:This function will accept a Canadian Postal Code or US Zip and strip unwanted characters
-- and then check validity of the Code, and pass back a valid Code, or Null if none can be found
-- =============================================
Alter Function [dbo].[CleanPostalCode](@Temp VarChar(1000))
Returns VarChar(1000)
AS
Begin
-- Declare a temporary value that will be returned leaving the passed value intact
declare @RetStmt VarChar(1000)
-- Strip out non alpha-numeric characters
While PatIndex('%[^a-z0-9]%', @Temp) > 0
Set @Temp = Stuff(@Temp, PatIndex('%[^a-z0-9]%', @Temp), 1, '')
-- If there is a Canadian Postal Code in the cleaned string, then set the return statement to that code
If PatIndex('%[a-eghj-npr-tvxy][0-9][a-eghj-npr-tv-z][0-9][a-eghj-npr-tv-z][0-9]%', @Temp) >0
Begin
set @RetStmt = Substring(@Temp, PatIndex('%[a-eghj-npr-tvxy][0-9][a-eghj-npr-tv-z][0-9][a-eghj-npr-tv-z][0-9]%', @Temp),6)
End
-- Else look for a numeric string and try to pass back a valid US Zip Code
Else
Begin
-- Make a string for comparison
declare @numstr VarChar(1000)
declare @numindex int
set @numindex = 0
Set @numstr = '[0-9]'
-- Loop through the string to find a contiguous numeric string
While PatIndex('%' + replicate(@numstr, @numindex + 3 )+ '%', @Temp) > 0
Begin
set @numindex = @numindex + 1
End
-- If there was a numeric string that was 3 digits or longer then
if @numindex > 0
Begin
-- First set the @Temp variable so that it contains the result of the numeric search
set @Temp = substring(@Temp, PatIndex('%' + replicate(@numstr, @numindex + 2 )+ '%', @Temp),@numindex +2)
-- Then reset the @NumIndex so we can use it again (do this step after @temp reset)
set @numindex = 1
-- Only look at digits that taken as a whole are greater than 501 (digits trimmed of left zeros)
if cast(@temp as int) >= 501
Begin
-- Enter a loop until we find a digit string that is greater or equal to 501 and the number cannot be three characters in length
while substring(@temp,@numindex,5) < 501
set @numindex = @numindex + 1
-- Make the return statement the first position that the first five is greater than 501, and take nine digits from that location
set @retstmt = substring(@temp,@numindex,9)
-- Enter a loop if the return statement is less than 9 digits long and just add leading or trailing zeros
if len(@retstmt) < 5
begin
while len(@retstmt) < 5
Set @retstmt = '0' + @retstmt
end
if len(@retstmt) > 5 and len(@retstmt) < 9
begin
while len(@retstmt) < 9
Set @retstmt = @retstmt + '0'
end
End
Else
Begin
set @retstmt = Null
End
End
else
begin
set @retstmt = Null
end
End
Return @RetStmt
-- End of Function
End
GO
January 25, 2012 at 10:04 am
Before even looking at this, from the sounds of it, you are running this against a table of values, not a single value. Is this correct? If so, the first thing I would do is to move this entire process into either a Table Valued Function or a Stored Procedure to handle it as a set based operation rather than 1 record at a time, which will be very very very slow.
If I am incorrect, let me know and I will review the Scalar Function you presented.
Fraggle
January 25, 2012 at 10:13 am
Hello, thank you for replying! Yes we currently look at records one by on, because we receive inconsistent data files from many different sources, and these data files change their 'alignment', file type, and format often. So we have to rip the files into a single varchar staging table and then process and modify the data. We are redesigning our process of importing files, but for now this is the way it sits.
January 25, 2012 at 10:26 am
If I follow your description correctly, you are pulling data into a staging table, then running this on the staging table in order to get the postal code out of that.
Can you provide the structure (create script) for the table, and an insert statement that puts some sample data in it? A couple of dozen rows should suffice for testing purposes.
I think the whole thing can be massively simplified pretty easily, but I need sample data to test on.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 25, 2012 at 11:07 am
We import the disparate files into this table, row by row (with column headings from text files, xls, csv, etc)
[Code]
CREATE TABLE [dbo].[raw](
[tableID] [int] IDENTITY(1,1) NOT NULL,
[col001] [varchar](500) NULL,
[col002] [varchar](500) NULL,
[col003] [varchar](500) NULL,
[col004] [varchar](500) NULL,
[col005] [varchar](500) NULL,
[col006] [varchar](500) NULL,
[col007] [varchar](500) NULL,
[fileName] [varchar](500) NULL,
[loadDate] [varchar](50) NULL CONSTRAINT [DF_rawstage_loadDate_2] DEFAULT (getdate()),
[uniqueid] [uniqueidentifier] NULL CONSTRAINT [DF_rawstage_uniqueid] DEFAULT (newid())
) ON [PRIMARY]
[/Code]
I have attached sample data in the file raw.txt (CSV)
The data would get aligned and put into a table like this
CREATE TABLE [dbo].[stage](
[postal] [varchar](250) NULL,
[postalShipTo] [varchar](50) NULL,
[FileName] [varchar](500) NULL,
[loadDate] [smalldatetime] NULL,
[tableID] [int] IDENTITY(1,1) NOT NULL,
[uniqueid] [uniqueidentifier] NULL
) ON [PRIMARY]
The stage table then has other processes performed on it and eventually the data is moved into a final table that reports are generated from.
January 25, 2012 at 11:25 am
Okay, it looks like what you need is to parse out any rows where Col01, Col02, etc., have a postal code in them. It can be any one of those columns, and it'll be by itself in the column. Is that correct?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 25, 2012 at 11:31 am
Thank you for you reply. Yes we do have a process to parse out the locations of postal codes (which we also are refining) but anything that looks like a postal code is caught and then is 'cleaned'. If the thing that looks like a postal code is not a valid US or CDN code, we set it to NULL and move on. Where the code is missing or only one of bill and ship is there we leave it null or copy the existing to the null respectively.
Edit, the way we get the files, sometimes there is other garbage in the field, and where there is more that one piece of data, like address and postal code in one field we set it to null. So in other words most of the time postal code will be in only one field sometimes with extraneous characters. I.E. Zip 8950 shown as 8950USA.
January 25, 2012 at 11:41 am
Okay.
8950 isn't a valid Zip code, but I get the idea of what you're saying.
Are there potentially fields with numeric data in them that might mimic a Zip code, but not be one? Like a street address that starts with a 5-digit number, e.g.,: "98112 E 45th Ave, Podunk, TX". 98112 isn't a Zip code in this case, it's a street number, but it also happens to be a valid Zip code in Seattle, WA.
Is it possible for data to have that kind of thing in it?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 25, 2012 at 12:08 pm
Yes it is very possible for an address to be in the Zip field, but it does not happen all that often. The majority of the data is valid, around 99%. For those cases we just would want to take the number, as we have both the bill and ship to use.
The more likely scenario is for the data to just contain something in it that the person entering felt like putting in that day like 78155IHAVEACAT, or 78144USA to say that it is in the USA. Sometimes the data does have phone numbers as well, so for that we take whatever looks like a code and press.
At the moment we consider the US as a whole and therefore do not do much validation on if the actual code maps out to a correct address. When the left most character in the PCode we use is a number we consider it USA. I suppose we could then try to validate the postal against the address, city and state/province, we could have this running as a separate process after the postal is stripped out, perhaps as a process running on the stage table. In which we would parse out the Zip/PCode to its separate parts and compare it with a database of cities and states. But for now its not required
The way I count a valid US ZIP is that it is 3-9 digits in length, and is >= 501. The Canadian PCode is much simpler, however it is possible to have a code that is in the valid format but does not exist.
Normally if one of the bill or ship Pcodes is Canadian we would favor that one.
January 26, 2012 at 7:14 am
Okay.
Here's one way that might make this more efficient:
IF OBJECT_ID(N'tempdb..#raw') IS NOT NULL
DROP TABLE #raw ;
CREATE TABLE #raw
([tableID] [int] NOT NULL,
[col001] [varchar](500) NULL,
[col002] [varchar](500) NULL,
[col003] [varchar](500) NULL,
[col004] [varchar](500) NULL,
[col005] [varchar](500) NULL,
[col006] [varchar](500) NULL,
[col007] [varchar](500) NULL,
[fileName] [varchar](500) NULL,
[loadDate] [varchar](50) NULL
CONSTRAINT [DF_rawstage_loadDate_2] DEFAULT (GETDATE()),
[uniqueid] [uniqueidentifier] NULL
CONSTRAINT [DF_rawstage_uniqueid] DEFAULT (NEWID())) ;
INSERT INTO #raw
(tableID, col001, col002, col003, col004, col005, col006, col007, fileName, loadDate, uniqueid)
VALUES (1, 'Bill To Customer Postal Code', 'Ship To Customer Name', 'Prod Name', NULL, NULL, NULL, NULL,
'201112 File1.txt', 'Jan 25 2012 12:34PM', '61e9b0c9-4cb2-4048-a7af-c945fb1d107d'),
(2, 'V5J 5K3', 'ABC Company', 'Monitor', NULL, NULL, NULL, NULL, '201112 File1.txt', 'Jan 25 2012 12:34PM',
'1f2e8f4f-2c02-41e0-b4ef-8c710a084192'),
(3, '88451', 'AB1 Company', 'Monitor', NULL, NULL, NULL, NULL, '201112 File1.txt', 'Jan 25 2012 12:34PM',
'efa29c24-6799-4d40-a500-e5b0aac68a70'),
(3591, '1233prodcode', 'manfcode12356', '299.68', 'ON', '8845USA', 'L4W 1E1', NULL, '20111201File2.txt',
'Jan 25 2012 12:38PM', 'c5aaab9d-e7fd-4053-8ca5-f473caef3604'),
(3592, '1233prodcode', 'manfcode12356', '599.36', 'ON', '78458', 'M4G 4B5', NULL, '20111201File2.txt',
'Jan 25 2012 12:38PM', '8d03f757-161a-449c-aa20-4697a593f093'),
(3593, '1233prodcode', 'manfcode12356', '299.68', 'ON', 'L4B-4W3', 'L6M 2R7 {End of line Skype Highlighting}',
NULL, '20111201File2.txt', 'Jan 25 2012 12:38PM', '2a4543a0-b532-46c8-9748-705c486edbe5'),
(1175, 'Qty.', 'Unit price', 'Postal code', NULL, NULL, NULL, NULL, '20111201File3.txt', 'Jan 25 2012 12:36PM',
'f8682b0a-90ab-4cb2-8385-f1aac0b93f0b'),
(1176, '1', '89.99', 'V2S 3T2', NULL, NULL, NULL, NULL, '20111201File3.txt', 'Jan 25 2012 12:36PM',
'9314f670-badc-4a55-9144-7cb4bef6befa'),
(1177, '1', '93.99', 'V5H 2B1', NULL, NULL, NULL, NULL, '20111201File3.txt', 'Jan 25 2012 12:36PM',
'1a0cf2fa-1404-41ad-b38e-abda9c835133'),
(4432, 'Sold-to Province', 'Sold-to Code Postal', 'Take Slm Div', 'Invoice #', 'Invoice Date', 'Category',
'Ship-to Code Postal', '20111201File4.txt', 'Jan 25 2012 12:39PM', '6b1b0d6e-48e7-4f3e-8c1a-2c8c9e0fa7ad'),
(4433, 'ON', 'L1W3K1', '61-3', '3302254', '51097', 'M', 'Z1W3K1', '20111201File4.txt', 'Jan 25 2012 12:39PM',
'8389dccd-4607-48b3-b522-da98123ea8b8'),
(4434, 'PQ', 'G5C1C8', '92-2', '3302323', '51097', 'M', 'G5C1C8', '20111201File4.txt', 'Jan 25 2012 12:39PM',
'26bd6258-a525-4839-b675-d4eb0d5a6615'),
(4435, 'TX', 'G5C1C8', '101-2', '3302670', '51097', 'M', 'M1R2T5', '20111201File4.txt', 'Jan 25 2012 12:39PM',
'1817c7bf-0535-4038-aef4-c75fe2accb4b') ;
SELECT uniqueid,
PossibleZip
FROM #raw
CROSS APPLY (SELECT COALESCE(col001, '') + COALESCE(col002, '') + COALESCE(col003, '') + COALESCE(col004, '')
+ COALESCE(col005, '') + COALESCE(col006, '') + COALESCE(col007, '') AS Conc) AS Conc
OUTER APPLY (SELECT SUBSTRING(Conc, Number, 5) AS PossibleZip
FROM dbo.Numbers
WHERE SUBSTRING(Conc, Number, 5) LIKE '[0-9][0-9][0-9][0-9][0-9]'
UNION ALL
SELECT SUBSTRING(Conc, Number, 10) AS PossibleZip
FROM dbo.Numbers
WHERE SUBSTRING(Conc, Number, 10) LIKE '[0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
UNION ALL
SELECT SUBSTRING(Conc, Number, 7) AS PossibleZip
FROM dbo.Numbers
WHERE SUBSTRING(Conc, Number, 7) LIKE '[a-eghj-npr-tvxy][0-9][a-eghj-npr-tv-z][ [-]][0-9][a-eghj-npr-tv-z][0-9]'
UNION ALL
SELECT SUBSTRING(Conc, Number, 6) AS PossibleZip
FROM dbo.Numbers
WHERE SUBSTRING(Conc, Number, 6) LIKE '[a-eghj-npr-tvxy][0-9][a-eghj-npr-tv-z][0-9][a-eghj-npr-tv-z][0-9]')
AS Zips ;
The Zips sub-query just has a list of patterns that you would consider valid for a postal code. You can add "and not" statements to any of the Where clauses to exclude patterns that are a negative (false-positive) match.
You can add more patterns to it easily. Just follow the pattern of the queries in there.
Does that help?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 27, 2012 at 11:17 am
GSquared, thanks for the reply! I am testing your implementation, it is a much different concept than what I was using and I think it has solved a few other problems too!
One other question the dbo.Numbers table, do I just need the 0-9 numbers for the purposes of this query?
January 27, 2012 at 11:33 am
Sorry about that. I have a Numbers table that includes integer values from 0-10,000. Comes in handy for dozens of uses.
You probably only need a limited scope for this particular solution, but I recommend building one with a larger scope for future uses.
Should have mentioned that in the reply.
Edit: To work on this solution, you'll need numbers from 0 to the length of the longest row in your staging table.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
Viewing 12 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply