Querying Postcodes gracefully

Question

Querying Postcodes gracefully

Thom A

SSC Guru

Points: 99133
More actions
May 3, 2017 at 9:07 am

#342752

Recently I've been getting an influx of requests asking me if I can find clients on our system that have a postcode found in a provided list (normally an Excel Document, however, could be our own). Although this is a fairly "simple" question, considering the volume of data that is sometimes provided, or range of customer tables I have to check, this can sometimes be quite the slow process from a query end (it really doesn't run fast sometimes).
The main problem, at least I believe, is down to the formatting of a UK postcode. They SHOULD be in one of the following formats:
X0 0XX
XX0 0XX
X00 0XX
XX00 0XX
X0X 0XX
XX0X 0XX
The problem is, both in our customer tables and often the data I'm supplied, the space could be missing. So, say I'm provided the postcode AB12 3CD. I would need to check for AB12 3CD and AB123CD. However, because I don't know how reliable the source data is, the postcode could have been supplied in the format AB123CD (which would also need to match AB12 3CD and AB123CD).
Now, I'm all too aware that doing the following is just bad practice and slow:
REPLACE(Customer.Postcode,' ','') = REPLACE(Data.Postcode,' ','')
At the moment, I've therefore gone with something along the lines of:
Customer.Postcode IN (Data.Postcode,REPLACE(Data.Postcode,' ',''),STUFF(Data.Postcode, LEN(Data.Postcode) - 2,0, ' ')))
This, on the current work I did, had a significant improvement (5% cost compared to 95% cost in the same batch), as I could use the Index on the Customer Table, however, I still know the data on the right side of the JOIN isn't sargable. For the data I worked on today, that wasn't a massive problem (I only had 300 odd postcodes to compare to our Customer database). On the other hand I know all too well I'm going to get a bigger list at some point again.
Now, yes I could set up a staging table, I have considered this, but some of data could be held within our own customer data tables, so I'll simply be comparing two tables (rather than an external source which I can manipulate). In the above SQL, this would mean i could use the index on the postcode for the left side of the join, but not the index on the right side, so could be quite slow. Thus my question; is there, apart from putting the data in a staging table (and reformatting it), a more graceful way? If I can't trust either side of the data (ours or external), that means that I may need to stage both tables, which depending on the volume of data, could be a very costly process unto itself. I'm not sure what gains that would have (if any?).
I can also imagine that many of you may well ask me something along the lines of "Well, why not fix your data in the database first, and then ensure that formatting is preserved". Unfortunately our database is effective a read only environment. The data all comes from what is likely a Flat File Mainframe system that replicates the data into a SQL database. That process is one way, updates on the mainframe go to SQL, however, updates to SQL do not go back to the Mainframe. Thus amended the data wouldn't help, as all those changes would be lost when either the record is updated, or the entire database is refreshed (which happens at least once a week).
If I haven't been clear, or you have any questions let me know. I'm really looking for insight, rather than someone to do the work for me, and any pointers you might have.
Thanks all!
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply

Ant-Green SSC Guru Points: 113507 More actions · Answer 1

My view on this is to create a computed column on the table in question which removes all the spaces and then stick an index on top of that column. Then you can stage your excel sheet with spaces removed and join on the two space less post code columns. Saves worrying about sargability, the only issue comes with if you can modify the schema on the SQL side without the mainframe having a fit when it tries to update the data.

Luis Cazares SSC Guru Points: 183706 More actions · Answer 2

What about using an indexed computed column and simply query without spaces? Obviously that takes more storage space, but it might work.

EDIT: Seems like Anthony had the same idea and posted before I did. 🙂

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Thom A SSC Guru Points: 99133 More actions · Answer 3

anthony.green - Wednesday, May 3, 2017 9:23 AM
the only issue comes with if you can modify the schema on the SQL side without the mainframe having a fit when it tries to update the data.

Heh, unfortunately I would be very surprised if the server had a fit.

@luis:
A definite consideration. Fortunately, the client table is unlikely to be changed, so provided we do this after the BULK INSERT, it'll remain there. Some tables, however, we're not so lucky on. If it wants to update a column, the back end is sometimes more than happy to drop any Views and tables it needs to and then recreate them and bulk insert the data back in. Yeah, really helpful when you did have custom indexes on that table and especially as it's a LIVE database...

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

Jason A. Long SSC-Insane Points: 23786 More actions · Answer 4

Would something like this work for you?
WITH cte_PostalVariations AS ( SELECT DISTINCT -- only use DISTINCT if there is a possibility of dupes in the source request. PCV = LTRIM(STUFF(pc.PostalCode, sl.StuffLocal, 0, ' ')) FROM ( VALUES ('AB12 3CD'), ('AB123CD'), ('AB213CD'), ('1234567'), ('98 12345') -- copied from source of request. ) pc (PostalCode) CROSS APPLY (VALUES (REPLACE(pc.PostalCode, ' ', '')) ) r (PostalCode) CROSS APPLY ( VALUES (1), (3), (4), (5)) sl (StuffLocal) ) SELECT c.* FROM dbo.Customers c JOIN cte_PostalVariations pv ON c.PostalCode = pv.PCV;

Just a thought...

ScottPletcher SSC Guru Points: 101118 More actions · Answer 5

Change your table to never store the space in the post code column (you can use a trigger to do this, at least until you change the app; you can add a tinyint column to indicate the original location of a space if you need to).

Then you can simply remove all spaces from search values before doing the search.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Thom A SSC Guru Points: 99133 More actions · Answer 6

ScottPletcher - Wednesday, May 3, 2017 4:05 PM
Change your table to never store the space in the post code column (you can use a trigger to do this, at least until you change the app; you can add a tinyint column to indicate the original location of a space if you need to).
Then you can simply remove all spaces from search values before doing the search.

This, unfortunately, would not work, as the tables are dropped when a BULK INSERT is performed (weekly). Indexes, triggers, etc, are all created after data insertion. Plus, not sure I like the idea of the database not reflecting the application front end/Mainframe back end.

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

Ed Wagner SSC Guru Points: 287026 More actions · Answer 7

Okay Thom, if the load process drops and rebuilds tables when you fire your BULK INSERT, how about doing an ALTER TABLE to add the persisted computed column after you load it? I'm not sure if this will fit into your load process or not.

Other than that, I'd say to go with a staging table. I don't load data without first loading it to a staging table for cleaning and verification. Then you can have your persisted computed column and index it to your heart's content.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

michael.shires SSC Enthusiast Points: 134 More actions · Answer 8

First attempt at a post here, so please bear with me. I can't claim to have written this, but have adapted for use elsewhere, but one of our devs created a UD function to format an incoming postcode.

The function is
ALTER FUNCTION [dbo].[uf_FormatPostcode]


    (@Postcode VARCHAR(30))

RETURNS VARCHAR(30)
AS
BEGIN

    DECLARE @Return VARCHAR(30)

    SELECT @Return =
CASE WHEN LEN(LTRIM(RTRIM(@Postcode))) BETWEEN 5 AND 7
         THEN
            CASE WHEN
                PATINDEX('[a-z][1-9][0-9][a-z][a-z]',LTRIM(RTRIM(@Postcode))) = 1 OR
                PATINDEX('[a-z][1-9][0-9][0-9][a-z][a-z]',LTRIM(RTRIM(@Postcode))) = 1 OR
                PATINDEX('[a-z][a-z][1-9][0-9][a-z][a-z]',LTRIM(RTRIM(@Postcode))) = 1 OR
                PATINDEX('[a-z][a-z][1-9][0-9][0-9][a-z][a-z]',LTRIM(RTRIM(@Postcode))) = 1 OR
                PATINDEX('[a-z][1-9][a-z][0-9][a-z][a-z]',LTRIM(RTRIM(@Postcode))) = 1 OR
                PATINDEX('[a-z][a-z][1-9][a-z][0-9][a-z][a-z]',LTRIM(RTRIM(@Postcode))) = 1 OR
                PATINDEX('[a-z][1-9][1-9][a-z][a-z]',LTRIM(RTRIM(@Postcode))) = 1
            THEN
                LEFT(LTRIM(RTRIM(@Postcode)),LEN(LTRIM(RTRIM(@Postcode)))-3) + ' ' + RIGHT(LTRIM(RTRIM(@Postcode)),3)
            ELSE
                @Postcode
            END
        ELSE
            @Postcode
END

RETURN @Return

END

I insert the initial table into a #table with the postcode correctly formatted and apply an index - and then use this as the matching table.

Regards
Mike

Sergiy SSC Guru Points: 110208 More actions · Answer 9

Thom,

I'd suggest to follow the advice from Scott.

Except, may be, the insistence on removing spaces.

You actually do not need to that, but you must make sure you bring all post codes coming from all possible sources to *consistent* formatting.

Whatever string is entered into the "search by post code" field you need to format it first according to the rule applied on the mainframe, and only after that use it in a query.

_____________
Code for TallyGenerator

Ken McKelvey SSCoach Points: 18854 More actions · Answer 10

To aid searching, I would look at left justifying the area and right justifying the district.
You will probably need to take into account partial postcodes and mix ups between 0(zero)/capital o and 1(one)/capital i.
Something like the following should get you started:
CREATE FUNCTION dbo.SearchPostcode ( @Postcode varchar(8) ) RETURNS TABLE AS RETURN ( WITH Patterns AS ( SELECT * FROM ( VALUES ('[A-Z][0-9][0-9][A-Z][A-Z]', 1, 2, 1, 3) ,('[A-Z][A-Z][0-9][0-9][A-Z][A-Z]', 2, 3, 1, 3) ,('[A-Z][0-9][0-9][0-9][A-Z][A-Z]', 1, 2, 2, 3) ,('[A-Z][A-Z][0-9][0-9][0-9][A-Z][A-Z]', 2, 3, 2, 3) ,('[A-Z][0-9][A-Z][0-9][A-Z][A-Z]', 1, 2, 1, 4) ,('[A-Z][A-Z][0-9][A-Z][0-9][A-Z][A-Z]', 2, 3, 1, 4) ) P (Pattern, AreaLength, DistrictStart, DistrictLength, Rest) ) SELECT LEFT(LEFT(X.PostCode, P.AreaLength) + '_', 2) + RIGHT('_' + SUBSTRING(X.PostCode, P.DistrictStart, P.DistrictLength), 2) + RIGHT('_' + RIGHT(X.PostCode, P.Rest), 4) AS SearchCode FROM Patterns P CROSS JOIN (SELECT REPLACE(@Postcode, ' ', '')) X (PostCode) WHERE X.Postcode LIKE P.Pattern );

You should then be able to check the results with something like:
SELECT * FROM YourTable Y CROSS APPLY (SELECT SearchCode FROM dbo.SearchPostcode(Y.Postcode)) X;

ScottPletcher SSC Guru Points: 101118 More actions · Answer 11

Thom A - Wednesday, May 3, 2017 4:39 PM
ScottPletcher - Wednesday, May 3, 2017 4:05 PM
Change your table to never store the space in the post code column (you can use a trigger to do this, at least until you change the app; you can add a tinyint column to indicate the original location of a space if you need to).
Then you can simply remove all spaces from search values before doing the search.
This, unfortunately, would not work, as the tables are dropped when a BULK INSERT is performed (weekly). Indexes, triggers, etc, are all created after data insertion. Plus, not sure I like the idea of the database not reflecting the application front end/Mainframe back end.

Formatting characters are generally left out of database data, as they just waste space. For example, as you know, date/datetime does not internally store any of the formatting chars. U.S. SSNs should be stored as 000000000 not 000-00-0000.

Equally importantly, the code then becomes a very simple column = REPLACE(input_value, ' ', '') comparison, easy to write and fully sargable.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Thom A SSC Guru Points: 99133 More actions · Answer 12

Thanks guys for all the replies. I'm experimenting with the indexed computed column, as this is probably the most viable option for me. What I'm trying to overcome now is getting that to appear in the views we have (which are much more end friendly, for reasons I'm not going to dive into). That, however, is definitely a "feature" of the 3rd party software we use, but I'm sure I'll (eventually), find a solution.

I have, on a side note, also built a function that splits the postcode into the possible permutations, using my own and some ideas others have given me above, which works fairly well as well. It's nice to now have a couple of options and use the other when one seems to perform slower than I'd like. 🙂

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

Thom A SSC Guru Points: 99133 More actions · Answer 13

ScottPletcher - Thursday, May 4, 2017 7:51 AM
Formatting characters are generally left out of database data, as they just waste space. For example, as you know, date/datetime does not internally store any of the formatting chars. U.S. SSNs should be stored as 000000000 not 000-00-0000.
Equally importantly, the code then becomes a very simple column = REPLACE(input_value, ' ', '') comparison, easy to write and fully sargable.

Although I agree, this would be easier to store the postcodes without the sapce, and put it as a formatting process on the front end, this isn't something I have control over. Third party application, third party database.

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk