Join - that may not be possible

Question

Join - that may not be possible

danielfountain

SSCarpal Tunnel

Points: 4229
More actions
January 6, 2012 at 3:53 am

#391253

Hey all,
Well we have all been there. A database design and data that is pooorly designed and no way to change it. Well i have an odd one - and i am not sure its possible to fix. Here is an example:
CREATE TABLE #TempCodes (CodeRef VARCHAR(3), DESCRIPTION VARCHAR(10))
INSERT INTO #TempCodes VALUES ('C01','Red')
INSERT INTO #TempCodes VALUES ('C','Green')
INSERT INTO #TempCodes VALUES ('D9','purple')
INSERT INTO #TempCodes VALUES ('D','yellow')
INSERT INTO #TempCodes VALUES ('C08','black')
CREATE TABLE #TempData (Reference INT, codes VARCHAR(30))
INSERT INTO #TempData VALUES (1,'CO1,C & D9 10/10/2011')
INSERT INTO #TempData VALUES (2,'2011')
INSERT INTO #TempData VALUES (3,'C03,D')
INSERT INTO #TempData VALUES (4,'')
INSERT INTO #TempData VALUES (5,'D')
INSERT INTO #TempData VALUES (6,'C03/D/d9/c01/c')
SELECT *
FROM #TempData A
LEFT outer JOIN
#TempCodes B
ON
CHARINDEX (B.CodeRef,A.codes) <> 0
So effectively there is a free format field which has codes typed into it - and sometimes other stuff, sometimes wrong codes as well. Bascially if the code exists in that field then i need to join to the #tempcodes table on that.
So with my select at the bottom i have 2 issues.
1) Because charindex returns 0 for no match - if the code is the first thing in the list its not joining.
2) There are codes that have similar data so C and C01. because there is a C in C01 its matching. I dont want it too.
however because its free format field people will put comma separated, slash separated, semi colon, colon, space.... you name it.
The more i write the less possible it sounds. Anyone able to prove me wrong?
Thanks for reading my plight.
Dan

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply

John Mitchell-245523 SSC Guru Points: 148809 More actions · Answer 1

Dan

I feel your pain. If you don't have the option to change the database, do you have the option to change the application? That way, you could perform some basic validation on what goes into the database. It may also be worth performing a data cleansing exercise on your existing data, so that only one delimiter is used. Once you have all that, you can use a splitter function to split each row into its individual codes. Far from ideal, but better than what you have at the moment.

John

ChrisM@home SSC-Insane Points: 24260 More actions · Answer 2

Without specifically capturing dates, you might get some milage from this:

SELECT a.Reference, x1.codes, y.[DESCRIPTION]

FROM #TempData a

CROSS APPLY(

SELECT codes = REPLACE(REPLACE(REPLACE(a.codes,'& ',''),'/',','),' ',',')

) x1

OUTER APPLY (

SELECT b.[DESCRIPTION]

FROM #TempCodes B

CROSS APPLY dbo.DelimitedSplit8K (x1.codes,',') split

WHERE b.CodeRef = split.Item

) y

dbo.DelimitedSplit8K is a string splitter function devised by Jeff Moden et al, published here[/url].

Edit - added link to article.

[font="Arial"]^{Low-hanging fruit picker and defender of the moggies}[/font]

For better assistance in answering your questions, please read this[/url].

Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

sturner SSC-Insane Points: 22411 More actions · Answer 3

I would suggest doing it something like this:

SELECT *

FROM #TempData A

LEFT JOIN #TempCodes B

ON

A.codes like '%'+CodeRef'+'%'

I'd put any regex directives (ie '%') right in the CodeRef column so you don't have to tack the percent character at the beginning and end of the coderef column in the join. It would also allow you to be more sepecific, say the code has to be at the begging, or the end, or could be anywhere etc.

The probability of survival is inversely proportional to the angle of arrival.

matzer SSCrazy Eights Points: 9727 More actions · Answer 4

If replacing potential delimiters is sufficient, then there no need for splitting the field. Just replace potential delimiters with a unified delimiter and surround the code to search for with that delimiter, i.e.

; WITH Cte AS

(

SELECT

Reference, codes,

-- Replace any character (including spaces) that might separate a code with a '|'.

delimited_codes = '|' + REPLACE(REPLACE(REPLACE(REPLACE(codes, ' ', '|'), ',', '|'), '&', '|'), '/', '|') + '|'

FROM

#TempData

)

SELECT

*

FROM

Cte A

LEFT outer JOIN

#TempCodes B ON CHARINDEX ('|' + B.CodeRef + '|' ,A.delimited_codes) > 0

P.S. Am not quite sure what was meant with the first issue. Isn't there just a typo in the sample data. #TempCodes contains 'C03' (with a zero) while codes in the first row contains 'CO3' (with an Ohh).

danielfountain SSCarpal Tunnel Points: 4229 More actions · Answer 5

Thanks for all suggestions.

Unfortunately no i cant change front end. Only thing i have a chance to do is change what people type in - and you know how well that works.

In terms of the Like join - that doesnt unfortunately help. It still brings back the exactly same 12 records (not the ones right at the begining and still has the problem with the C01 and C).

I think my only chance may be to split it - i had concidered it, but i am concerned what end users might use to delimiter!

@,.!"£$%^&

that isnt me swearing - thats me thinking what they "could" do!! 🙂

Ta

Dan

Ben Teraberry SSChampion Points: 11845 More actions · Answer 6

For small amounts of data it's not a big deal, but any kind of string manipulation gets horrible really fast in T-SQL if you have any kind of volume of data. Is CLR an option for you? I would write a CLR that would parse out all the possible values from your garbage field ... just figure out everything that can split values like spaces slashes, ampersands, etc. Process that for each record inserted into a new one to many table. Then add a prioritization for the possible values or just figure out a simple way (such as alphabetical) to pick out a single one in case multiple values match. At that point you can simply join and end up with scalable performance.

If you can't do the CLR, you can still basically do the same thing with T-SQL, it just won't perform as well.

└> bt

Forum Etiquette: How to post data/code on a forum to get the best help[/url]

Paul White SSC Guru Points: 150467 More actions · Answer 7

This seems to work reasonably well:

-- Standard in-line numbers table

WITH

N1 AS (SELECT N1.n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS N1 (n)),

N2 AS (SELECT L.n FROM N1 AS L CROSS JOIN N1 AS R),

N3 AS (SELECT L.n FROM N2 AS L CROSS JOIN N2 AS R),

N4 AS (SELECT L.n FROM N3 AS L CROSS JOIN N3 AS R),

N AS (SELECT ROW_NUMBER() OVER (ORDER BY @@SPID) AS n FROM N4)

-- Query body

SELECT

A.*,

B.*

FROM #TempData AS A

CROSS APPLY

(

-- Numbers from 1 to the length of the current codes string

SELECT TOP (LEN(A.codes))

N.n

FROM N

ORDER BY N.n

) AS P (pos)

JOIN #TempCodes AS B ON

-- Full length match between code and codes string

SUBSTRING(A.codes, P.pos, LEN(B.CodeRef)) = B.CodeRef

-- Following character is not 0-9 or A-Z (includes end of string case)

AND SUBSTRING(A.codes, P.pos + LEN(B.CodeRef), 1) NOT LIKE '[0-9A-Z]'

ORDER BY

A.Reference

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

danielfountain SSCarpal Tunnel Points: 4229 More actions · Answer 8

Paul,

Sorry for the late response.

That - is brilliant. Genious thinking. Thank you ever so much. I never thought of it quite like that.

I am not 100% sure what the numbers table is exactly doing - but i will look at it in more detail shortly.

Thank you so much for your time.

Dan

Paul White SSC Guru Points: 150467 More actions · Answer 9

danielfountain (2/1/2012)
I am not 100% sure what the numbers table is exactly doing - but i will look at it in more detail shortly.

Thanks. If you mean the weird-looking code in the common table expression (the WITH section), it's just a way of generating a sequence of numbers out of thin air. I think it was Itzik Ben-Gan that first came up with it. Of course, you probably have a real numbers table somewhere in your database so you could/should use that instead. Hopefully the rest of the code & logic is reasonably transparent; let me know if there's anything you need clarification on. Interesting problem.

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

danielfountain SSCarpal Tunnel Points: 4229 More actions · Answer 10

Hmmm.... ok 99% there.

Some bright spark has added in something like this into the codes.

C01 (Red)

So the code picks up the red, but also picks up yellow from the end of the word Red.

How your code checks that the next character is not alpha numeric, is it possible to add to check that the previous character is not alphanumeric too?

I tried... i cant work it out

Thanks again

Dan

ameda16 Valued Member Points: 68 More actions · Answer 11

CREATE TABLE TempCodes (CodeRef VARCHAR(3), DESCRIPTION VARCHAR(10))

INSERT INTO TempCodes VALUES ('C01','Red')

INSERT INTO TempCodes VALUES ('C','Green')

INSERT INTO TempCodes VALUES ('D9','purple')

INSERT INTO TempCodes VALUES ('D','yellow')

INSERT INTO TempCodes VALUES ('C08','black')

CREATE TABLE TempData (Reference INT, codes VARCHAR(30))

INSERT INTO TempData VALUES (1,'CO1,C & D9 10/10/2011')

INSERT INTO TempData VALUES (2,'2011')

INSERT INTO TempData VALUES (3,'C03,D')

INSERT INTO TempData VALUES (4,'')

INSERT INTO TempData VALUES (5,'D')

INSERT INTO TempData VALUES (6,'C03/D/d9/c01/c')

/*

Assumptions:

1.CodeRef in TempCodes has the first letter as alphabet and remaining letters as numbers

2.CodeRef is always in UPPER case

*/

/* Cleaning data*/

---create table TempDataSplit: to split the codes----

create table #TempDataSplit

(

Reference INT, codes VARCHAR(30)

)

truncate table #TempDataSplit

/*Creating cursor to split the codes row by row*/

declare @Reference int

declare @Codes varchar(30)

declare split cursor for

select Reference,Replace(Replace(REPLACE(UPPER(codes),'/',','),' ',','),'&',',') as codes from TempData

open split

fetch next from split into @Reference,@codes

WHILE @@FETCH_STATUS = 0

BEGIN

DECLARE@pos int,

@nextpos int,

@valuelen int

SELECT @pos = 0, @nextpos = 1

WHILE @nextpos > 0

BEGIN

SELECT @nextpos = charindex(',', @codes, @pos + 1)

SELECT @valuelen = CASE WHEN @nextpos > 0THEN @nextpos

ELSE len(@codes) + 1

END - @pos - 1

INSERT into #TempDataSplit (Reference, codes)

VALUES ( @Reference,substring(@codes, @pos + 1, @valuelen))

SELECT @pos = @nextpos

END

fetch next from split into @Reference,@codes

END

close split

deallocate split

/* observe the split table*/

select * from #TempDataSplit

--further cleansing of data and obtaining required result--

select * from TempCodes A

join

(select Reference ,substring(codes,1,1)+replace(substring(codes,2,LEN(codes)),'O','0') as codes from #TempDataSplit)B

on A.CodeRef=B.codes

Hope this helps you.

Paul White SSC Guru Points: 150467 More actions · Answer 12

danielfountain (2/1/2012)
Hmmm.... ok 99% there.
Some bright spark has added in something like this into the codes.
C01 (Red)
So the code picks up the red, but also picks up yellow from the end of the word Red.
How your code checks that the next character is not alpha numeric, is it possible to add to check that the previous character is not alphanumeric too?
I tried... i cant work it out

Sorry Dan for the late reply:

-- Standard in-line numbers table

WITH

N1 AS (SELECT N1.n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS N1 (n)),

N2 AS (SELECT L.n FROM N1 AS L CROSS JOIN N1 AS R),

N3 AS (SELECT L.n FROM N2 AS L CROSS JOIN N2 AS R),

N4 AS (SELECT L.n FROM N3 AS L CROSS JOIN N3 AS R),

N AS (SELECT ROW_NUMBER() OVER (ORDER BY @@SPID) AS n FROM N4)

-- Query body

SELECT

A.*,

B.*

FROM #TempData AS A

CROSS APPLY

(

-- Numbers from 1 to the length of the current codes string

SELECT TOP (LEN(A.codes))

N.n

FROM N

ORDER BY N.n

) AS P (pos)

JOIN #TempCodes AS B ON

-- Full length match between code and codes string

SUBSTRING(A.codes, P.pos, LEN(B.CodeRef)) = B.CodeRef

-- Following character is not 0-9 or A-Z (includes end of string case)

AND SUBSTRING(A.codes, P.pos + LEN(B.CodeRef), 1) NOT LIKE '[0-9A-Z]'

-- Previous character is not 0-9 or A-Z

AND SUBSTRING(A.codes, P.pos - 1, 1) NOT LIKE '[0-9A-Z]'

ORDER BY

A.Reference;

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi