Join - that may not be possible

  • Hey all,

    Well we have all been there. A database design and data that is pooorly designed and no way to change it. Well i have an odd one - and i am not sure its possible to fix. Here is an example:

    CREATE TABLE #TempCodes (CodeRef VARCHAR(3), DESCRIPTION VARCHAR(10))

    INSERT INTO #TempCodes VALUES ('C01','Red')

    INSERT INTO #TempCodes VALUES ('C','Green')

    INSERT INTO #TempCodes VALUES ('D9','purple')

    INSERT INTO #TempCodes VALUES ('D','yellow')

    INSERT INTO #TempCodes VALUES ('C08','black')

    CREATE TABLE #TempData (Reference INT, codes VARCHAR(30))

    INSERT INTO #TempData VALUES (1,'CO1,C & D9 10/10/2011')

    INSERT INTO #TempData VALUES (2,'2011')

    INSERT INTO #TempData VALUES (3,'C03,D')

    INSERT INTO #TempData VALUES (4,'')

    INSERT INTO #TempData VALUES (5,'D')

    INSERT INTO #TempData VALUES (6,'C03/D/d9/c01/c')

    SELECT *

    FROM #TempData A

    LEFT outer JOIN

    #TempCodes B

    ON

    CHARINDEX (B.CodeRef,A.codes) <> 0

    So effectively there is a free format field which has codes typed into it - and sometimes other stuff, sometimes wrong codes as well. Bascially if the code exists in that field then i need to join to the #tempcodes table on that.

    So with my select at the bottom i have 2 issues.

    1) Because charindex returns 0 for no match - if the code is the first thing in the list its not joining.

    2) There are codes that have similar data so C and C01. because there is a C in C01 its matching. I dont want it too.

    however because its free format field people will put comma separated, slash separated, semi colon, colon, space.... you name it.

    The more i write the less possible it sounds. Anyone able to prove me wrong?

    Thanks for reading my plight.

    Dan

  • Dan

    I feel your pain. If you don't have the option to change the database, do you have the option to change the application? That way, you could perform some basic validation on what goes into the database. It may also be worth performing a data cleansing exercise on your existing data, so that only one delimiter is used. Once you have all that, you can use a splitter function to split each row into its individual codes. Far from ideal, but better than what you have at the moment.

    John

  • Without specifically capturing dates, you might get some milage from this:

    SELECT a.Reference, x1.codes, y.[DESCRIPTION]

    FROM #TempData a

    CROSS APPLY(

    SELECT codes = REPLACE(REPLACE(REPLACE(a.codes,'& ',''),'/',','),' ',',')

    ) x1

    OUTER APPLY (

    SELECT b.[DESCRIPTION]

    FROM #TempCodes B

    CROSS APPLY dbo.DelimitedSplit8K (x1.codes,',') split

    WHERE b.CodeRef = split.Item

    ) y

    dbo.DelimitedSplit8K is a string splitter function devised by Jeff Moden et al, published here[/url].

    Edit - added link to article.


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • I would suggest doing it something like this:

    SELECT *

    FROM #TempData A

    LEFT JOIN #TempCodes B

    ON

    A.codes like '%'+CodeRef'+'%'

    I'd put any regex directives (ie '%') right in the CodeRef column so you don't have to tack the percent character at the beginning and end of the coderef column in the join. It would also allow you to be more sepecific, say the code has to be at the begging, or the end, or could be anywhere etc.

    The probability of survival is inversely proportional to the angle of arrival.

  • If replacing potential delimiters is sufficient, then there no need for splitting the field. Just replace potential delimiters with a unified delimiter and surround the code to search for with that delimiter, i.e.

    ; WITH Cte AS

    (

    SELECT

    Reference, codes,

    -- Replace any character (including spaces) that might separate a code with a '|'.

    delimited_codes = '|' + REPLACE(REPLACE(REPLACE(REPLACE(codes, ' ', '|'), ',', '|'), '&', '|'), '/', '|') + '|'

    FROM

    #TempData

    )

    SELECT

    *

    FROM

    Cte A

    LEFT outer JOIN

    #TempCodes B ON CHARINDEX ('|' + B.CodeRef + '|' ,A.delimited_codes) > 0

    P.S. Am not quite sure what was meant with the first issue. Isn't there just a typo in the sample data. #TempCodes contains 'C03' (with a zero) while codes in the first row contains 'CO3' (with an Ohh).

  • Thanks for all suggestions.

    Unfortunately no i cant change front end. Only thing i have a chance to do is change what people type in - and you know how well that works.

    In terms of the Like join - that doesnt unfortunately help. It still brings back the exactly same 12 records (not the ones right at the begining and still has the problem with the C01 and C).

    I think my only chance may be to split it - i had concidered it, but i am concerned what end users might use to delimiter!

    @,.!"£$%^&

    that isnt me swearing - thats me thinking what they "could" do!! 🙂

    Ta

    Dan

  • For small amounts of data it's not a big deal, but any kind of string manipulation gets horrible really fast in T-SQL if you have any kind of volume of data. Is CLR an option for you? I would write a CLR that would parse out all the possible values from your garbage field ... just figure out everything that can split values like spaces slashes, ampersands, etc. Process that for each record inserted into a new one to many table. Then add a prioritization for the possible values or just figure out a simple way (such as alphabetical) to pick out a single one in case multiple values match. At that point you can simply join and end up with scalable performance.

    If you can't do the CLR, you can still basically do the same thing with T-SQL, it just won't perform as well.

    └> bt



    Forum Etiquette: How to post data/code on a forum to get the best help[/url]

  • This seems to work reasonably well:

    -- Standard in-line numbers table

    WITH

    N1 AS (SELECT N1.n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS N1 (n)),

    N2 AS (SELECT L.n FROM N1 AS L CROSS JOIN N1 AS R),

    N3 AS (SELECT L.n FROM N2 AS L CROSS JOIN N2 AS R),

    N4 AS (SELECT L.n FROM N3 AS L CROSS JOIN N3 AS R),

    N AS (SELECT ROW_NUMBER() OVER (ORDER BY @@SPID) AS n FROM N4)

    -- Query body

    SELECT

    A.*,

    B.*

    FROM #TempData AS A

    CROSS APPLY

    (

    -- Numbers from 1 to the length of the current codes string

    SELECT TOP (LEN(A.codes))

    N.n

    FROM N

    ORDER BY N.n

    ) AS P (pos)

    JOIN #TempCodes AS B ON

    -- Full length match between code and codes string

    SUBSTRING(A.codes, P.pos, LEN(B.CodeRef)) = B.CodeRef

    -- Following character is not 0-9 or A-Z (includes end of string case)

    AND SUBSTRING(A.codes, P.pos + LEN(B.CodeRef), 1) NOT LIKE '[0-9A-Z]'

    ORDER BY

    A.Reference

  • Paul,

    Sorry for the late response.

    That - is brilliant. Genious thinking. Thank you ever so much. I never thought of it quite like that.

    I am not 100% sure what the numbers table is exactly doing - but i will look at it in more detail shortly.

    Thank you so much for your time.

    Dan

  • danielfountain (2/1/2012)


    I am not 100% sure what the numbers table is exactly doing - but i will look at it in more detail shortly.

    Thanks. If you mean the weird-looking code in the common table expression (the WITH section), it's just a way of generating a sequence of numbers out of thin air. I think it was Itzik Ben-Gan that first came up with it. Of course, you probably have a real numbers table somewhere in your database so you could/should use that instead. Hopefully the rest of the code & logic is reasonably transparent; let me know if there's anything you need clarification on. Interesting problem.

  • Hmmm.... ok 99% there.

    Some bright spark has added in something like this into the codes.

    C01 (Red)

    So the code picks up the red, but also picks up yellow from the end of the word Red.

    How your code checks that the next character is not alpha numeric, is it possible to add to check that the previous character is not alphanumeric too?

    I tried... i cant work it out

    Thanks again

    Dan

  • CREATE TABLE TempCodes (CodeRef VARCHAR(3), DESCRIPTION VARCHAR(10))

    INSERT INTO TempCodes VALUES ('C01','Red')

    INSERT INTO TempCodes VALUES ('C','Green')

    INSERT INTO TempCodes VALUES ('D9','purple')

    INSERT INTO TempCodes VALUES ('D','yellow')

    INSERT INTO TempCodes VALUES ('C08','black')

    CREATE TABLE TempData (Reference INT, codes VARCHAR(30))

    INSERT INTO TempData VALUES (1,'CO1,C & D9 10/10/2011')

    INSERT INTO TempData VALUES (2,'2011')

    INSERT INTO TempData VALUES (3,'C03,D')

    INSERT INTO TempData VALUES (4,'')

    INSERT INTO TempData VALUES (5,'D')

    INSERT INTO TempData VALUES (6,'C03/D/d9/c01/c')

    /*

    Assumptions:

    1.CodeRef in TempCodes has the first letter as alphabet and remaining letters as numbers

    2.CodeRef is always in UPPER case

    */

    /* Cleaning data*/

    ---create table TempDataSplit: to split the codes----

    create table #TempDataSplit

    (

    Reference INT, codes VARCHAR(30)

    )

    truncate table #TempDataSplit

    /*Creating cursor to split the codes row by row*/

    declare @Reference int

    declare @Codes varchar(30)

    declare split cursor for

    select Reference,Replace(Replace(REPLACE(UPPER(codes),'/',','),' ',','),'&',',') as codes from TempData

    open split

    fetch next from split into @Reference,@codes

    WHILE @@FETCH_STATUS = 0

    BEGIN

    DECLARE@pos int,

    @nextpos int,

    @valuelen int

    SELECT @pos = 0, @nextpos = 1

    WHILE @nextpos > 0

    BEGIN

    SELECT @nextpos = charindex(',', @codes, @pos + 1)

    SELECT @valuelen = CASE WHEN @nextpos > 0THEN @nextpos

    ELSE len(@codes) + 1

    END - @pos - 1

    INSERT into #TempDataSplit (Reference, codes)

    VALUES ( @Reference,substring(@codes, @pos + 1, @valuelen))

    SELECT @pos = @nextpos

    END

    fetch next from split into @Reference,@codes

    END

    close split

    deallocate split

    /* observe the split table*/

    select * from #TempDataSplit

    --further cleansing of data and obtaining required result--

    select * from TempCodes A

    join

    (select Reference ,substring(codes,1,1)+replace(substring(codes,2,LEN(codes)),'O','0') as codes from #TempDataSplit)B

    on A.CodeRef=B.codes

    Hope this helps you.

  • danielfountain (2/1/2012)


    Hmmm.... ok 99% there.

    Some bright spark has added in something like this into the codes.

    C01 (Red)

    So the code picks up the red, but also picks up yellow from the end of the word Red.

    How your code checks that the next character is not alpha numeric, is it possible to add to check that the previous character is not alphanumeric too?

    I tried... i cant work it out

    Sorry Dan for the late reply:

    -- Standard in-line numbers table

    WITH

    N1 AS (SELECT N1.n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS N1 (n)),

    N2 AS (SELECT L.n FROM N1 AS L CROSS JOIN N1 AS R),

    N3 AS (SELECT L.n FROM N2 AS L CROSS JOIN N2 AS R),

    N4 AS (SELECT L.n FROM N3 AS L CROSS JOIN N3 AS R),

    N AS (SELECT ROW_NUMBER() OVER (ORDER BY @@SPID) AS n FROM N4)

    -- Query body

    SELECT

    A.*,

    B.*

    FROM #TempData AS A

    CROSS APPLY

    (

    -- Numbers from 1 to the length of the current codes string

    SELECT TOP (LEN(A.codes))

    N.n

    FROM N

    ORDER BY N.n

    ) AS P (pos)

    JOIN #TempCodes AS B ON

    -- Full length match between code and codes string

    SUBSTRING(A.codes, P.pos, LEN(B.CodeRef)) = B.CodeRef

    -- Following character is not 0-9 or A-Z (includes end of string case)

    AND SUBSTRING(A.codes, P.pos + LEN(B.CodeRef), 1) NOT LIKE '[0-9A-Z]'

    -- Previous character is not 0-9 or A-Z

    AND SUBSTRING(A.codes, P.pos - 1, 1) NOT LIKE '[0-9A-Z]'

    ORDER BY

    A.Reference;

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply