January 6, 2012 at 3:53 am
Hey all,
Well we have all been there. A database design and data that is pooorly designed and no way to change it. Well i have an odd one - and i am not sure its possible to fix. Here is an example:
CREATE TABLE #TempCodes (CodeRef VARCHAR(3), DESCRIPTION VARCHAR(10))
INSERT INTO #TempCodes VALUES ('C01','Red')
INSERT INTO #TempCodes VALUES ('C','Green')
INSERT INTO #TempCodes VALUES ('D9','purple')
INSERT INTO #TempCodes VALUES ('D','yellow')
INSERT INTO #TempCodes VALUES ('C08','black')
CREATE TABLE #TempData (Reference INT, codes VARCHAR(30))
INSERT INTO #TempData VALUES (1,'CO1,C & D9 10/10/2011')
INSERT INTO #TempData VALUES (2,'2011')
INSERT INTO #TempData VALUES (3,'C03,D')
INSERT INTO #TempData VALUES (4,'')
INSERT INTO #TempData VALUES (5,'D')
INSERT INTO #TempData VALUES (6,'C03/D/d9/c01/c')
SELECT *
FROM #TempData A
LEFT outer JOIN
#TempCodes B
ON
CHARINDEX (B.CodeRef,A.codes) <> 0
So effectively there is a free format field which has codes typed into it - and sometimes other stuff, sometimes wrong codes as well. Bascially if the code exists in that field then i need to join to the #tempcodes table on that.
So with my select at the bottom i have 2 issues.
1) Because charindex returns 0 for no match - if the code is the first thing in the list its not joining.
2) There are codes that have similar data so C and C01. because there is a C in C01 its matching. I dont want it too.
however because its free format field people will put comma separated, slash separated, semi colon, colon, space.... you name it.
The more i write the less possible it sounds. Anyone able to prove me wrong?
Thanks for reading my plight.
Dan
January 6, 2012 at 4:22 am
Dan
I feel your pain. If you don't have the option to change the database, do you have the option to change the application? That way, you could perform some basic validation on what goes into the database. It may also be worth performing a data cleansing exercise on your existing data, so that only one delimiter is used. Once you have all that, you can use a splitter function to split each row into its individual codes. Far from ideal, but better than what you have at the moment.
John
January 6, 2012 at 4:24 am
Without specifically capturing dates, you might get some milage from this:
SELECT a.Reference, x1.codes, y.[DESCRIPTION]
FROM #TempData a
CROSS APPLY(
SELECT codes = REPLACE(REPLACE(REPLACE(a.codes,'& ',''),'/',','),' ',',')
) x1
OUTER APPLY (
SELECT b.[DESCRIPTION]
FROM #TempCodes B
CROSS APPLY dbo.DelimitedSplit8K (x1.codes,',') split
WHERE b.CodeRef = split.Item
) y
dbo.DelimitedSplit8K is a string splitter function devised by Jeff Moden et al, published here[/url].
Edit - added link to article.
For better assistance in answering your questions, please read this[/url].
Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]
January 6, 2012 at 9:25 am
I would suggest doing it something like this:
SELECT *
FROM #TempData A
LEFT JOIN #TempCodes B
ON
A.codes like '%'+CodeRef'+'%'
I'd put any regex directives (ie '%') right in the CodeRef column so you don't have to tack the percent character at the beginning and end of the coderef column in the join. It would also allow you to be more sepecific, say the code has to be at the begging, or the end, or could be anywhere etc.
The probability of survival is inversely proportional to the angle of arrival.
January 6, 2012 at 9:36 am
If replacing potential delimiters is sufficient, then there no need for splitting the field. Just replace potential delimiters with a unified delimiter and surround the code to search for with that delimiter, i.e.
; WITH Cte AS
(
SELECT
Reference, codes,
-- Replace any character (including spaces) that might separate a code with a '|'.
delimited_codes = '|' + REPLACE(REPLACE(REPLACE(REPLACE(codes, ' ', '|'), ',', '|'), '&', '|'), '/', '|') + '|'
FROM
#TempData
)
SELECT
*
FROM
Cte A
LEFT outer JOIN
#TempCodes B ON CHARINDEX ('|' + B.CodeRef + '|' ,A.delimited_codes) > 0
P.S. Am not quite sure what was meant with the first issue. Isn't there just a typo in the sample data. #TempCodes contains 'C03' (with a zero) while codes in the first row contains 'CO3' (with an Ohh).
January 6, 2012 at 9:41 am
Thanks for all suggestions.
Unfortunately no i cant change front end. Only thing i have a chance to do is change what people type in - and you know how well that works.
In terms of the Like join - that doesnt unfortunately help. It still brings back the exactly same 12 records (not the ones right at the begining and still has the problem with the C01 and C).
I think my only chance may be to split it - i had concidered it, but i am concerned what end users might use to delimiter!
@,.!"£$%^&
that isnt me swearing - thats me thinking what they "could" do!! 🙂
Ta
Dan
January 6, 2012 at 9:10 pm
For small amounts of data it's not a big deal, but any kind of string manipulation gets horrible really fast in T-SQL if you have any kind of volume of data. Is CLR an option for you? I would write a CLR that would parse out all the possible values from your garbage field ... just figure out everything that can split values like spaces slashes, ampersands, etc. Process that for each record inserted into a new one to many table. Then add a prioritization for the possible values or just figure out a simple way (such as alphabetical) to pick out a single one in case multiple values match. At that point you can simply join and end up with scalable performance.
If you can't do the CLR, you can still basically do the same thing with T-SQL, it just won't perform as well.
January 8, 2012 at 9:22 am
This seems to work reasonably well:
-- Standard in-line numbers table
WITH
N1 AS (SELECT N1.n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS N1 (n)),
N2 AS (SELECT L.n FROM N1 AS L CROSS JOIN N1 AS R),
N3 AS (SELECT L.n FROM N2 AS L CROSS JOIN N2 AS R),
N4 AS (SELECT L.n FROM N3 AS L CROSS JOIN N3 AS R),
N AS (SELECT ROW_NUMBER() OVER (ORDER BY @@SPID) AS n FROM N4)
-- Query body
SELECT
A.*,
B.*
FROM #TempData AS A
CROSS APPLY
(
-- Numbers from 1 to the length of the current codes string
SELECT TOP (LEN(A.codes))
N.n
FROM N
ORDER BY N.n
) AS P (pos)
JOIN #TempCodes AS B ON
-- Full length match between code and codes string
SUBSTRING(A.codes, P.pos, LEN(B.CodeRef)) = B.CodeRef
-- Following character is not 0-9 or A-Z (includes end of string case)
AND SUBSTRING(A.codes, P.pos + LEN(B.CodeRef), 1) NOT LIKE '[0-9A-Z]'
ORDER BY
A.Reference
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
February 1, 2012 at 4:50 am
Paul,
Sorry for the late response.
That - is brilliant. Genious thinking. Thank you ever so much. I never thought of it quite like that.
I am not 100% sure what the numbers table is exactly doing - but i will look at it in more detail shortly.
Thank you so much for your time.
Dan
February 1, 2012 at 5:52 am
danielfountain (2/1/2012)
I am not 100% sure what the numbers table is exactly doing - but i will look at it in more detail shortly.
Thanks. If you mean the weird-looking code in the common table expression (the WITH section), it's just a way of generating a sequence of numbers out of thin air. I think it was Itzik Ben-Gan that first came up with it. Of course, you probably have a real numbers table somewhere in your database so you could/should use that instead. Hopefully the rest of the code & logic is reasonably transparent; let me know if there's anything you need clarification on. Interesting problem.
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
February 1, 2012 at 9:45 am
Hmmm.... ok 99% there.
Some bright spark has added in something like this into the codes.
C01 (Red)
So the code picks up the red, but also picks up yellow from the end of the word Red.
How your code checks that the next character is not alpha numeric, is it possible to add to check that the previous character is not alphanumeric too?
I tried... i cant work it out
Thanks again
Dan
February 2, 2012 at 9:15 am
CREATE TABLE TempCodes (CodeRef VARCHAR(3), DESCRIPTION VARCHAR(10))
INSERT INTO TempCodes VALUES ('C01','Red')
INSERT INTO TempCodes VALUES ('C','Green')
INSERT INTO TempCodes VALUES ('D9','purple')
INSERT INTO TempCodes VALUES ('D','yellow')
INSERT INTO TempCodes VALUES ('C08','black')
CREATE TABLE TempData (Reference INT, codes VARCHAR(30))
INSERT INTO TempData VALUES (1,'CO1,C & D9 10/10/2011')
INSERT INTO TempData VALUES (2,'2011')
INSERT INTO TempData VALUES (3,'C03,D')
INSERT INTO TempData VALUES (4,'')
INSERT INTO TempData VALUES (5,'D')
INSERT INTO TempData VALUES (6,'C03/D/d9/c01/c')
/*
Assumptions:
1.CodeRef in TempCodes has the first letter as alphabet and remaining letters as numbers
2.CodeRef is always in UPPER case
*/
/* Cleaning data*/
---create table TempDataSplit: to split the codes----
create table #TempDataSplit
(
Reference INT, codes VARCHAR(30)
)
truncate table #TempDataSplit
/*Creating cursor to split the codes row by row*/
declare @Reference int
declare @Codes varchar(30)
declare split cursor for
select Reference,Replace(Replace(REPLACE(UPPER(codes),'/',','),' ',','),'&',',') as codes from TempData
open split
fetch next from split into @Reference,@codes
WHILE @@FETCH_STATUS = 0
BEGIN
DECLARE@pos int,
@nextpos int,
@valuelen int
SELECT @pos = 0, @nextpos = 1
WHILE @nextpos > 0
BEGIN
SELECT @nextpos = charindex(',', @codes, @pos + 1)
SELECT @valuelen = CASE WHEN @nextpos > 0THEN @nextpos
ELSE len(@codes) + 1
END - @pos - 1
INSERT into #TempDataSplit (Reference, codes)
VALUES ( @Reference,substring(@codes, @pos + 1, @valuelen))
SELECT @pos = @nextpos
END
fetch next from split into @Reference,@codes
END
close split
deallocate split
/* observe the split table*/
select * from #TempDataSplit
--further cleansing of data and obtaining required result--
select * from TempCodes A
join
(select Reference ,substring(codes,1,1)+replace(substring(codes,2,LEN(codes)),'O','0') as codes from #TempDataSplit)B
on A.CodeRef=B.codes
Hope this helps you.
February 2, 2012 at 9:26 am
danielfountain (2/1/2012)
Hmmm.... ok 99% there.Some bright spark has added in something like this into the codes.
C01 (Red)
So the code picks up the red, but also picks up yellow from the end of the word Red.
How your code checks that the next character is not alpha numeric, is it possible to add to check that the previous character is not alphanumeric too?
I tried... i cant work it out
Sorry Dan for the late reply:
-- Standard in-line numbers table
WITH
N1 AS (SELECT N1.n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS N1 (n)),
N2 AS (SELECT L.n FROM N1 AS L CROSS JOIN N1 AS R),
N3 AS (SELECT L.n FROM N2 AS L CROSS JOIN N2 AS R),
N4 AS (SELECT L.n FROM N3 AS L CROSS JOIN N3 AS R),
N AS (SELECT ROW_NUMBER() OVER (ORDER BY @@SPID) AS n FROM N4)
-- Query body
SELECT
A.*,
B.*
FROM #TempData AS A
CROSS APPLY
(
-- Numbers from 1 to the length of the current codes string
SELECT TOP (LEN(A.codes))
N.n
FROM N
ORDER BY N.n
) AS P (pos)
JOIN #TempCodes AS B ON
-- Full length match between code and codes string
SUBSTRING(A.codes, P.pos, LEN(B.CodeRef)) = B.CodeRef
-- Following character is not 0-9 or A-Z (includes end of string case)
AND SUBSTRING(A.codes, P.pos + LEN(B.CodeRef), 1) NOT LIKE '[0-9A-Z]'
-- Previous character is not 0-9 or A-Z
AND SUBSTRING(A.codes, P.pos - 1, 1) NOT LIKE '[0-9A-Z]'
ORDER BY
A.Reference;
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
Viewing 13 posts - 1 through 12 (of 12 total)
You must be logged in to reply to this topic. Login to reply