July 20, 2011 at 10:59 am
Hello Everyone
I am working on some Health Care Data, I know, you don't have to rub it in.
Some of the data that is being pumped in from HL7(who ever came up with this standard, needs to have their @$$ removed). Some of the data has what looks like XML tags all thru it
example: <OBX.5><OBX.5.1>Blah Blah Data</OBX.5.1></OBX.5>
And of course with the ending tags as well.
I need to remove all of the tags. Anyone has an idea of how I can remove all the tags from within a column? So that the data comes out to be only the
Blah Blah Data
There may be many of the tags, with all different numbers inside.
Thank You in advance
Andrew SQLDBA
July 20, 2011 at 11:04 am
Have you tried the REPLACE() function?
July 20, 2011 at 11:09 am
The long and nasty way is to loop through the characters in the data.
Use loops and find first and last '<', '>'.
if < & > are not in the actual data it is fairly simple to remove first and last for each <obx.5> which works for <tag> and </mytag>
This could take awhile depending upon how many columns. I would prefer to clean the chaff out of the target table and then put a stripping step in the data import.
good luck
July 20, 2011 at 11:13 am
Yes, I have thought of that. But what I am not getting is all of the number within the tags.
So my data looks like this if there are more than 2 digits within the tags originally.
<.5.1>
How can I ensure that everything within the tags, and including the tags get replaced?
Andrew SQLDBA
July 20, 2011 at 11:17 am
i have this saved in my snippets for stripping out HTML tags. note this assumes that well formed html is in palce...so % lt ; should be there for < symbols that exist in the text data portion.
declare @htmlText varchar(max)
--===== Replace all HTML tags with nothing
WHILE CHARINDEX('<',@HTMLText) > 0
SELECT @HTMLText = STUFF(@HTMLText,
CHARINDEX('<',@HTMLText),
CHARINDEX('>',@HTMLText)-CHARINDEX('<',@HTMLText)+1,
'')
SELECT @HTMLText
Lowell
July 20, 2011 at 11:18 am
While looping through the characters in the row.
Find the first open bracket, do not include any data until you have went past the closing bracket.
Note: if the input is consistent it can be even simpler.
<Tag1><tag1.1>yada</tag1.1><Tag1> Consistent meaning the Yada is the only data you are looking for. Loop until you get to a character after '>' that is not '<' use that character string until you get to the next '<' done.
July 20, 2011 at 11:22 am
There are hundreds of tags throughout the column, and the tags are not in the same location in each row.
Any advice?
Thanks
Andrew SQLDBA
July 20, 2011 at 11:23 am
If you had a couple of different sanitized examples it would give a better picture. thx
July 20, 2011 at 11:24 am
Why would a combination of REPLACE(), LIKE() and PATINDEX() not work here?
EDIT: Like in Lowell's example? It's set based, no looping. Looping bad if you don't need to do it.
July 20, 2011 at 11:26 am
I forgot to mention this, but the column is a text data type, and replace is not working with Text data type.
Andrew SQLDBA
July 20, 2011 at 11:30 am
AndrewSQLDBA (7/20/2011)
I forgot to mention this, but the column is a text data type, and replace is not working with Text data type.
Can you update it to VARCHAR(MAX) without breaking anything?
Or add a VARCHAR(MAX) column to the table, copy over the text data into the new column, then use the REPLACE()?
July 20, 2011 at 11:31 am
Cast your text type as a varchar(max)
July 20, 2011 at 11:42 am
ok here's a tested working example...that i reworked quickly because you changed the datatype tyo TEXT.
i'm assuming this'll be a one time update, so it might be slow on a MillionBillion rows of data:
--DROP TABLE MySampleData
CREATE TABLE MySampleData (ID INT,HTMLText TEXT)
INSERT INTO MySampleData(ID,HTMLText)
SELECT 1,'NoHTML Inside' UNION ALL
SELECT 2, '<Tag1><tag1.1>yada</tag1.1><Tag1>' UNION ALL
SELECT 3 ,'<OBX.5><OBX.5.1>Blah Blah Data</OBX.5.1></OBX.5>Other Stuff outside of the tags'
DECLARE @TagsExist INT
SET @TagsExist = 1
WHILE @TagsExist > 0
BEGIN
IF NOT EXISTS(SELECT 1 FROM MySampleData WHERE CHARINDEX('<',CONVERT(VARCHAR(MAX),HTMLText)) > 0)
BEGIN
SET @TagsExist = 0
END
UPDATE MySampleData
SET HTMLText = STUFF(CONVERT(VARCHAR(MAX),HTMLText),
CHARINDEX('<',CONVERT(VARCHAR(MAX),HTMLText)),
CHARINDEX('>',CONVERT(VARCHAR(MAX),HTMLText)) - CHARINDEX('<',CONVERT(VARCHAR(MAX),HTMLText))+ 1,
'')
WHERE CHARINDEX('<',CONVERT(VARCHAR(MAX),HTMLText)) > 0
END --while loop
SELECT * FROM MySampleData
Lowell
July 20, 2011 at 11:49 am
Nice snippet Lowell.
That looks like a nice piece of code to place in the tool chest!
July 20, 2011 at 11:51 am
What happens if the text outside the tags containt < or > ?
I never fully tackled this problem, but I know it's far from simple.
Moreover I think that RBARing this might be a good option to try. Right now you're hitting and updating the same row a ton of times. I'd preffer using a itvf or even a simple function, loop in it and then update the base row only once (read / write once max).
Viewing 15 posts - 1 through 15 (of 24 total)
You must be logged in to reply to this topic. Login to reply