Parsing a Text field using TSQL

Question

Parsing a Text field using TSQL

brendt hess

Hall of Fame

Points: 3256
More actions
May 31, 2006 at 10:52 am

#153009

I am looking over a database that I have inherited, and there is one field that I need to work with that is being a pain.
This is a Text field with multiple "records" with multiple values in each appended into it. These records / values are of the format:
AAAA;BBBB;CCCC|aaa;BBBBBBB;xxxxx|aaaaaaa;bbb;xx
Each "record" is separated by a pipe character, and each field is separated by a semicolon.
Now, in cases where the Datalength of the field is 8k or less, this is relatively simple to parse. However, parsing more than 8k is difficult, and made more so by the fact that the READTEXT function will not place the text read into a variable. So, my question is:
How would you parse this data, when more than 8k exists in the TEXT field, using only T-SQL.
Thanks in advance,
Brendt

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

stax68 SSChampion Points: 11711 More actions · Answer 1

This is the best I have come up with so far. The first version works for a single row, the second is a version which processes a whole rowset. The code assumes an element length < 100, but this could be up to 8000. It also only splits data once, so you would have to add code to do the second step (splitting the row elements into individual data values) - but that would probably be easier since (we hope) rows will be <8001 chars.

/*******************************************************/
/**********SPLIT DELIMITED LISTS IN TEXT FIELD**********/
/***********Copyright (c) Tim WIlkinson 2006************/
/************All non-commercial use allowed*************/
/*******************************************************/

--this version loops through the text in a single cell
--and produces a numbered recordset
--containing the split values in left-right order.

declare

@separator varchar(3)

declare

@retval table(item_order int identity, val varchar(100))

declare

@text table(txt text)

declare

@id int

declare @i int, @patindex int, @curpos int, @lastpos int, @len int, @val varchar(100)

select

@id = 8, @separator = ' '

--replace reference to the source table. Replace 'de.reference' with a unique key,

--and 'de.displaytext' with the text column you want to process.

insert

@text select DisplayText from dbo.Diary_Entry where reference = @id

select

@i = 1, @len = datalength(txt), @curpos = 1 from @text

while

@curpos <= @len

begin

select @lastpos=@curpos

,@patindex=patindex('%'+@separator+'%',substring(txt,@lastpos+1,101))

,@patindex=case @patindex when 0 then @len-@lastpos else @patindex end

,@val=substring(txt,@lastpos,@patindex)

,@curpos=@lastpos+@patindex+1

from @text

insert @retval(val) values(@val)

end

select

* from @retval order by item_order

go

--this one splits multiple records and produces a recordset with

--both the id of the record containing the source data,

--and an item_order number representing the order of elements in the list.

--it involves making N passes through the data,

--where N = MAX(number of elements in the list)

declare

@separator varchar(3)

declare

@retval table(rowid int, item_order int, val varchar(100))

declare

@text table(txt text)

declare

@val varchar(100), @lastpos int, @len int, @patindex int, @curpos int, @i int, @j-2 int

declare @vars table(Vlastpos int, Vpatindex int, Vpatindex2 int, Vcurpos int, Vlen int,

Vkey int, Vtext text, vVal varchar(100), vCtr int)

select

@separator = ' ', @j-2 = 0, @i = 1

insert

@vars(Vcurpos, Vlen, Vkey, Vtext)

--replace reference to the source table. Replace 'de.reference' with a unique key,

--and 'de.displaytext' with the text column you want to process.

select 1, datalength(de.displaytext), de.reference, de.displaytext from dbo.Diary_Entry de

while

@i > 0

begin

update v

set @lastpos=Vcurpos

,@len = Vlen

,@patindex=patindex('%'+@separator+'%',substring(vtext,@lastpos+1,101))

,@patindex=case @patindex when 0 then @len-@lastpos else @patindex end

,@curpos=@lastpos+@patindex+1

,Vval=substring(vtext,@lastpos,@patindex)

,Vcurpos=@curpos

,Vlen = @len - @curpos

,vCtr = @j-2

from @vars v

where Vlen > 0

select @i = @@rowcount, @j-2 = @j-2 + 1

insert @retval(rowid, val, item_order) select vkey, vval, @j-2 from @vars where vCtr = @j-2-1

end

select

* from @retval order by rowid, item_order

go

Tim Wilkinson

"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant

brendt hess Hall of Fame Points: 3256 More actions · Answer 2

Unfortunately, the rows include some with Text of over 30kb size. So, although this would work for about 65% of the rows, I am still stymied as to how to handle the remainder of the rows.

Thanks for your efforts, though. I appreciate them.

stax68 SSChampion Points: 11711 More actions · Answer 3

Looks like you have some import files stuck in your DB. You want to extract them and use an import process: bcp out, bcp in. Or perhaps ask the DTS fiddlers forum for newly wrapped ways of doing it.

Tim Wilkinson

"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant

stax68 SSChampion Points: 11711 More actions · Answer 4

Just out of interest, is the problem with the 30K text values just a performance issue, are your embedded 'rows' > 8K, or is a variable overflowing, or none of the above?

Tim Wilkinson

"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant

Sergiy SSC Guru Points: 110208 More actions · Answer 5

Data in text column is not actually data. You need to make data from it before you can use it.

Use this procedure to recover your data from the text field and place into appropriate tables.

IF NOT EXISTS (SELECT name FROM sysobjects WHERE name = N'ListDelimitedLong' AND type = 'P')

EXEC (' CREATE PROCEDURE dbo.ListDelimitedLong

@List text,

@Delim char(1)

AS

SELECT @Delim')

GO

ALTER PROC dbo.ListDelimitedLong

@List text, -- A delimiter-separated list of values, this parameter may be up to 2GB in length

@Delim char(1)

AS

SET NOCOUNT ON

DECLARE @ptrval binary(16)

DECLARE @One int

SET @One = 1

DECLARE @LeftPointer int, @RightPointer int, @SubStr varchar(50)

SET @LeftPointer = 0

DECLARE @val TABLE (IntVal bigint )

CREATE TABLE #TXTList (List text)

INSERT INTO #TXTList

SELECT @List

SELECT @ptrval = TEXTPTR(List) FROM #TXTList

WHILE @RightPointer < (SELECT DATALENGTH(List) FROM #TXTList) OR @RightPointer IS NULL

BEGIN

SELECT @RightPointer = CHARINDEX(@Delim, SUBSTRING(List, @LeftPointer+@One, DATALENGTH(List)) )

+ @LeftPointer - @One FROM #TXTList

IF @RightPointer <= @LeftPointer

SELECT @RightPointer = DATALENGTH(List) FROM #TXTList

SELECT @SubStr = SUBSTRING(List, @LeftPointer + @One, @RightPointer-@LeftPointer)

FROM #TXTList

SELECT @LeftPointer = @RightPointer + DATALENGTH(@Delim)

END

SELECT DISTINCT IntVal

FROM @val

GO

_____________
Code for TallyGenerator

stax68 SSChampion Points: 11711 More actions · Answer 6

But you can't do it setwise, of course.

MS have deprecated use of READTEXT with text pointer saying it will be phased out, and recommended using SUBSTRING instead. Does anyone know any more about this?

Tim Wilkinson

"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant

vadba SSChampion Points: 11133 More actions · Answer 7

You can use Substring to access blocks of text from a text column (or ntext). The blocks are limited by the maximum varchar sizes, which are varchar=8000 and nvarchar=4000.

The simplest way would be to loop until the amount of text retrieve with the substring() function is less than the block size (i.e. the length of the text requested in the substring function). I use Datalength to do this because Len() doesn't count trailing spaces.

The problem with using Substring() in a loop is that the data may be variable length, and thus a "record" could span two blocks of text. You could use CharIndex to locate the pipes symbols and incorporate that into the loop.

Anyway, here's a simple loop that prints out the text in query analyzer (I have set the maximum number of characters per columns in Tools|Options |Results to 8000):

DECLARE @buffer varchar(4000)

, @blocksize int

, @blockPtr int

SET @blocksize = 4000

SET @buffer = ''

SET @blockPtr = 1

WHILE 1=1

BEGIN

SELECT @buffer = Substring(TextColumn, @blockPtr, @blockSize)

FROM MyTable

WHERE pkey= <A_Key_Value>

PRINT @buffer

IF DataLength(@buffer) < @blocksize

BEGIN

BREAK

END

SET @blockPtr = @blockPtr + @blocksize

END --WHILE

stax68 SSChampion Points: 11711 More actions · Answer 8

This one will fetch a single field value at a time and sort out the row numbering in the process. So as long as none of your data values (+ trailing separator!) exceed 8000 chars, you'll be OK on that score.

--These two lines:

,@patindexF=patindex('%'+@separatorF+'%',substring(vtext,@lastpos,8000))

,@patindexR=patindex('%'+@separatorR+'%',substring(vtext,@lastpos,8000))

--require altering to allow a full 8000 characters in your data values.

More than that in a single field, and the SQL approach is reaching its limits.

You don't get a recordset per input row; the data is unpivoted into sourcerecordID, recordID, fieldID, value. You would need to create appropriate tables and pivot the data into those.

declare
--field delimiter, row delimiter
--Set up for testing on natural language text:
--sentence = record, word = field value.
--note this means it can cope with ragged 'recordsets'.
select
-------------------------------------------------------
declare
declare
,
declare @vars table(Vlastpos int, Vcurpos int, Vlen int, Vkey int, Vtext text, vVal varchar(8000),vFieldCtr int, vrowCtr int, vCtr int, vNewRow bit)
select
insert
--replace reference to the source table. Replace 'de.reference' with a unique key,
--and 'de.displaytext' with the text column you want to process.
select 1, datalength(de.displaytext), de.reference, de.displaytext,0,0,1 from dbo.Diary_Entry de
----------------------------------------------------------------
while
begin
update v
set @FirstField=vNewRow
,@lastpos=Vcurpos
,@len = Vlen
,@patindexF=patindex('%'+@separatorF+'%',substring(vtext,@lastpos,8000)) --should = max field len + len of largest delimiter.
,@patindexR=patindex('%'+@separatorR+'%',substring(vtext,@lastpos,8000))
,@LastField= case when @patindexF = 0 then 1 when @patindexR > 0 and @patindexF > @patindexR then 1 else 0 end
,@patindex=case @LastField when 1 then @patindexR else @patindexF end
,@incrlen=case @LastField when 1 then @seplenR else @seplenF end
,@patindex=case @patindex when 0 then @len-@lastpos else @patindex end
,@curpos=@lastpos+@patindex+@incrlen-1
,Vval=substring(vtext,@lastpos,@patindex-1)
,Vcurpos=@curpos
,Vlen = @len - @curpos
,vCtr = @j-2
,vRowCtr = case @FirstField when 1 then vRowCtr+1 else vRowCtr end
,vFieldCtr = case @FirstField when 1 then 1 else vFieldCtr + 1 end
,vNewRow = @LastField
from @vars v
where Vlen > 0
insert @retval(sourcerowid, val, recordid, field_order) select vkey, vval, vRowCtr, vFieldCtr
from @vars where vCtr = @j-2
select @i = @@rowcount, @j-2 = @j-2 + 1
end
select
go

Tim Wilkinson

"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant