October 14, 2004 at 11:23 am
I need help to create a script that would perform text parsing.
I have text (words separated by spaces, possibly by more than one space, and mixed with various other symbols). I need to extract all the words from the text and insert each individual word into another table.
eg.
TableA
SentenceID Sentence
000001 How are you today?
000002 What is your name?
TableB
SentenceID Word
000001 How
000001 are
000001 you
000001 today
000002 What
000002 is
000002 your
000002 name
Any help appreciated!
October 14, 2004 at 1:06 pm
Best to do that type of work using a computer language that's good for parsing text. T-SQL is not one of those languages.
Vyas describes several methods for parsing a comma-delimited parameter, which is essentially the same problem as a space-delimited sentence.
http://vyaskn.tripod.com/passing_arrays_to_stored_procedures.htm
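(Not code from the linked article, just a rough sketch of the general idea: any single-character delimiter can be handled with the same CHARINDEX walk, so a space-delimited sentence is really no different from a comma-delimited list. The sample string and variable names below are made up.)
DECLARE @String VARCHAR(8000) --the text to split
DECLARE @Delim CHAR(1) --the delimiter: ',' for a list, ' ' for a sentence
DECLARE @Pos INT
SET @String = 'How are you today?'
SET @Delim = ' '
SET @String = @String + @Delim --guarantee a trailing delimiter
SET @Pos = CHARINDEX(@Delim, @String)
WHILE @Pos > 0
BEGIN
PRINT LEFT(@String, @Pos - 1) --the next token
SET @String = SUBSTRING(@String, @Pos + 1, 8000) --drop the token and its delimiter
SET @Pos = CHARINDEX(@Delim, @String)
END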
October 14, 2004 at 9:55 pm
This is part of a sproc I wrote for multi-word text matching. I use a temp table to hold the parsed words, which I then run through cleanup filters. It runs fairly quickly. Hope it helps....Jeff
DECLARE @Position INT --Used for looping during word parsing
DECLARE @Counter INT --Loop counter for parsing the word from string
DECLARE @Company VARCHAR(100) --Company name (the string) to be parsed into words
DECLARE @WordEnd INT --String position of the end character of the word
DECLARE @Weight INT --Word weighting factor
DECLARE @WordCtr INT --Count of how many words were in the string
--Create the word list table
DECLARE @WordTable TABLE
(
Iden INT NULL,
word VARCHAR(30) NULL
)
SET @Company = '<your string here>'
--Clear the Word table of any entries
--Reset all working variables to start parsing the new string
SET @WordCtr = 1
SET @Position = DATALENGTH(@Company)
SET @wordend = DATALENGTH(@Company)
WHILE @position >= 0
BEGIN
IF (ASCII(SUBSTRING(@Company, @position, 1)) = 32) --Looks for the space character
BEGIN
INSERT @WordTable (Iden, word) VALUES (@WordCtr, SUBSTRING(@Company, @position +1, (@wordEnd -@position) ) )
SET @WordEnd = @Position - 1
SET @Position = @Position - 1
SET @WordCtr = @WordCtr +1
END --If character is a space
ELSE --Character is not a space
BEGIN
IF @Position = 0 --We have parsed to the beginning of the string, so add the first word in string
BEGIN
INSERT @WordTable (Iden, word) Values (@WordCtr, SUBSTRING(@Company, 0, @wordEnd+1) )
SET @position = @position - 1
END
ELSE --Not to the beginning of the string yet
BEGIN
SET @position = @position - 1
END
END --Character is not a space
END --While Position >=0
SELECT word FROM @WordTable
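A quick way to try the snippet above: replace the placeholder with one of the sample sentences from the original post, e.g.
SET @Company = 'How are you today?'
The final SELECT then returns the words last-word-first (today?, you, are, How) because the loop walks the string from the end; add ORDER BY Iden DESC to the SELECT if you want them back in sentence order.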
October 15, 2004 at 8:48 am
Jeff, this is awesome! I have created a sproc from this to share with my team - really good stuff!
ZenDada
October 15, 2004 at 8:53 am
Cool, glad you found it useful. I actually had to write it because I got tasked with comparing non-standard company names from two different CRM systems to get 'close matches' using a free-text search routine. What a pain...anyway, have a great day....jd
October 15, 2004 at 2:37 pm
Hi All,
Thanks for all the comments, and to Jeff for the enclosed SQL code. It is very good! Unfortunately I could not use it because of a deadline on Friday (Sydney time zone), so I was forced to create something similar to Jeff's approach!
I wrote a function:
CREATE FUNCTION fnWord ( @SentenceID CHAR( 8 ),
@Sentence VARCHAR ( 100 ) )
RETURNS @TableA TABLE ( SentenceID CHAR( 8 ), Word VARCHAR( 100 ) )
AS
BEGIN
SET @Sentence = @Sentence + ' '
DECLARE @WordStart INT
DECLARE @WordEnd INT
SET @WordStart = 1
SET @WordEnd = 1
WHILE @WordStart < LEN( @Sentence )
BEGIN
SET @WordStart = CHARINDEX( ' ', @Sentence, @WordStart )
INSERT @TableA
SELECT @SentenceID, SUBSTRING( @Sentence, @WordEnd, @WordStart - @WordEnd )
SET @WordEnd = @WordStart + 1
SET @WordStart = @WordStart + 1
END
RETURN
END
GO
I tested the function with:
SELECT *
FROM fnWord('00123483','What is your given name?')
Result is:
SentenceID Word
00123483 What
00123483 is
00123483 your
00123483 given
00123483 name?
(5 row(s) affected)
To apply the function to the full data set (15,000,000 rows, ~150 words per sentence), I used a CURSOR:
SET NOCOUNT ON
DECLARE @part varchar(8)
DECLARE @ID VARCHAR(8000)
DECLARE CTest CURSOR FOR
SELECT part_no, words
FROM #word
OPEN CTest
FETCH NEXT FROM CTest
INTO @PART, @ID
WHILE @@FETCH_STATUS = 0
BEGIN
INSERT INTO #words
SELECT * FROM fnWord( @part, @ID )
FETCH NEXT FROM CTest
INTO @PART, @id
END
CLOSE CTest
DEALLOCATE CTest
It took 50 minutes. For me that is better than I expected!!
Another challenge (I hate cursors)!
Can somebody change the cursor (row-level processing) to set-level processing using an INSERT/SELECT?
Thanks,
Milovan
October 17, 2004 at 8:20 am
Hey Milovan,
I am not sure whether you have solved your problem.
I did not understand your question. Cursors are for row-level processing; if you could elaborate on your doubt I might try to clear it up.
Anyway, here is a proc that could parse a sentence and enter the parsed words into another table:
alter proc parsing
as
declare @sentencelength int
declare @sentenceid int
declare @sentence varchar(100)
declare @count int
declare @temp_str varchar(100)
begin
    declare c1_parse cursor for select * from sentence
    open c1_parse
    fetch next from c1_parse into @sentenceid, @sentence
    while @@fetch_status = 0
    begin /* first while loop begins */
        set @count = 0
        set @sentencelength = len(@sentence)
        set @temp_str = ''
        while @count <= @sentencelength
        begin /* 2nd while loop begins */
            if (substring(@sentence, @count + 1, 1) = ' ' or @count = @sentencelength)
            begin /* if starts */
                insert into words values (@sentenceid, @temp_str)
                set @temp_str = ''
            end /* if ends */
            else
            begin /* else starts */
                set @temp_str = @temp_str + substring(@sentence, @count + 1, 1)
            end /* else ends */
            set @count = @count + 1
        end /* 2nd while loop ends */
        fetch next from c1_parse into @sentenceid, @sentence
    end /* end the first while loop */
    close c1_parse
    deallocate c1_parse
end /* end proc */
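(For anyone who wants to run the proc as-is: it assumes two tables shaped roughly like the ones below. The column names are a guess based on the SELECT * and the INSERT above.)
CREATE TABLE sentence
(
sentenceid INT,
sentence VARCHAR(100)
)
CREATE TABLE words
(
sentenceid INT,
word VARCHAR(100)
)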
Rajiv.
October 17, 2004 at 11:46 am
Hi Rajiv
Thanks for your code. I think that you misunderstood me!
The problem is not how to insert a row. As you can see, I already created the function fnWord, which extracts the words from a sentence and returns a table with the results. I like this part of the solution because it is very simple and straightforward.
What I do not like is the other part of the solution, which inserts rows into the final table #words by calling the function for each sentence (each row of the input table). For that I used a cursor. You didn't show me anything new in your example because you use a CURSOR too.
Let me be clearer with my question:
Can you apply the function to a data set (using an INSERT/SELECT construct instead of a CURSOR)?
If you can, you will make my code more efficient!!
Regards.
Milovan
October 17, 2004 at 12:36 pm
No, what you want to do can't be done with SQL 2000. You would need to join your input table to the output from the function.
Apparently the next release of SQL Server will support this, but until then we will just have to wait...
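(For what it's worth, a rough sketch of what that should look like once the next release's APPLY operator is available, assuming the fnWord function and the #word / #words tables from the earlier post:)
INSERT INTO #words
SELECT f.SentenceID, f.Word
FROM #word w
CROSS APPLY dbo.fnWord(w.part_no, w.words) AS f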
October 17, 2004 at 7:06 pm
Hi Douglas,
Thanks for your straight answer. It looks like I was trying to build a perpetual motion machine. Anyhow, thanks for giving me peace of mind!
Regards,
Milovan
January 16, 2006 at 5:33 pm
Hi folks,
Looking for a UDF that will parse a string, an address field in particular.
Say you have '123 any street'. I need each component separated.
Any feedback or thoughts will be appreciated.
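(One possible starting point, only a rough sketch: peel off the leading street number at the first space. Real addresses usually need more rules than this, and the names below are made up; it also assumes the string contains at least one space.)
DECLARE @Address VARCHAR(100)
DECLARE @FirstSpace INT
SET @Address = '123 any street'
SET @FirstSpace = CHARINDEX(' ', @Address) --position of the first space
SELECT LEFT(@Address, @FirstSpace - 1) AS StreetNumber,
LTRIM(SUBSTRING(@Address, @FirstSpace + 1, 100)) AS StreetName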
January 16, 2006 at 8:13 pm
Milovan,
I noticed that all of the other solutions, although well written and functional, contain WHILE loops. Inherently, WHILE loops are slower than set-based code, especially if you have a lot of records and a lot of words in each "sentence". So, I propose an alternative.
You need to make a Tally or Numbers table which consists of nothing but a column of numbers and it has many, many uses. Before I show you how to use it to solve your problem in a very high speed fashion, let me show you how to make a Tally table. You could, of course, make it as a temp table but it will serve you much better as a permanent table...
--===== Create and populate the Tally table
-- The cross-join is intentional
SELECT TOP 9999 IDENTITY(SMALLINT,1,1) AS N
INTO dbo.Tally
FROM Master.dbo.SYSCOLUMNS sc1,
Master.dbo.SYSCOLUMNS sc2
--===== Add a primary key to the Tally table for speed
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N)
...Ok... that wasn't so painful... heck, I didn't even use a loop to make the Tally table. It creates a table that contains a column of numbers from 1 to 9999. I use 9999 because I also use the Tally table as a very high speed 4 digit random number generator. For example, the following code generates, say, 100 4 digit random numbers in a heartbeat...
SELECT TOP 100 N
FROM dbo.Tally
WHERE N >= 1000
ORDER BY NEWID()
You can also use it to generate a set of dates... for example, if you wanted a list of the next 12 Fridays, you could do this...
SELECT TOP 12 CONVERT(VARCHAR(10),GETDATE()+N,101)
FROM dbo.Tally
WHERE DATENAME(dw,GETDATE()+N) = 'Friday'
ORDER BY GETDATE()+N
...again, notice that there are no loops in this and my fingers never left my hand.
Ok, now for your problem... I included some test code to make a (temp) table to hold your sample data and parse it as you requested... as someone suggested, you could easily turn it into a table UDF and use it in a FROM clause, but this will probably suffice...
--===== If it exists, drop the test table
IF OBJECT_ID('TempDB..#YourTable') IS NOT NULL
DROP TABLE #YourTable
--===== Create a test table to contain the original posted data
-- plus one more
CREATE TABLE #YourTable
(
SentenceID INT NOT NULL,
Sentence VARCHAR(8000)
)
INSERT INTO #YourTable (SentenceID,Sentence)
SELECT 000001,'How are you today?' UNION ALL
SELECT 000002,'What is your name?' UNION ALL
SELECT 000003,'Now is the time for all good men to come to the aid of their country.'
--===== Add a primary key to the test table just because it's the
-- right thing to do.
ALTER TABLE #YourTable
ADD PRIMARY KEY CLUSTERED (SentenceID)
--===== Now, split the "Sentence" column at the " " and display 1 per
-- row along with its SentenceID number...
SELECT SentenceID,
SUBSTRING(' '+y.Sentence+' ',N+1,CHARINDEX(' ',' '+y.Sentence+' ',N+1)-N-1) AS ParsedData
FROM #YourTable y,
dbo.Tally t
WHERE t.N < LEN(' '+y.Sentence+' ')
AND SUBSTRING(' '+y.Sentence+' ',N,1) =' '
ORDER BY SentenceID
You may run the above with impunity and repeatedly for testing purposes. Here's the output as the code currently stands...
SentenceID ParsedData
----------- ------------
1 How
1 are
1 you
1 today?
2 What
2 is
2 your
2 name?
3 Now
3 is
3 the
3 time
3 for
3 all
3 good
3 men
3 to
3 come
3 to
3 the
3 aid
3 of
3 their
3 country.
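(As mentioned above, the same split can be wrapped in a table UDF so it can be used in a FROM clause. A rough sketch only, with a made-up function name:)
CREATE FUNCTION dbo.fn_TallySplit (@Sentence VARCHAR(7998))
RETURNS TABLE
AS
RETURN
(
SELECT SUBSTRING(' '+@Sentence+' ',N+1,CHARINDEX(' ',' '+@Sentence+' ',N+1)-N-1) AS ParsedData
FROM dbo.Tally
WHERE N < LEN(' '+@Sentence+' ')
AND SUBSTRING(' '+@Sentence+' ',N,1) = ' '
)
GO
--===== Example call
SELECT ParsedData FROM dbo.fn_TallySplit('How are you today?')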
Compared to the use of WHILE loops, Cursors, and other RBAR (my pet name for "Row By Agonizing Row"), the Tally table solution runs lightning quick.
The Tally table can also be used for removing unwanted characters, converting things to proper case, doing all sorts of date tricks, and a lot more... all without the use of relatively slow RBAR methods.
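(For example, a rough sketch of the character-cleanup idea. It relies on the common, though not formally guaranteed, trick of concatenating into a variable one character at a time; the sample string is made up.)
DECLARE @Dirty VARCHAR(100), @Clean VARCHAR(100)
SELECT @Dirty = 'How are you, today?!', @Clean = ''
SELECT @Clean = @Clean + SUBSTRING(@Dirty, N, 1) --keep only letters, digits, and spaces
FROM dbo.Tally
WHERE N <= LEN(@Dirty)
AND SUBSTRING(@Dirty, N, 1) LIKE '[A-Za-z0-9 ]'
ORDER BY N
SELECT @Clean AS Cleaned --returns 'How are you today'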
Any questions as to why you might need a Tally table?
--Jeff Moden
Change is inevitable... Change for the better is not.
March 10, 2006 at 1:12 pm
I'm starting to get the hang of using these tally tables, but I'm a little foggy on the sequence of events (my experience up till now is in purely procedural languages). Could you, for instance, create a column that indicated the word number in addition to the sentence number, so that you would end up with:
SentenceID word position ParsedData
----------- ------------- ------------
1 1 How
1 2 are
1 3 you
1 4 today?
2 1 What
2 2 is
2 3 your
2 4 name?
3 1 Now
3 2 is
3 3 the
3 4 time
3 5 for
3 6 all
3 7 good
3 8 men
3 9 to
3 10 come
3 11 to
3 12 the
3 13 aid
3 14 of
3 15 their
3 16 country.
I've been trying to do this for several days now, and just can't figure it out. Everything I try comes up with either nonsense or the position of the first letter of each word, or... a bunch of other things, but never what I need. My first thought was to use a separate tally table, but soon got confused by how the selects would be set up. Do you have any suggestions?
Carl E. Campbell
nyprehabmed.org
March 10, 2006 at 2:10 pm
For some things, you don't need the tally table
SELECT SentenceID,
datalength(Left(' '+y.Sentence+' ', N)) - datalength(replace(Left(' '+y.Sentence+' ', N), ' ', '')) AS InSentenceNo,
SUBSTRING(' '+y.Sentence+' ',N+1,CHARINDEX(' ',' '+y.Sentence+' ',N+1)-N-1) AS ParsedData
FROM #YourTable y,
dbo.Tally t
WHERE t.N < LEN(' '+y.Sentence+' ')
AND SUBSTRING(' '+y.Sentence+' ',N,1) =' '
ORDER BY SentenceID
Cheers,
* Noel
March 10, 2006 at 9:06 pm
FROM #YourTable y,
dbo.Tally t
Ok... now I'm confused. That is a Tally table in your code, isn't it?
Although the jury is still out on the subject, and for a different reason, I've started some mega-row testing on Tally table functions, and it turns out that a WHILE loop may actually perform better than a Tally table in functions like these... I have more testing to do before I publish the results...
--Jeff Moden
Change is inevitable... Change for the better is not.