/* download the test text file Moby-Dyck.zip here */
/* You might notice that I refer to Moby **** in this blog. This is because the busybodyish nanny software will not allow me to refer to my test data, the 1851 novel by Herman Melville. It reminds me of when the Royal Society for the Protection of Birds used Community Server for their website: they had to refer to a 'Male Bird' rather than a Cock. It is like a scene from a novel by Philip K. Dick.*/
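/* The code below assumes a 'helper' table of integers called 'numbers'.
   Here is a minimal sketch of one way to build one (a cross-join of a
   system catalogue view; Anith's article, linked further down, describes
   several other methods). The million-row size is my assumption: it just
   needs to exceed the length of the longest string you intend to parse. */
IF OBJECT_ID(N'numbers') IS NULL
  BEGIN
  CREATE TABLE numbers (number INT NOT NULL PRIMARY KEY)
  INSERT INTO numbers (number)
    SELECT TOP 1000000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
      FROM sys.all_columns a CROSS JOIN sys.all_columns b
  END
GO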
DECLARE @LotsOfText VARCHAR(MAX)
SELECT @LotsOfText='
It surprises me how often I need to chop up text into its individual words, paragraphs or lines. I suppose I do a lot of parsing of difficult input to get it into a database. I also like to show off by doing really good searches of data, which requires an inversion table.
Everyone must know how to chop up delimited strings into individual fields, but chopping text into its individual words requires a rather different technique. There are no consistent delimiters, just whitespace. Actually, you can think of slicing text into words as an extreme form of parsing delimited strings.
Here is a routine that has the advantage that it does not need to be encapsulated into a function, although it doesn''t mind being used in a function. It is also very rapid.
The derived table (g) is simply the index into the string of all the places where there is a transition from whitespace to word-characters; in other words, the start of each word. After that, all you have to do is find the end of the word and you are home and dry!
Characters such as '' or - are allowed within words such as double-decker or shan''t, but are treated as whitespace if outside a word. We start with what I guess is a pretty simple routine using a ''helper'' table called ''numbers'', which you''ll just have to set up using one of Anith''s methods described in his ''Faking Arrays in Transact SQL'': http://www.simple-talk.com/sql/t-sql-programming/faking-arrays-in-transact-sql/ '
SELECT SUBSTRING(@LotsOfText, start,    -- slice out each word: PATINDEX finds the first character
       PATINDEX('%[^A-Z''a-z0-9-]%',    -- past the word's end (the appended space is a sentinel),
                SUBSTRING(@LotsOfText, start, 50) + ' ') - 1)  -- and -1 turns that into the word's length
FROM
  (-- the derived table g: the index of every whitespace-to-word transition, i.e. each word's start
  SELECT [start] = number
    FROM numbers
    WHERE number < LEN(@LotsOfText) + 1
      -- a non-word character followed by a word character...
      -- (the hyphen goes last in each class so that it is taken literally, not as a range)
      AND (PATINDEX('[^A-Za-z0-9''-][A-Za-z0-9]', SUBSTRING(' ' + @LotsOfText, number, 2)) = 1
      -- ...or the two with an opening quote between them
        OR PATINDEX('[^A-Za-z0-9"''-][''"][A-Za-z0-9]', SUBSTRING(' ' + @LotsOfText, number, 3)) = 1)
  ) g
ORDER BY start
/* Of course, you'd usually use this sort of routine as a function. */
GO
/*------------------------------------------------------------*/
IF OBJECT_ID(N'WordChop') IS NOT NULL DROP FUNCTION WordChop
GO
CREATE FUNCTION [dbo].[WordChop]
(
@string VARCHAR(MAX)
)
RETURNS @Results TABLE
(
Item VARCHAR(255)
)
AS
BEGIN
  INSERT INTO @Results (Item)
  SELECT SUBSTRING(@string, start,      -- slice out each word...
         PATINDEX('%[^A-Z''a-z0-9-]%',  -- ...up to the first character past its end
                  SUBSTRING(@string, start, 50) + ' ') - 1)
  FROM
    (-- g: the start position of every word in the string
    SELECT [start] = number
      FROM numbers
      WHERE number < LEN(@string) + 1
        AND (PATINDEX('[^A-Za-z0-9''-][A-Za-z0-9]', SUBSTRING(' ' + @string, number, 2)) = 1
          OR PATINDEX('[^A-Za-z0-9"''-][''"][A-Za-z0-9]', SUBSTRING(' ' + @string, number, 3)) = 1)
    ) g
  ORDER BY start
  RETURN
END
GO
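/* A quick sanity check, on a hypothetical test string of my own, before we
   do anything ambitious. The quote and the hyphen should survive inside
   words, as promised above: */
SELECT Item FROM dbo.WordChop('The double-decker shan''t stop here')
/* should return five rows: The, double-decker, shan't, stop, here */
GO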
/* ... but is it really faster than a simple iterative solution? */
/*------------------------------------------------------------*/
IF OBJECT_ID(N'IterativeWordChop') IS NOT NULL DROP FUNCTION IterativeWordChop
GO
CREATE FUNCTION [dbo].[IterativeWordChop]
(
@string VARCHAR(MAX)
)
RETURNS @Results TABLE
(
Item VARCHAR(255)
)
AS
BEGIN
  DECLARE @Len INT, @Start INT, @Cursor INT, @length INT
  SELECT @Cursor = 1, @Len = LEN(@string)
  WHILE @Cursor < @Len
    BEGIN
    -- the offset of the next whitespace-to-word transition in the window
    SELECT @Start = PATINDEX('%[^A-Za-z0-9][A-Za-z0-9]%',
                             ' ' + SUBSTRING(@string, @Cursor, 50)) - 1
    -- the word's length: scan from its second character for the first non-word character
    SELECT @length = PATINDEX('%[^A-Z''a-z0-9-]%',
                              SUBSTRING(@string, @Cursor + @Start + 1, 50) + ' ')
    INSERT INTO @Results (Item)
      SELECT SUBSTRING(@string, @Cursor + @Start, @length)
    -- move the cursor past the word just harvested
    SELECT @Cursor = @Cursor + @Start + @length + 1
    END
  RETURN
END
GO
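/* Both functions should agree on something small before we time them on
   something large (again, a made-up test sentence of mine): */
SELECT Item FROM dbo.WordChop('Call me Ishmael')          -- Call, me, Ishmael
SELECT Item FROM dbo.IterativeWordChop('Call me Ishmael') -- Call, me, Ishmael
GO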
/* Now we have two very similar functions, one of which is iterative, and the other of which uses a number 'helper' table. Which is the more efficient? Which is the faster? From the RBAR school of thought, the answer would seem obvious: the number table is always going to win. We can easily put this to the test. Let's put in the lovely first chapter of Moby Dick.*/
DECLARE @LotsOfText VARCHAR(MAX)
SELECT @LotsOfText = BulkColumn
FROM OPENROWSET(BULK 'C:\workbench\moby-dick-C1.txt', SINGLE_BLOB) AS x
-- (the OPENROWSET BULK syntax is SQL Server 2005/2008 only)
SET STATISTICS time ON
SELECT 'Do it the RBAR way'
SELECT COUNT(*) FROM dbo.IterativeWordChop(@LotsOfText) -- (2220 words)
/* SQL Server Execution Times:
CPU time = 468 ms, elapsed time = 791 ms.*/
SELECT 'and now the number-table way'
SELECT COUNT(*) FROM dbo.WordChop(@LotsOfText) ---------- (2219 words)
SET STATISTICS time OFF
/* SQL Server Execution Times:
CPU time = 94 ms, elapsed time = 112 ms.*/
GO
/* Let's put in the entire text of Moby Dick. Prepare for a surprise. (Remember: the OPENROWSET BULK code is SQL Server 2005/2008 only.)*/
DECLARE @LotsOfText VARCHAR(MAX)
SELECT @LotsOfText = BulkColumn
FROM OPENROWSET(BULK 'C:\workbench\moby-dick.txt', SINGLE_BLOB) AS x
SELECT COUNT(*) FROM dbo.IterativeWordChop(@LotsOfText) -- 26 seconds (214730 words)
SELECT COUNT(*) FROM dbo.WordChop(@LotsOfText) ---------- 48 seconds (214243 words)
/*
Word thinks it is 216,381 words, and it took just six seconds to count them!
Yes, one algorithm scales more nearly linearly than the other, and RBAR wins in this case. (My guess at the culprit: the number-table version probes every character position of the VARCHAR(MAX) with SUBSTRING and PATINDEX, and that cost grows faster than linearly with the length of the string.)*/
/*
Conclusion?
Believe nobody. Test everything. Try your algorithm under the sort of loads you expect in your application, and remember that anyone who says WHILE loops or cursors are always less efficient is simplifying a grey truth. For your piddly little databases, though, you should never bet on RBAR. */