September 17, 2010 at 5:32 am
September 17, 2010 at 7:01 am
Richard, it's fairly straightforward, but it involves a combination of techniques.
If it's a file, you need to use bcp or BULK INSERT to get the text into a table.
Then you need a split function to parse each row into words.
After that it's simply a GROUP BY to review the results.
Here's a working example I just put together, which assumes you are bulk inserting a file:
CREATE FUNCTION dbo.DelimitedSplit8K
/*
Revision History:
Rev 00 - 20 Jan 2010 - Concept: Lynn Pettis and others.
Redaction/Implementation: Jeff Moden
- Base 10 redaction and reduction for CTE. (Total rewrite)
Rev 01 - 13 Mar 2010 - Jeff Moden
- Removed one additional concatenation and one subtraction from the SUBSTRING in the
SELECT List for that tiny bit of extra speed.
Rev 02 - 14 Apr 2010 - Jeff Moden
- No code changes. Added CROSS APPLY usage example to the header, some additional credits,
and extra documentation.
Rev 03 - 18 Apr 2010 - Jeff Moden
- No code changes. Added notes 7, 8, and 9 about certain "optimizations" that don't
actually work for this type of function.
Rev 04 - 29 Jun 2010 - Jeff Moden
- Added WITH SCHEMABINDING thanks to a note by Paul White. This prevents an unnecessary
"Table Spool" when the function is used in an UPDATE statement even though the function
makes no external references.
***************************************************************************************************/
--===== Define I/O parameters
(
@pString VARCHAR(7999),
@pDelimiter CHAR(1)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
--===== "Inline" CTE Driven "Tally Table” produces values up to
-- 10,000... enough to cover VARCHAR(8000)
WITH
E1(N) AS ( --=== Create Ten 1's
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 --10
),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --100
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10,000
cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT N)) FROM E4)
--===== Do the split
SELECT ROW_NUMBER() OVER (ORDER BY N) AS ItemNumber,
SUBSTRING(@pString, N, CHARINDEX(@pDelimiter, @pString + @pDelimiter, N) - N) AS Item
FROM cteTally
WHERE N < LEN(@pString) + 2
AND SUBSTRING(@pDelimiter + @pString, N, 1) = @pDelimiter
;
GO
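--you can sanity-check the split function on its own first if you like;
--the sample string here is just something I made up for a quick test:
SELECT ItemNumber, Item
FROM dbo.DelimitedSplit8K('counting words is just a split and a group by', ' ');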
--assuming the file is on the C:\ drive and the SQL account has access to it.
CREATE TABLE BULKWORDS(RAWDATA VARCHAR(8000))
BULK INSERT BULKWORDS FROM 'c:\sp_find.sql'
WITH (
DATAFILETYPE = 'char',
--tilde assumed to not exist in the file, so we get only one column
FIELDTERMINATOR = '~',
ROWTERMINATOR = '\n',
FIRSTROW = 1
)
--preview the raw data?
select * from BULKWORDS
--create a destination table
CREATE TABLE WordCounter(TheWord varchar(255) )
--get the data
INSERT INTO WordCounter
select
TheFunction.Item
FROM BULKWORDS
cross apply dbo.DelimitedSplit8K(RAWDATA,' ') TheFunction
--finally, look at the word counts
SELECT TheWord, COUNT(TheWord) as TheCount
FROM WordCounter
--ignore whitespace that was captured
WHERE LTRIM(TheWord) <> ''
GROUP BY TheWord
ORDER BY TheCount DESC
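--if you want to rerun the whole example from scratch, the two work tables
--created above can simply be dropped afterwards:
DROP TABLE WordCounter;
DROP TABLE BULKWORDS;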
Lowell
September 17, 2010 at 7:56 am
September 18, 2010 at 4:32 pm
Nicely done, Lowell. 🙂
--Jeff Moden
Change is inevitable... Change for the better is not.