January 30, 2012 at 12:45 pm
So we're looking at implementing FTS for a few term type lookup things in one of ours apps.
Of course I remembered that the Thesaurus that comes with SQL Server is essentially empty.
Is there a Thesaurus that can be downloaded for US English so I won't have to sit there with Roget's doing data entry for a few weeks?
January 30, 2012 at 1:15 pm
Google this: Thesaurus Database Download and it will give you this: http://wordnet.princeton.edu/
Jared
CE - Microsoft
January 30, 2012 at 1:30 pm
And when you add SQL Server Full Text onto that search you get an entirely useless set of links π
Thanks... I take it the file is in SQL Server ready to use format?
January 30, 2012 at 1:35 pm
I have no idea, I just googled it and found it. I guess you could have done that for yourself :hehe: Sorry I'm not much help.
Jared
CE - Microsoft
January 30, 2012 at 1:51 pm
here you go Mark;
dunno if this will help, i took a thesaurus file from the Gutenberg project, and bulk inserted it into a table, then built the table of synonyms.
it's probably a bit too much: 30K words with 2.5 million synonyms.but it was one of those proof of concept items i slap together now and then:
the file is here:
--Edit: replaced 24 meg txt with zip:
you'll have to change the path on the bulk insert to match your working folder, but the rest is paste and GO:
IF OBJECT_ID('[dbo].[Thesaurus]') IS NOT NULL
DROP TABLE [dbo].[Thesaurus]
GO
CREATE FUNCTION dbo.DelimitedSplit8K
--===== Define I/O parameters
(
@pString VARCHAR(7999),
@pDelimiter CHAR(1)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
--===== "Inline" CTE Driven "Tally Tableβ produces values up to
-- 10,000... enough to cover VARCHAR(8000)
WITH
E1(N) AS ( --=== Create Ten 1's
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 --10
),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --100
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10,000
cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT N)) FROM E4)
--===== Do the split
SELECT ROW_NUMBER() OVER (ORDER BY N) AS ItemNumber,
SUBSTRING(@pString, N, CHARINDEX(@pDelimiter, @pString + @pDelimiter, N) - N) AS Item
FROM cteTally
WHERE N < LEN(@pString) + 2
AND SUBSTRING(@pDelimiter + @pString, N, 1) = @pDelimiter
;
GO
GO
create table [Thesaurus](
definition varchar(max) )
BULK INSERT [Thesaurus] FROM 'C:\_PA\Workspace\Workspace_Dictionary\Thesaurus\mthesaur.txt'
WITH (
DATAFILETYPE = 'char',
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
FIRSTROW = 1
)
--pulls out the keyword for the thesaurus
select LEFT([definition],CHARINDEX(',',[definition])-1)AsTheKeyWord,[definition] As TheDefinition from [Thesaurus]
--gets the split for a given string
select * from dbo.DelimitedSplit8K('A-bomb,H-bomb,atomic bomb,atomic warhead,clean bomb',',')
--this is erroring out, not getting what i was expecting
select top 500 myAlias.* ,[Thesaurus].*
from (select row_number() over( order by definition) AS RW,definition from [Thesaurus])[Thesaurus]
cross apply dbo.DelimitedSplit8K(Thesaurus.definition,',') myAlias
where RW = 1
--where idx=1
--where myAlias.item = LEFT([definition],CHARINDEX(',',[definition])-1)
CREATE TABLE TheThesaurus(
ThesaurusID int identity(1,1) NOT NULL PRIMARY KEY,
Word varchar(100),
TheDefinition varchar(max) )
CREATE TABLE TheSynonyms(
SynonymsID int identity(1,1) NOT NULL PRIMARY KEY,
ThesaurusID int REFERENCES TheThesaurus(ThesaurusID),
Word varchar(100),
TheDefinition varchar(max) )
insert into TheThesaurus(Word,TheDefinition)
select LEFT([definition],CHARINDEX(',',[definition])-1)As TheKeyWord,[definition] As TheDefinition from [Thesaurus]
INSERT INTO TheSynonyms(ThesaurusID,Word,TheDefinition)
select ThesaurusID,Word,Item
from TheThesaurus
cross apply dbo.DelimitedSplit8K(TheDefinition,',') myAlias
where Word <> Item
and rtrim(Item) <> ''
select top 50000 * from TheSynonyms
Lowell
January 30, 2012 at 1:56 pm
Lowell, with that size thesaurus, did you see a significant performance hit on FTS queries? Or did you test it that far?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 30, 2012 at 2:24 pm
didn't go that far as to play with Full Text.;
it was purely from a Text-Dictionary to SQL Server Table project I fiddled with a few years ago; Full text wasn't part of the project at the time,w hich is what makes me wonder if it will help Mark, but it was omething I at least had handy in my scripts.
Lowell
January 30, 2012 at 2:27 pm
Thanks.
I'm about to implement a fairly large first-names thesaurus, and was curious if your experience covered performance hits. I'll be testing it, of course, but data from others with prior experience is always good.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 30, 2012 at 2:40 pm
It's pretty cool.. but it only seems to load one row from the file.
February 7, 2012 at 3:42 am
I think thats due to the ROWTERMINATOR parameter not being supplied to the BULK INSERT procedure. (probably because WYSIWYG not displaying the character \ n)
Try something like:
BULK INSERT [Thesaurus] FROM 'C:\mthesaur.txt'
WITH (
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\ n'
)
You'll need to remove the space between the '\' and the 'n' in the ROWTERMINATOR.
Good post guys, thanks for lending a hand!
-----------------
... Then again, I could be totally wrong! Check the answer.
Check out posting guidelines here for faster more precise answers[/url].
I believe in Codd
... and Thinknook is my Chamber of Understanding
February 7, 2012 at 9:03 am
Actually it was the commented section in the query that says "this errors out and isn't what I was expecting"
I originally just changed the path to the file and ran it... and it errored out.
Running it step by step worked... Now to see if TPTB really want to impliment a full english language thesaurus, like they originally stated... probably not π
Which of course moves back to figuring out terms and determining synonymns within the business... fun fun... getting management's views regarding this sort of stuff is generally like pulling teeth.
August 17, 2015 at 4:55 pm
Hi,
I tried to download the mthesaurs at the link provided, but got error file not found or unavailable. Is it possible you could provide a link to me?
Thank you so much!!
Lyle
August 18, 2015 at 5:35 am
lyleholc (8/17/2015)
Hi,I tried to download the mthesaurs at the link provided, but got error file not found or unavailable. Is it possible you could provide a link to me?
Thank you so much!!
Lyle
I updated my old post with a corrected path, but repalced the old txt file with a zip, and also retested my example code, which still works perfectly, although a little slow for my taste.
for a one time ETL, it's fine.
http://www.stormrage.com/SQLStuff/mthesaur.zip
;
Lowell
May 4, 2019 at 6:38 am
The file seems to come from a 1911 Roget's Thesaurus compiled by the Project Gutenberg team. Here's one place I found the unzipped version of that file.
https://github.com/statico/dotfiles/blob/master/.vim/mthes10
It is not in tsenu.xml format, but it seems straightforward to convert it. The key gotcha will be unicode byte ordering as mentioned in https://docs.microsoft.com/en-us/sql/relational-databases/search/configure-and-manage-thesaurus-files-for-full-text-search?view=sql-server-2017
March 29, 2021 at 9:38 am
Hi,
I tried to download the thesaurus, but the latest link http://www.stormrage.com/SQLStuff/mthesaur.zip leads to a 404 error.
Is it possible to update it or send the thesaurus to me privately via email?
Thank you!
Viewing 15 posts - 1 through 14 (of 14 total)
You must be logged in to reply to this topic. Login to reply