Count the occurrence of word in a Resume Table

Question

Count the occurrence of word in a Resume Table

shambhavi13

SSC Veteran

Points: 286
More actions
August 3, 2016 at 11:10 am

#310213

I have a table called Resume with columns in Text, Word and PDF format. I enabled Full Text Search on the table and able to search the keyword through three columns and also able to rank the profile according to the keyword search .. Used below query snippet for the same:
SELECT ct., ct.[RANK], br.ProfileID ,br.TextVersion, k.KeyWordName
FROM CONTAINSTABLE( dbo.tblResumeTag, TextVersion , @SearchPhrase) as ct
INNER JOIN dbo.tblResumeTag
ON ct. = br.ID
INNER JOIN dbo.tblKeywords k
ON br.TextVersion LIKE '%' + k.KeyWordName + '%'
AND k.KeyWordName LIKE '%' + @SearchPhrase + '%'
ORDER BY ct.[RANK] DESC
Now I want show the occurence(Count ) of the @search phrase in Textversion col, which I am struggling to return the exact count. Any suggestions will be helpful.
Thank you

Viewing 15 posts - 1 through 15 (of 27 total)

You must be logged in to reply to this topic. Login to reply

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 1

In SQL 2016 you can use the STRING_SPLIT() function to convert your long string value to a series of individual words and then use standard COUNT(*) GROUP BY to get your totals. In the code below, a couple of REPLACE functions were needed so that commas and periods would not be included as part of a word.

d eclare @treatise varchar(max) = ''

d eclare @x int = 0

while @x < 10000

begin

set @x = @x + 1

set @treatise = @treatise + 'Now is the time for all good men to come to the aid of their country. It is of the greatest importance. Man up, men, for fate awaits your country. '

end

;with cte as (select replace(replace(value,'.',''),',','') as Word from string_split(@treatise,' ') )

select word, count(*)

from cte

where word > ''

group by word

order by count(*) desc

Sorry about the D ECLARE, but I couldn't submit the code without making the change. In your code, you should substitute Word > '' in the WHERE clause with Word LIKE your keyword search strings.

Looking at your example, it seems that you want to enter the word BOOK, and if the words BOOKEND and BOOKWORM are in your keyword table, you want to find them, but not words like BOOKING. Is this correct?

Finally I have to ask how big your tables are going to get. I think you are going to take a big performance hit from your joins if you are running this against a large number of rows. If you are going to have hundreds of texts needing to be evaluated, you should first filter them down to WHERE br.TextVersion like @Searchphrase. (If your search phrase isn't in the text, none of the keywords you want will be in the text.) At the very least try producing the list of keywords you want separately with a cte or derived table, then join ON br.TextVersion LIKE '%' + k.KeyWordName + '%'. JOIN clauses should be kept simple as possible with filtering pushed out to the WHERE clause.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

shambhavi13 SSC Veteran Points: 286 More actions · Answer 2

Thank you so much !! Appreciate your help. Yeah got it right I want to find words with '%BOOK% ' . The table does have approx 20,000 rows.

shambhavi13 SSC Veteran Points: 286 More actions · Answer 3

I am using SQL server 2014 .. Apologies for posting in wrong section.

I don't understand the use of 'value' ... what does that stands for ?

with cte as (select replace(replace(value,'.',''),',','') as Word from string_split(@treatise,' ') )

Thank you

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 4

shambhavi13 (8/3/2016)
I have a table called Resume with columns in Text, Word and PDF format. I enabled Full Text Search on the table and able to search the keyword through three columns and also able to rank the profile according to the keyword search .. Used below query snippet for the same:
SELECT ct., ct.[RANK], br.ProfileID ,br.TextVersion, k.KeyWordName
FROM CONTAINSTABLE( dbo.tblResumeTag, TextVersion , @SearchPhrase) as ct
INNER JOIN dbo.tblResumeTag
ON ct. = br.ID
INNER JOIN dbo.tblKeywords k
ON br.TextVersion LIKE '%' + k.KeyWordName + '%'
AND k.KeyWordName LIKE '%' + @SearchPhrase + '%'
ORDER BY ct.[RANK] DESC
Now I want show the occurence(Count ) of the @search phrase in Textversion col, which I am struggling to return the exact count. Any suggestions will be helpful.
Thank you

As a bit of a sidebar, I think that this is a really poor way to evaluate resumes. If someone were looking for "SSIS" as a key word, I could type "I hate SSIS" in obfuscated text 100 times and this method would pick me as a leader for "SSIS" skills just based on a word count.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Alan Burstein SSC Guru Points: 61141 More actions · Answer 5

shambhavi13 (8/3/2016)
I have a table called Resume with columns in Text, Word and PDF format. I enabled Full Text Search on the table and able to search the keyword through three columns and also able to rank the profile according to the keyword search .. Used below query snippet for the same:
SELECT ct., ct.[RANK], br.ProfileID ,br.TextVersion, k.KeyWordName
FROM CONTAINSTABLE( dbo.tblResumeTag, TextVersion , @SearchPhrase) as ct
INNER JOIN dbo.tblResumeTag
ON ct. = br.ID
INNER JOIN dbo.tblKeywords k
ON br.TextVersion LIKE '%' + k.KeyWordName + '%'
AND k.KeyWordName LIKE '%' + @SearchPhrase + '%'
ORDER BY ct.[RANK] DESC
Now I want show the occurence(Count ) of the @search phrase in Textversion col, which I am struggling to return the exact count. Any suggestions will be helpful.
Thank you

See this article: Nasty Fast N-Grams (Part 1)[/url]. Specifically the section titled, "basic text analytics on resumes stored in a table".

This is exactly the kind of thing that a good NGrams function is for.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

shambhavi13 SSC Veteran Points: 286 More actions · Answer 6

shambhavi13

SSC Veteran

Points: 286

August 4, 2016 at 9:44 am

#1893852

Thank you . I will look into the same.

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 7

shambhavi13 (8/3/2016)
I am using SQL server 2014 .. Apologies for posting in wrong section.
I don't understand the use of 'value' ... what does that stands for ?
with cte as (select replace(replace(value,'.',''),',','') as Word from string_split(@treatise,' ') )
Thank you

Value is simply the name of the column returned by the STRING_SPLIT() function.

If you aren't on 2016 yet, you should go HERE[/url] to get a copy of Jeff Moden's excellent delimited splitter function. ( By the way, Jeff. Would you trust any paper resume' nowadays? )

Also, with respect to the N-grams. All you are asking for is unigrams (in your example, individual words), and the rest of it is probably beyond the scope of your current problem.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Alan Burstein SSC Guru Points: 61141 More actions · Answer 8

Also, with respect to the N-grams. All you are asking for is unigrams (in your example, individual words), and the rest of it is probably beyond the scope of your current problem.

In this case it would be <size of the search word>-grams...

-- Resume Table

DECLARE @tblResumeTag TABLE ( int identity primary key, TextVersion varchar(8000));

INSERT @tblResumeTag(TextVersion) VALUES

('Joe Doe - Objective: To use my many years of SQL experience... blah blah SQL...

blah SQL this & SQL that... SSIS.. SSIS.. SSIS..'),

('John Green, PMP, MSM - SQL Guru... yada yada SQL... blah blah blah... SSRS... SSAS'),

('Sue Jones Objective: To utilize my awesome skills and experience... yada SQL...

blah SQL this & SQL that... more experience with SQL... SSIS... SSRS');

-- Keywords table

DECLARE @keywords TABLE (KeyWordName varchar(100));

INSERT @keywords VALUES('SQL'),('SSIS'),('SSRS');

-- Getting all instances of the KeyWordName

SELECT , TextVersion, KeyWordName

FROM @tblResumeTag c

CROSS JOIN @keywords k

CROSS APPLY dbo.NGrams8k(c.TextVersion, LEN(k.KeyWordName))

WHERE token = k.KeyWordName

ORDER BY ; -- order by for display only, not required

-- To group by each keyword found

SELECT , KeyWordName, KWSubTotal = COUNT(*)

FROM @tblResumeTag c

CROSS JOIN @keywords k

CROSS APPLY dbo.NGrams8k(c.TextVersion, LEN(k.KeyWordName))

WHERE token = k.KeyWordName

GROUP BY , KeyWordName;

-- To group by number of all keywords found

SELECT , KWSubTotal = COUNT(*)

FROM @tblResumeTag c

CROSS JOIN @keywords k

CROSS APPLY dbo.NGrams8k(c.TextVersion, LEN(k.KeyWordName))

WHERE token = k.KeyWordName

GROUP BY ;

And doing it like this might outperform a splitter-based approach.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 9

I've read the Nasty Fast N-Grams article and the approach is very similar to Jeff's DelimitedSplit8K() function. The primary difference being that the N-grams would be fixed length strings starting at every Nth character of the input string while the words produced by a delimited split would be of varying length, using space(1) as a delimiter.

By using fixed length strings the N-gram approach avoids having to use charindex to find the end of the word. That means each fixed-length N-gram requires slightly fewer CPU cycles to identify than would a word. But a much larger number of N-gram rows are being produced, as a opposed to word rows. These will have to be filtered which would add some cycles back. I'm sure the size of the input string(s) is important to the relative performance, but I haven't tested both techniques.

How would the N-gram handle a situation where the initial search identifies BOOKWORM and BOOKEND as two keywords to match? They have different lengths. Would two passes be necessary? I'm genuinely curious because, although the benefits of bigrams, trigrams, etc. are clear, the benefits of unigrams over words in this instance is muddy (to me).

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 10

ALan....consider the following please

DECLARE @tblResumeTag TABLE ( int identity primary key, TextVersion varchar(8000));

INSERT @tblResumeTag(TextVersion) VALUES

('I accelerated the development of the system by making rationale decisions based on my collaboration with senior corporate managers')

DECLARE @keywords TABLE (KeyWordName varchar(100));

INSERT @keywords VALUES('Rat')

SELECT , TextVersion, KeyWordName

FROM @tblResumeTag c

CROSS JOIN @keywords k

CROSS APPLY dbo.NGrams8k(c.TextVersion, LEN(k.KeyWordName))

WHERE token = k.KeyWordName

is there a way to use your ngrams for whole words?

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 11

If you were looking for 'Rat' (the word) you could look for Ngrams like '% RAT %'. You'd still have to use REPLACE to handle punctuation marks. The problem I see is that multiple N-grams would be produced.

For example if we were doing unigrams of length 7.

'He is a rat Mugsy.' becomes

'He is a'

'e is a '

' s a ra'

's a rat'

' a rat '

'a rat M'

' rat MU'

etc

The last three of these would satisfy LIKE '% RAT %'.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

shambhavi13 SSC Veteran Points: 286 More actions · Answer 12

Thank you all for your help. But as I was working on the queries I realized I have three columns in my table namely Text, Word and PDF Resumes .. I want to display the candidate rank as an avg of all the three columns . Any suggestions will be helpful. And also currently I am using if else statement to choose the column is there any better way so that I can search all three columns at once ?

pietlinden SSC Guru Points: 63506 More actions · Answer 13

This kind of begs the question... If I had a resume like

SQL Developer 2<24 months>

Company ABC

SSMS, SSRS, SSAS

SQL Developer 1<36 months>

Company XYZ

SMS, SSRS

how would you count my experience? doing word counts seems kind of a simplistic metric. would it not be better to do {skill, time period} at the very least?

shambhavi13 SSC Veteran Points: 286 More actions · Answer 14

You are absolutely right... but the idea here is just to tag the candidate with skills like JAVA(6), C++(5) .. where 6, 5 is the number of keyword occurrence.