Count the occurrence of word in a Resume Table

  • I have a table called Resume with columns in Text, Word and PDF format. I enabled Full Text Search on the table and able to search the keyword through three columns and also able to rank the profile according to the keyword search .. Used below query snippet for the same:

    SELECT ct., ct.[RANK], br.ProfileID ,br.TextVersion, k.KeyWordName

    FROM CONTAINSTABLE( dbo.tblResumeTag, TextVersion , @SearchPhrase) as ct

    INNER JOIN dbo.tblResumeTag

    ON ct. = br.ID

    INNER JOIN dbo.tblKeywords k

    ON br.TextVersion LIKE '%' + k.KeyWordName + '%'

    AND k.KeyWordName LIKE '%' + @SearchPhrase + '%'

    ORDER BY ct.[RANK] DESC

    Now I want show the occurence(Count ) of the @search phrase in Textversion col, which I am struggling to return the exact count. Any suggestions will be helpful.

    Thank you

  • In SQL 2016 you can use the STRING_SPLIT() function to convert your long string value to a series of individual words and then use standard COUNT(*) GROUP BY to get your totals. In the code below, a couple of REPLACE functions were needed so that commas and periods would not be included as part of a word.

    d eclare @treatise varchar(max) = ''

    d eclare @x int = 0

    while @x < 10000

    begin

    set @x = @x + 1

    set @treatise = @treatise + 'Now is the time for all good men to come to the aid of their country. It is of the greatest importance. Man up, men, for fate awaits your country. '

    end

    ;with cte as (select replace(replace(value,'.',''),',','') as Word from string_split(@treatise,' ') )

    select word, count(*)

    from cte

    where word > ''

    group by word

    order by count(*) desc

    Sorry about the D ECLARE, but I couldn't submit the code without making the change. In your code, you should substitute Word > '' in the WHERE clause with Word LIKE your keyword search strings.

    Looking at your example, it seems that you want to enter the word BOOK, and if the words BOOKEND and BOOKWORM are in your keyword table, you want to find them, but not words like BOOKING. Is this correct?

    Finally I have to ask how big your tables are going to get. I think you are going to take a big performance hit from your joins if you are running this against a large number of rows. If you are going to have hundreds of texts needing to be evaluated, you should first filter them down to WHERE br.TextVersion like @Searchphrase. (If your search phrase isn't in the text, none of the keywords you want will be in the text.) At the very least try producing the list of keywords you want separately with a cte or derived table, then join ON br.TextVersion LIKE '%' + k.KeyWordName + '%'. JOIN clauses should be kept simple as possible with filtering pushed out to the WHERE clause.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Thank you so much !! Appreciate your help. Yeah got it right I want to find words with '%BOOK% ' . The table does have approx 20,000 rows.

  • I am using SQL server 2014 .. Apologies for posting in wrong section.

    I don't understand the use of 'value' ... what does that stands for ?

    with cte as (select replace(replace(value,'.',''),',','') as Word from string_split(@treatise,' ') )

    Thank you

  • shambhavi13 (8/3/2016)


    I have a table called Resume with columns in Text, Word and PDF format. I enabled Full Text Search on the table and able to search the keyword through three columns and also able to rank the profile according to the keyword search .. Used below query snippet for the same:

    SELECT ct., ct.[RANK], br.ProfileID ,br.TextVersion, k.KeyWordName

    FROM CONTAINSTABLE( dbo.tblResumeTag, TextVersion , @SearchPhrase) as ct

    INNER JOIN dbo.tblResumeTag

    ON ct. = br.ID

    INNER JOIN dbo.tblKeywords k

    ON br.TextVersion LIKE '%' + k.KeyWordName + '%'

    AND k.KeyWordName LIKE '%' + @SearchPhrase + '%'

    ORDER BY ct.[RANK] DESC

    Now I want show the occurence(Count ) of the @search phrase in Textversion col, which I am struggling to return the exact count. Any suggestions will be helpful.

    Thank you

    As a bit of a sidebar, I think that this is a really poor way to evaluate resumes. If someone were looking for "SSIS" as a key word, I could type "I hate SSIS" in obfuscated text 100 times and this method would pick me as a leader for "SSIS" skills just based on a word count.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • shambhavi13 (8/3/2016)


    I have a table called Resume with columns in Text, Word and PDF format. I enabled Full Text Search on the table and able to search the keyword through three columns and also able to rank the profile according to the keyword search .. Used below query snippet for the same:

    SELECT ct., ct.[RANK], br.ProfileID ,br.TextVersion, k.KeyWordName

    FROM CONTAINSTABLE( dbo.tblResumeTag, TextVersion , @SearchPhrase) as ct

    INNER JOIN dbo.tblResumeTag

    ON ct. = br.ID

    INNER JOIN dbo.tblKeywords k

    ON br.TextVersion LIKE '%' + k.KeyWordName + '%'

    AND k.KeyWordName LIKE '%' + @SearchPhrase + '%'

    ORDER BY ct.[RANK] DESC

    Now I want show the occurence(Count ) of the @search phrase in Textversion col, which I am struggling to return the exact count. Any suggestions will be helpful.

    Thank you

    See this article: Nasty Fast N-Grams (Part 1)[/url]. Specifically the section titled, "basic text analytics on resumes stored in a table".

    This is exactly the kind of thing that a good NGrams function is for.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Thank you . I will look into the same.

  • shambhavi13 (8/3/2016)


    I am using SQL server 2014 .. Apologies for posting in wrong section.

    I don't understand the use of 'value' ... what does that stands for ?

    with cte as (select replace(replace(value,'.',''),',','') as Word from string_split(@treatise,' ') )

    Thank you

    Value is simply the name of the column returned by the STRING_SPLIT() function.

    If you aren't on 2016 yet, you should go HERE[/url] to get a copy of Jeff Moden's excellent delimited splitter function. ( By the way, Jeff. Would you trust any paper resume' nowadays? )

    Also, with respect to the N-grams. All you are asking for is unigrams (in your example, individual words), and the rest of it is probably beyond the scope of your current problem.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Also, with respect to the N-grams. All you are asking for is unigrams (in your example, individual words), and the rest of it is probably beyond the scope of your current problem.

    In this case it would be <size of the search word>-grams...

    -- Resume Table

    DECLARE @tblResumeTag TABLE ( int identity primary key, TextVersion varchar(8000));

    INSERT @tblResumeTag(TextVersion) VALUES

    ('Joe Doe - Objective: To use my many years of SQL experience... blah blah SQL...

    blah SQL this & SQL that... SSIS.. SSIS.. SSIS..'),

    ('John Green, PMP, MSM - SQL Guru... yada yada SQL... blah blah blah... SSRS... SSAS'),

    ('Sue Jones Objective: To utilize my awesome skills and experience... yada SQL...

    blah SQL this & SQL that... more experience with SQL... SSIS... SSRS');

    -- Keywords table

    DECLARE @keywords TABLE (KeyWordName varchar(100));

    INSERT @keywords VALUES('SQL'),('SSIS'),('SSRS');

    -- Getting all instances of the KeyWordName

    SELECT , TextVersion, KeyWordName

    FROM @tblResumeTag c

    CROSS JOIN @keywords k

    CROSS APPLY dbo.NGrams8k(c.TextVersion, LEN(k.KeyWordName))

    WHERE token = k.KeyWordName

    ORDER BY ; -- order by for display only, not required

    -- To group by each keyword found

    SELECT , KeyWordName, KWSubTotal = COUNT(*)

    FROM @tblResumeTag c

    CROSS JOIN @keywords k

    CROSS APPLY dbo.NGrams8k(c.TextVersion, LEN(k.KeyWordName))

    WHERE token = k.KeyWordName

    GROUP BY , KeyWordName;

    -- To group by number of all keywords found

    SELECT , KWSubTotal = COUNT(*)

    FROM @tblResumeTag c

    CROSS JOIN @keywords k

    CROSS APPLY dbo.NGrams8k(c.TextVersion, LEN(k.KeyWordName))

    WHERE token = k.KeyWordName

    GROUP BY ;

    And doing it like this might outperform a splitter-based approach.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • I've read the Nasty Fast N-Grams article and the approach is very similar to Jeff's DelimitedSplit8K() function. The primary difference being that the N-grams would be fixed length strings starting at every Nth character of the input string while the words produced by a delimited split would be of varying length, using space(1) as a delimiter.

    By using fixed length strings the N-gram approach avoids having to use charindex to find the end of the word. That means each fixed-length N-gram requires slightly fewer CPU cycles to identify than would a word. But a much larger number of N-gram rows are being produced, as a opposed to word rows. These will have to be filtered which would add some cycles back. I'm sure the size of the input string(s) is important to the relative performance, but I haven't tested both techniques.

    How would the N-gram handle a situation where the initial search identifies BOOKWORM and BOOKEND as two keywords to match? They have different lengths. Would two passes be necessary? I'm genuinely curious because, although the benefits of bigrams, trigrams, etc. are clear, the benefits of unigrams over words in this instance is muddy (to me).

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • ALan....consider the following please

    DECLARE @tblResumeTag TABLE ( int identity primary key, TextVersion varchar(8000));

    INSERT @tblResumeTag(TextVersion) VALUES

    ('I accelerated the development of the system by making rationale decisions based on my collaboration with senior corporate managers')

    DECLARE @keywords TABLE (KeyWordName varchar(100));

    INSERT @keywords VALUES('Rat')

    SELECT , TextVersion, KeyWordName

    FROM @tblResumeTag c

    CROSS JOIN @keywords k

    CROSS APPLY dbo.NGrams8k(c.TextVersion, LEN(k.KeyWordName))

    WHERE token = k.KeyWordName

    is there a way to use your ngrams for whole words?

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • If you were looking for 'Rat' (the word) you could look for Ngrams like '% RAT %'. You'd still have to use REPLACE to handle punctuation marks. The problem I see is that multiple N-grams would be produced.

    For example if we were doing unigrams of length 7.

    'He is a rat Mugsy.' becomes

    'He is a'

    'e is a '

    ' s a ra'

    's a rat'

    ' a rat '

    'a rat M'

    ' rat MU'

    etc

    The last three of these would satisfy LIKE '% RAT %'.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Thank you all for your help. But as I was working on the queries I realized I have three columns in my table namely Text, Word and PDF Resumes .. I want to display the candidate rank as an avg of all the three columns . Any suggestions will be helpful. And also currently I am using if else statement to choose the column is there any better way so that I can search all three columns at once ?

  • This kind of begs the question... If I had a resume like

    SQL Developer 2<24 months>

    Company ABC

    SSMS, SSRS, SSAS

    SQL Developer 1<36 months>

    Company XYZ

    SMS, SSRS

    how would you count my experience? doing word counts seems kind of a simplistic metric. would it not be better to do {skill, time period} at the very least?

  • You are absolutely right... but the idea here is just to tag the candidate with skills like JAVA(6), C++(5) .. where 6, 5 is the number of keyword occurrence.

Viewing 15 posts - 1 through 15 (of 27 total)

You must be logged in to reply to this topic. Login to reply