Count the occurrence of word in a Resume Table

  • shambhavi13 (8/8/2016)


You are absolutely right... but the idea here is just to tag the candidate with skills like JAVA(6), C++(5)... where 6 and 5 are the keyword occurrence counts.

That's mostly what I'm objecting to. Everyone knows how these things are "tagged" (rated, really, since there's a quantity involved) and that's how fakers and posers "game" the system. They use as many keywords as possible trying to get the highest count possible. That also makes them a form of "cheater" and, possibly, "liar". When such idiots actually get an interview with me, I normally prove that theory. About 80% of the morons claiming to be Senior DBAs with more than 10 years of experience and Senior Developers with more than 5 years of experience on their resumes, along with all the cool-kid buzz words, can't even tell me how to get the current date and time using T-SQL (and I'm dead serious there... no one could make something like that up).

    What recruiters and employers need to start doing is their homework instead of relying on buzz-word counts. It's a stupid way to sift out the candidates.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Dixie & J Livingston, sorry for the late reply - I had a rough weekend with a sick family (we're all sick).

    The Dixie Flatline (8/4/2016)


I've read the Nasty Fast N-Grams article and the approach is very similar to Jeff's DelimitedSplit8K() function. The primary difference is that the N-grams are fixed-length strings starting at every character position of the input string, while the words produced by a delimited split are of varying length, using space(1) as the delimiter.

By using fixed-length strings, the N-gram approach avoids having to use CHARINDEX to find the end of the word. That means each fixed-length N-gram requires slightly fewer CPU cycles to identify than a word would. But a much larger number of N-gram rows are being produced, as opposed to word rows, and these will have to be filtered, which adds some cycles back. I'm sure the size of the input string(s) is important to the relative performance, but I haven't tested both techniques.

First, thanks for reading the article! I have to agree with your analysis, which was very insightful. I spent some time Friday and today doing some tests. I tested both approaches: the splitter approach generates many fewer rows but many more reads. I tried this a few different ways and, in the end, using DelimitedSplit8K_LEAD (Eirikur Eiriksson's slightly faster version of DelimitedSplit8K, referenced in my signature) yielded much better results.

    How would the N-gram handle a situation where the initial search identifies BOOKWORM and BOOKEND as two keywords to match? They have different lengths. Would two passes be necessary? I'm genuinely curious because, although the benefits of bigrams, trigrams, etc. are clear, the benefits of unigrams over words in this instance is muddy (to me).

    Yes, a pass for each Keyword, even if they are the same length. Let's take another look at the solution I put together and compare it to a solution that uses NGrams8K. First the sample data.

/****************************************************************************************
(1) Sample Data
****************************************************************************************/;
-- Resume Table
CREATE TABLE #tblResumeTag
(
[Key] int identity primary key,
TextVersion varchar(8000) NOT NULL
);
-- Keywords table
CREATE TABLE #keywords (kwid int identity, KeyWordName varchar(50) NOT NULL);

INSERT #tblResumeTag(TextVersion)
VALUES
('Joe Doe - Objective: To use my many years of SQL experience... blah blah SQL...
blah SQL this & SQL that... SSIS.. SSIS.. SSIS..'),
('John Green, PMP, MSM - SQL Guru... yada yada SQL... blah blah blah... SSRS... SSAS'),
('Sue Jones Objective: To utilize my awesome skills and experience... yada SQL...
blah SQL this & SQL that... more experience with SQL... SSIS... SSRS');

INSERT #keywords VALUES('SQL'),('SSIS'),('SSRS'),('SSAS');

Now each solution (here I'm doing a grand total):

SET STATISTICS IO ON;
SET STATISTICS PROFILE ON;

-- DelimitedSplit8K
PRINT 'DelimitedSplit8K_LEAD:'
SELECT
r.[Key],
GrandTotal = COUNT(*)
FROM #tblResumeTag r
CROSS APPLY dbo.DelimitedSplit8K_LEAD(r.TextVersion,' ')
CROSS JOIN #keywords k
WHERE item LIKE '%'+KeyWordName+'%'
GROUP BY r.[Key], r.TextVersion;

-- NGrams8K
PRINT 'NGrams8K:'
SELECT
r.[Key],
GrandTotal = COUNT(*)
FROM #tblResumeTag r
CROSS JOIN #keywords k
CROSS APPLY dbo.NGrams8k(r.TextVersion, LEN(k.KeyWordName))
WHERE token = k.KeyWordName
GROUP BY r.[Key], r.TextVersion;

SET STATISTICS PROFILE OFF;
SET STATISTICS IO OFF;

    I'm not going to post the STATISTICS PROFILE results but if you run it locally you'll see that the Splitter solution generates 364 rows behind the scenes while the NGrams solution generates 1,423 rows. Look at the I/O however:

DelimitedSplit8K_LEAD:
Table '#keywords'. Scan count 1, logical reads 65, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#tblResumeTag'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

NGrams8K:
Table '#keywords'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#tblResumeTag'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    The splitter solution generates 68 reads vs 6 for the NGrams8K solution.

Now for a 10K row performance test. Note that I test each solution with a serial execution plan (MAXDOP 1) and a parallel execution plan (trace flag 8649).

-- More sample data
TRUNCATE TABLE #tblResumeTag;
INSERT #tblResumeTag
SELECT TOP (10000) x.t FROM
(VALUES
('Joe Doe - Objective: To use my many years of SQL experience... blah blah SQL...
blah SQL this & SQL that... SSIS.. SSIS.. SSIS..'),
('John Green, PMP, MSM - SQL Guru... yada yada SQL... blah blah blah... SSRS... SSAS'),
('Sue Jones Objective: To utilize my awesome skills and experience... yada SQL...
blah SQL this & SQL that... more experience with SQL... SSIS... SSRS'),
('Billy bob- Objective: To use my many years of SQL experience... blah blah SQL...
blah SQL this & SQL that... SSIS.. SSIS.. SSIS..'),
('Dr. Green, PMP, MSM - SQL Guru... yada yada SQL... blah blah blah... SSRS... SSAS'),
('Sally Smith Objective: To utilize my awesome skills and experience... yada SQL...
blah SQL this & SQL that... more experience with SQL... SSIS... SSRS')) x(t),
sys.all_columns;

-- Splitter solution:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
DECLARE @key int, @grandtotal int;

PRINT REPLICATE('-----',10)+CHAR(13)+CHAR(10)+'DelimitedSplit8K_LEAD parallel';
SELECT @key = [Key], @grandtotal = COUNT(*)
FROM #tblResumeTag r
CROSS APPLY dbo.DelimitedSplit8K_LEAD(r.TextVersion,' ')
CROSS JOIN #keywords k
WHERE item LIKE '%'+KeyWordName+'%'
GROUP BY r.[Key]
OPTION (RECOMPILE, QUERYTRACEON 8649);

PRINT REPLICATE('-----',10)+CHAR(13)+CHAR(10)+'DelimitedSplit8K_LEAD Serial';
SELECT @key = [Key], @grandtotal = COUNT(*)
FROM #tblResumeTag r
CROSS APPLY dbo.DelimitedSplit8K_LEAD(r.TextVersion,' ')
CROSS JOIN #keywords k
WHERE item LIKE '%'+KeyWordName+'%'
GROUP BY r.[Key]
OPTION (MAXDOP 1);

-- NGrams solution
PRINT REPLICATE('-----',10)+CHAR(13)+CHAR(10)+'NGrams8K parallel'+CHAR(13)+CHAR(10);
SELECT @key = [Key], @grandtotal = COUNT(*)
FROM #tblResumeTag c
CROSS JOIN #keywords k
CROSS APPLY dbo.NGrams8k(c.TextVersion, LEN(k.KeyWordName))
WHERE token = k.KeyWordName
GROUP BY [Key]
OPTION (RECOMPILE, QUERYTRACEON 8649);

-- NGrams solution
PRINT REPLICATE('-----',10)+CHAR(13)+CHAR(10)+'NGrams8K serial'+CHAR(13)+CHAR(10);
SELECT @key = [Key], @grandtotal = COUNT(*)
FROM #tblResumeTag c
CROSS JOIN #keywords k
CROSS APPLY dbo.NGrams8k(c.TextVersion, LEN(k.KeyWordName))
WHERE token = k.KeyWordName
GROUP BY [Key]
OPTION (MAXDOP 1);

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
GO

Although the splitter solution generates ~twice as many reads, it is faster (I did not expect this).

    In the end, the fastest solution does not use a DelimitedSplit8K, DelimitedSplit8K_LEAD or NGrams8K. Some time this morning I had one of those "Oh duh!!!" moments and came up with this:

SELECT
r.[Key],
r.TextVersion,
k.KeyWordName,
SubTotal =
(LEN(r.TextVersion) - LEN(REPLACE(r.TextVersion, k.KeyWordName,'')))/LEN(k.KeyWordName)
FROM #tblResumeTag r
LEFT JOIN #keywords k ON r.TextVersion LIKE '%'+k.KeyWordName+'%';
-- Note: used a LEFT JOIN because it will get you the same results but with one less scalar operator
-- in the execution plan

-- OR, for a grand total:
SELECT
r.[Key],
r.TextVersion,
GrandTotal =
SUM((LEN(r.TextVersion) - LEN(REPLACE(r.TextVersion, k.KeyWordName,'')))/LEN(k.KeyWordName))
FROM #tblResumeTag r
LEFT JOIN #keywords k ON r.TextVersion LIKE '%'+k.KeyWordName+'%'
GROUP BY r.[Key], r.TextVersion;
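A quick sanity check of that LEN/REPLACE arithmetic on a literal (the string and variable name here are illustrative only):

```sql
-- (total length - length with all matches removed) / match length = occurrence count
DECLARE @s varchar(100) = 'blah SQL this & SQL that... SQL';
SELECT OccurrenceCount =
       (LEN(@s) - LEN(REPLACE(@s, 'SQL', ''))) / LEN('SQL'); -- returns 3
```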

"I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

  • J Livingston SQL (8/4/2016)


Alan... consider the following please:

DECLARE @tblResumeTag TABLE ([Key] int identity primary key, TextVersion varchar(8000));
INSERT @tblResumeTag(TextVersion) VALUES
('I accelerated the development of the system by making rationale decisions based on my collaboration with senior corporate managers')

DECLARE @keywords TABLE (KeyWordName varchar(100));
INSERT @keywords VALUES('Rat')

SELECT [Key], TextVersion, KeyWordName
FROM @tblResumeTag c
CROSS JOIN @keywords k
CROSS APPLY dbo.NGrams8k(c.TextVersion, LEN(k.KeyWordName))
WHERE token = k.KeyWordName

    is there a way to use your ngrams for whole words?

There's no way to do this cleanly or easily. Keeping in mind that NGrams8K returns the position of the substring, you could find whole words that begin with "Rat" like this. I also included logic to determine whether "rat" (or whatever search string) represents the beginning of a word.

DECLARE @tblResumeTag TABLE ([Key] int identity primary key, TextVersion varchar(8000));
INSERT @tblResumeTag(TextVersion) VALUES
('I accelerated the development of the system by making rationale decisions based on my collaboration with senior corporate managers. Rats!')

DECLARE @keywords TABLE (KeyWordName varchar(100));
INSERT @keywords VALUES('Rat')

SELECT [Key], TextVersion, KeyWordName, position,
SUBSTRING(TextVersion, position, ISNULL(NULLIF(CHARINDEX(' ',TextVersion,position+1),0),8000+position)-position),
IsTruncated = IIF(SUBSTRING(TextVersion,position-1,1) LIKE '[a-zA-Z]', 1, 0)
FROM @tblResumeTag c
CROSS JOIN @keywords k
CROSS APPLY dbo.NGrams8k(c.TextVersion, LEN(k.KeyWordName)) ng
WHERE token = k.KeyWordName;


.. I want to display the candidate rank as an avg of all three columns.

    Using my sample data from earlier, the best way would be like so:

SELECT
r.[Key],
r.TextVersion,
k.KeyWordName,
GrandTotal =
AVG((LEN(r.TextVersion) - LEN(REPLACE(r.TextVersion, k.KeyWordName,'')))/LEN(k.KeyWordName))
FROM #tblResumeTag r
LEFT JOIN #keywords k ON r.TextVersion LIKE '%'+k.KeyWordName+'%'
GROUP BY r.[Key], r.TextVersion, k.KeyWordName
ORDER BY [Key];

    Jeff Moden (8/8/2016)


    shambhavi13 (8/8/2016)


You are absolutely right... but the idea here is just to tag the candidate with skills like JAVA(6), C++(5)... where 6 and 5 are the keyword occurrence counts.

That's mostly what I'm objecting to. Everyone knows how these things are "tagged" (rated, really, since there's a quantity involved) and that's how fakers and posers "game" the system. They use as many keywords as possible trying to get the highest count possible. That also makes them a form of "cheater" and, possibly, "liar". When such idiots actually get an interview with me, I normally prove that theory. About 80% of the morons claiming to be Senior DBAs with more than 10 years of experience and Senior Developers with more than 5 years of experience on their resumes, along with all the cool-kid buzz words, can't even tell me how to get the current date and time using T-SQL (and I'm dead serious there... no one could make something like that up).

    What recruiters and employers need to start doing is their homework instead of relying on buzz-word counts. It's a stupid way to sift out the candidates.

In the SQL/Data world a buzz-word count is pointless because you don't see that many resumes, and it's pretty easy to filter the potentially good from the very likely bad candidates by doing a quick "once over" of each resume. In situations where you receive a pile of resumes, though, I think this kind of thing can come in handy for deciding what order to read the resumes in. It's usually easy to identify the people who "tag" their resumes; for me, or the recruiter/HR person I'm advising, obviously "tagged" or padded resumes should be discarded because the person is dishonest.

On the flip side, if I'm hiring an MS BI Developer for an ETL-centric position and words like SSIS, CDC, warehouse, model, SQL, ETL, Kimball, star, data, transform, plan, lead, and tune don't appear at all, then perhaps those resumes belong at the bottom of the pile.


Yeah, no doubt the below solution you gave is the fastest of all... but after testing it on a few resumes it doesn't seem to give an exact count. Maybe it's because of the replace function we are using to replace it with a space.

SELECT
r.ProfileID,
k.KeyWordName,
SubTotal =
(LEN(r.TextVersion) - LEN(REPLACE(r.TextVersion, k.KeyWordName,'')))/LEN(k.KeyWordName)
FROM dbo.tblResumeTag r
LEFT JOIN dbo.tblkeywords k ON r.TextVersion LIKE '%'+k.KeyWordName+'%'
WHERE k.KeyWordName IS NOT NULL;

  • BWAAA-HAAA!!!! Even the machine must sense it's the wrong thing to do. 😛

    --Jeff Moden



  • shambhavi13 (8/9/2016)


Yeah, no doubt the below solution you gave is the fastest of all... but after testing it on a few resumes it doesn't seem to give an exact count.

SELECT
r.ProfileID,
k.KeyWordName,
SubTotal =
(LEN(r.TextVersion) - LEN(REPLACE(r.TextVersion, k.KeyWordName,'')))/LEN(k.KeyWordName)
FROM dbo.tblResumeTag r
LEFT JOIN dbo.tblkeywords k ON r.TextVersion LIKE '%'+k.KeyWordName+'%'
WHERE k.KeyWordName IS NOT NULL;

You would have to post some sample data to help figure out why you're not getting the correct count. One possible reason would be overlap; in that case, the NGrams solution would be the way to go.

Maybe because of the replace function we are using to replace it with a space.

It's replacing it with nothing (i.e., removing it). That's by design and is how the solution works: we take the length of @String, then subtract the length of the string after removing all instances of @SearchString. Consider this code:

DECLARE @String varchar(100) = 'xxx.yyy.zzz.abc', @SearchString varchar(100) = '.';
SELECT SearchStringCount = LEN(@string)-LEN(REPLACE(@string,@SearchString,''));

It returns 3, which is correct. This worked because the length of @SearchString is 1. When @SearchString is longer than 1, you just divide the result by the length of @SearchString:

DECLARE @String varchar(100) = 'xxx..yyy..zzz', @SearchString varchar(100) = '..';
SELECT SearchStringCount =
(LEN(@string) - LEN(REPLACE(@string,@SearchString,''))) / LEN(@SearchString);

This returns 2. The length of the string is 13; the length with all instances of @SearchString removed is 9, so 13-9 = 4. The length of the search string is 2, and (13-9)/2 = 4/2 = 2.

    This solution fails when there's overlap though. Consider the following code and comments:

DECLARE @String varchar(100) = 'xxx..yyy...zzz', @SearchString varchar(100) = '..';
SELECT SearchStringCount =
(LEN(@string) - LEN(REPLACE(@string,@SearchString,''))) / LEN(@SearchString);
-- This returns a 2; the correct answer is 3. @SearchString exists in 3 places:
-- SELECT SUBSTRING(@string, 4,2), SUBSTRING(@string, 9,2), SUBSTRING(@string, 10,2)

    The best way to deal with overlap would be to use NGrams8K:

DECLARE @String varchar(100) = 'xxx..yyy...zzz', @SearchString varchar(100) = '..';
SELECT SearchStringCount = COUNT(*)
FROM dbo.NGrams8k(@string,LEN(@SearchString))
WHERE token = @SearchString;


I am pretty satisfied with the NGrams8k solutions you gave. The results are accurate and, since I only have to deal with one candidate at a time, the processing time is 0:00. Right now, since I have three columns, I am asking the user to choose the column and applying an IF/ELSE condition. Do you suggest a better way to deal with this?

I know asking the user which version (Text/Word/PDF) they want to search is not appropriate.

IF(@Columns ='TextVersion')
Select rt.ProfileID, k.KeyWordName, SrchStrOccurances = COUNT(*), CandidateRank = RANK() OVER (ORDER BY COUNT(*) DESC, MIN(Position))
from dbo.tblResumeTag rt
INNER JOIN dbo.tblKeywords k
ON rt.TextVersion LIKE '%' + k.KeyWordName + '%'
CROSS APPLY dbo.NGrams8k(rt.TextVersion, LEN(k.KeyWordName))
WHERE token = k.KeyWordName
AND rt.ProfileID = @ProfileID
GROUP BY ProfileID, k.KeyWordName
ELSE IF (@Columns='WordVersion')
Select rt.ProfileID, k.KeyWordName, SrchStrOccurances = COUNT(*), CandidateRank = RANK() OVER (ORDER BY COUNT(*) DESC, MIN(Position))
from dbo.tblResumeTag rt
INNER JOIN dbo.tblKeywords k
ON rt.WordVersion LIKE '%' + k.KeyWordName + '%'
CROSS APPLY dbo.NGrams8k(rt.WordVersion, LEN(k.KeyWordName))
WHERE token = k.KeyWordName
AND rt.ProfileID = @ProfileID
GROUP BY ProfileID, k.KeyWordName

  • shambhavi13 (8/10/2016)


I am pretty satisfied with the NGrams8k solutions you gave. The results are accurate and, since I only have to deal with one candidate at a time, the processing time is 0:00. Right now, since I have three columns, I am asking the user to choose the column and applying an IF/ELSE condition. Do you suggest a better way to deal with this?

I know asking the user which version (Text/Word/PDF) they want to search is not appropriate.

IF(@Columns ='TextVersion')
Select rt.ProfileID, k.KeyWordName, SrchStrOccurances = COUNT(*), CandidateRank = RANK() OVER (ORDER BY COUNT(*) DESC, MIN(Position))
from dbo.tblResumeTag rt
INNER JOIN dbo.tblKeywords k
ON rt.TextVersion LIKE '%' + k.KeyWordName + '%'
CROSS APPLY dbo.NGrams8k(rt.TextVersion, LEN(k.KeyWordName))
WHERE token = k.KeyWordName
AND rt.ProfileID = @ProfileID
GROUP BY ProfileID, k.KeyWordName
ELSE IF (@Columns='WordVersion')
Select rt.ProfileID, k.KeyWordName, SrchStrOccurances = COUNT(*), CandidateRank = RANK() OVER (ORDER BY COUNT(*) DESC, MIN(Position))
from dbo.tblResumeTag rt
INNER JOIN dbo.tblKeywords k
ON rt.WordVersion LIKE '%' + k.KeyWordName + '%'
CROSS APPLY dbo.NGrams8k(rt.WordVersion, LEN(k.KeyWordName))
WHERE token = k.KeyWordName
AND rt.ProfileID = @ProfileID
GROUP BY ProfileID, k.KeyWordName

    That should be fine. Another way (I do it like this if I need to have the query inside an inline table-valued function) would be like this:

WITH
TextVersion AS
(
Select rt.ProfileID, k.KeyWordName, SrchStrOccurances = COUNT(*), CandidateRank = RANK() OVER (ORDER BY COUNT(*) DESC, MIN(Position))
from dbo.tblResumeTag rt
INNER JOIN dbo.tblKeywords k
ON rt.TextVersion LIKE '%' + k.KeyWordName + '%'
CROSS APPLY dbo.NGrams8k(rt.TextVersion, LEN(k.KeyWordName))
WHERE token = k.KeyWordName
AND rt.ProfileID = @ProfileID
GROUP BY ProfileID, k.KeyWordName
),
WordVersion AS
(
Select rt.ProfileID, k.KeyWordName, SrchStrOccurances = COUNT(*), CandidateRank = RANK() OVER (ORDER BY COUNT(*) DESC, MIN(Position))
from dbo.tblResumeTag rt
INNER JOIN dbo.tblKeywords k
ON rt.WordVersion LIKE '%' + k.KeyWordName + '%'
CROSS APPLY dbo.NGrams8k(rt.WordVersion, LEN(k.KeyWordName))
WHERE token = k.KeyWordName
AND rt.ProfileID = @ProfileID
GROUP BY ProfileID, k.KeyWordName
)
SELECT *
FROM TextVersion
WHERE @Columns ='TextVersion'
UNION ALL
SELECT *
FROM WordVersion
WHERE @Columns='WordVersion'
OPTION (RECOMPILE);

The OPTION (RECOMPILE) lets SQL Server resolve @Columns at compile time, so only the branch of the UNION ALL whose filter is true is executed; the other branch is pruned from the plan.
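A minimal sketch of that pruning behavior, using throwaway names:

```sql
-- With RECOMPILE, only the branch whose filter evaluates to true actually runs;
-- check the actual execution plan to see the other branch eliminated.
DECLARE @Columns varchar(20) = 'TextVersion';
SELECT Branch = 'TextVersion' WHERE @Columns = 'TextVersion'
UNION ALL
SELECT Branch = 'WordVersion' WHERE @Columns = 'WordVersion'
OPTION (RECOMPILE);
```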


Thank you so much.

I have another question in mind... what if I want to search for more than one word in a single document, in the same search, like in the below example:

-- Keyword parameter (a variable in this example)
DECLARE @SearchString varchar(100) = 'SQL' AND 'SSIS'

-- The solution:
SELECT
ProfileID,
k.KeyWordName,
SrchStrOccurances = COUNT(*), -- How many times was @SearchString found in ResumeTxt?
FirstOccurance = MIN(Position), -- Location of the 1st occurance of @SearchString
CandidateRank = RANK() OVER (ORDER BY COUNT(*) DESC, MIN(Position))
FROM DBO.tblResumeTag br
CROSS APPLY dbo.NGrams8k(TextVersion,LEN(@SearchString))
INNER JOIN dbo.tblKeywords k
ON br.TextVersion LIKE '%' + k.KeyWordName + '%'
AND k.KeyWordName LIKE '%' + @SearchString + '%'
WHERE token = @SearchString
GROUP BY ProfileID, k.KeyWordName;

Alan, just out of curiosity, would you be so kind as to test this solution and compare it to your final?

    This would have been my knee-jerk solution before you raised the n-gram topic.

with cte as (
select t.[Key], replace(replace(value,'.',''),',','') as KeywordName
from #tblResumeTag t
cross apply string_split(TextVersion,' ') ss --(or delimited splitter of choice)
where t.TextVersion like '%'+@searchstring+'%'
and value like '%'+@searchstring+'%'
and exists (select 1 from #keywords where KeywordName = replace(replace(value,'.',''),',',''))
)
select [Key], KeywordName, count(*)
from cte
group by [Key], KeywordName
order by [Key], count(*) desc

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • shambhavi13 (8/11/2016)


Thank you so much.

I have another question in mind... what if I want to search for more than one word in a single document, in the same search, like in the below example:

-- Keyword parameter (a variable in this example)
DECLARE @SearchString varchar(100) = 'SQL' AND 'SSIS'

-- The solution:
SELECT
ProfileID,
k.KeyWordName,
SrchStrOccurances = COUNT(*), -- How many times was @SearchString found in ResumeTxt?
FirstOccurance = MIN(Position), -- Location of the 1st occurance of @SearchString
CandidateRank = RANK() OVER (ORDER BY COUNT(*) DESC, MIN(Position))
FROM DBO.tblResumeTag br
CROSS APPLY dbo.NGrams8k(TextVersion,LEN(@SearchString))
INNER JOIN dbo.tblKeywords k
ON br.TextVersion LIKE '%' + k.KeyWordName + '%'
AND k.KeyWordName LIKE '%' + @SearchString + '%'
WHERE token = @SearchString
GROUP BY ProfileID, k.KeyWordName;

From the looks of this (and why your counts are coming back wrong), you are trying to add a filter to your keywords. If that's the case, you may want to consider better/different keywords. Here's how you would analyze the resumes table against the keywords table AND be able to pass a comma-delimited string of additional values.

    Note that I used the sample data I had saved which has some slightly different column names but should be simple to sort out. Play around with this:

-- Resume Table
CREATE TABLE #tblResumeTag
(
[Key] int identity primary key,
TextVersion varchar(8000) NOT NULL
);
-- Keywords table
CREATE TABLE #keywords (kwid int identity, KeyWordName varchar(50) NOT NULL);

INSERT #tblResumeTag(TextVersion)
VALUES
('Joe Doe - Objective: To use my many years of SQL experience... blah blah SQL...
blah SQL this & SQL that... SSIS.. SSIS.. SSIS..'),
('John Green, PMP, MSM - SQL Guru... yada yada SQL... blah blah blah... SSRS... SSAS'),
('Sue Jones Objective: To utilize my awesome skills and experience... yada SQL...
blah SQL this & SQL that... more experience with SQL... SSIS... SSRS, SSAS');

INSERT #keywords VALUES('SQL'),('SSIS');

-- Solution
DECLARE @SearchString varchar(100) = 'SSAS,SSRS'; -- multiple values
SELECT
[Key],
WordsFoundCount,
FirstOccurance,
CandidateRank = RANK() OVER (ORDER BY WordsFoundCount DESC, FirstOccurance ASC) -- used to remove dups
FROM
(
SELECT
[Key],
k.KeyWordName,
ng.token,
FirstOccurance = MIN(Position) OVER (PARTITION BY [Key]),
WordsFoundCount = COUNT(*) OVER (PARTITION BY [Key]),
rn = ROW_NUMBER() OVER (PARTITION BY [Key] ORDER BY (SELECT NULL))
FROM #tblResumeTag br
CROSS JOIN
(
SELECT KeyWordName
FROM #keywords
UNION -- UNION (DISTINCT) to prevent duplicates in the search string
SELECT item
FROM dbo.DelimitedSplit8K(@SearchString, ',')
) k
CROSS APPLY dbo.NGrams8k(TextVersion,LEN(k.KeyWordName)) ng
WHERE br.TextVersion LIKE '%' + k.KeyWordName + '%' AND k.KeyWordName = ng.token
) X
WHERE rn = 1;
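As an editorial aside (not in the original post): on SQL Server 2016 and later, the comma-delimited parameter could also be split with the built-in STRING_SPLIT, assuming you don't need the item's ordinal position:

```sql
-- Illustrative alternative to DelimitedSplit8K for splitting the parameter only;
-- STRING_SPLIT returns a single 'value' column and no position, so it cannot
-- replace NGrams8K (which needs the position) elsewhere in the solution.
DECLARE @SearchString varchar(100) = 'SSAS,SSRS';
SELECT KeyWordName = value
FROM STRING_SPLIT(@SearchString, ',');
```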


  • Alan.B (8/8/2016)


.. I want to display the candidate rank as an avg of all three columns.

    Using my sample data from earlier, the best way would be like so:

SELECT
r.[Key],
r.TextVersion,
k.KeyWordName,
GrandTotal =
AVG((LEN(r.TextVersion) - LEN(REPLACE(r.TextVersion, k.KeyWordName,'')))/LEN(k.KeyWordName))
FROM #tblResumeTag r
LEFT JOIN #keywords k ON r.TextVersion LIKE '%'+k.KeyWordName+'%'
GROUP BY r.[Key], r.TextVersion, k.KeyWordName
ORDER BY [Key];

    Jeff Moden (8/8/2016)


    shambhavi13 (8/8/2016)


You are absolutely right... but the idea here is just to tag the candidate with skills like JAVA(6), C++(5)... where 6 and 5 are the keyword occurrence counts.

That's mostly what I'm objecting to. Everyone knows how these things are "tagged" (rated, really, since there's a quantity involved) and that's how fakers and posers "game" the system. They use as many keywords as possible trying to get the highest count possible. That also makes them a form of "cheater" and, possibly, "liar". When such idiots actually get an interview with me, I normally prove that theory. About 80% of the morons claiming to be Senior DBAs with more than 10 years of experience and Senior Developers with more than 5 years of experience on their resumes, along with all the cool-kid buzz words, can't even tell me how to get the current date and time using T-SQL (and I'm dead serious there... no one could make something like that up).

    What recruiters and employers need to start doing is their homework instead of relying on buzz-word counts. It's a stupid way to sift out the candidates.

In the SQL/Data world a buzz-word count is pointless because you don't see that many resumes, and it's pretty easy to filter the potentially good from the very likely bad candidates by doing a quick "once over" of each resume. In situations where you receive a pile of resumes, though, I think this kind of thing can come in handy for deciding what order to read the resumes in. It's usually easy to identify the people who "tag" their resumes; for me, or the recruiter/HR person I'm advising, obviously "tagged" or padded resumes should be discarded because the person is dishonest.

On the flip side, if I'm hiring an MS BI Developer for an ETL-centric position and words like SSIS, CDC, warehouse, model, SQL, ETL, Kimball, star, data, transform, plan, lead, and tune don't appear at all, then perhaps those resumes belong at the bottom of the pile.

As you say, finding good candidates is very rare and it's surprising how many good candidates can't actually write a decent resume or CV. It's equally surprising how many truly incompetent candidates will load up the R/CV with buzz-words when they can't actually even spell SQL. Lots of folks list the tools that were available in each environment they worked in, not what they're actually good at. I even had one guy state that he was "Extremely knowledgeable in the use of SQL Server 2005/2008/2012, 2014 and Stored Procedures". The trouble was, all he did was USE stored procedures. He had no clue how to write them. This same guy listed "SQL Server" as a tool for each job he had in his resume. He's also one of the folks that didn't even know how to get the current date and time, nor did he know the first two words of the syntax to create a stored procedure.

On the flip side, one of the best people I ever hired had an incredibly boring and seemingly unrelated resume, owing mostly to no on-the-job experience and lacking almost every related keyword you might be looking for. It turned out that he was a recent graduate with a 4-server lab setup at home, and he absolutely smoked the competition during the interview both as a DBA and as a Database Developer.

If you have a lot of resumes to go through, there's no sense in wasting time figuring out how to sort them. Hunker down, do the job, and interview them all, because most of the ones you sort to the top of the pile are going to be as bad as, or worse than, the ones at the bottom.

    --Jeff Moden


