text mining

  • Can anybody assist with a text mining issue of mine? I have several news articles, and I'm trying to pinpoint essentially who the article is about. The associated names are in a separate table. Sometimes multiple names appear in an article, but I need to return the name of the person that appears the most in each article.

    I'm pretty new to SQL, and very new to text mining.

    I had a thought that seemed like it would take way too long and be way too much trouble, but I wondered if there was a fairly simple solution.

  • Full text indexing may help but I don't know enough about it or have enough details about your requirement to say for sure. I do have a function, however, which might get you started. Below is the code for a version of my

    ngrams function that splits strings up into groups of words. Note the comment section for how to use it. The function used Jeff Moden's splitter which you can read about in my signature line.

    CREATE FUNCTION dbo.nGramsAB2(@string varchar(8000), @k int)

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    /*******************************************************************************

    Purpose:

    Per Wikipedia (http://en.wikipedia.org/wiki/N-gram) an "n-gram" is defined as:

    "a contiguous sequence of n items from a given sequence of text or speech. The

    items can be phonemes, syllables, letters, words or base pairs according to the

    application. "

    nGrams2 breaks a string into words. The number of words per row is defined

    by @k.

    Usage:

    SELECT * FROM nGramsAB2('One two three four words',1);

    Returns:

    ItemNumber gram

    1 One

    2 two

    3 three

    4 four

    5 words

    SELECT * FROM nGramsAB2('One two three four words',2);

    Returns:

    ItemNumber gram

    1 One two

    2 two three

    3 three four

    4 four words

    SELECT * FROM nGramsAB2('One two three four words',3);

    Returns:

    ItemNumber gram

    1 One two three

    2 two three four

    3 three four words

    Programmer Notes:

    1. This function splits words by spaces; the text must be pre-formatted to address

    line breaks, carriage returns multiple spaces, etc.

    Revision History:

    Rev 00 - 5/25/2015 Initial conception - Alan Burstein

    Rev 01 - 8/19/2000 Changed to 8K limit for SSC demo - Alan Burstein

    *******************************************************************************/

    WITH grams AS (SELECT ItemNumber, Item FROM dbo.DelimitedSplit8K(@string,' '))

    SELECT TOP (SELECT max(ItemNumber)-(@k-1) FROM grams)

    ItemNumber,

    gram =

    (

    SELECT TOP (@k) item+' '

    FROM grams g2

    WHERE g2.ItemNumber >= g.ItemNumber

    FOR XML PATH(''), TYPE

    ).value('.', 'varchar(8000)')

    FROM grams g;

    GO

    Below is some sample data with the names of people you may be searching for (for my solution to work you would need to have the names stored in a table (temp table or temp variable would be fine). I'm going to use my function to break the sample text into bi-grams then join my people table to it...

    DECLARE @article varchar(8000) =

    'Video provides a powerful way to help you prove your point. When you click Online Video, you can paste in the embed code for the video you want to add. You can also type a keyword to search online for the video that best fits your document. Bob Smith

    To make your document look professionally produced, Word provides header, footer, cover page, and text box designs that complement each other. For example, you can add a matching cover page, header, and sidebar. Click Insert and then choose the elements you want from the different galleries.

    Themes and styles also help keep your document coordinated!!! When you click Design and choose a new Theme, the pictures, charts, and SmartArt graphics change to match your new theme. When you apply styles, your headings change to match the new theme. Bob Smith

    Save time in Word with new buttons that show up where you need them. Quinton Jackson. To change the way a picture fits in your document, click it and a button for layout options appears next to it. When you work on a table, click where you want to add a row or a column, and then click the plus sign.

    Reading is easier, too, in the new Reading view. You can collapse parts of the document and focus on the text you want. If you need to stop reading before you reach the end, Word remembers where you left off - even on another device. Jon Jones';

    SELECT ng.*

    FROM dbo.nGramsAB2(@article,2) ng

    JOIN dbo.people p

    ON gram LIKE '%'+p.firstname+'%' AND gram LIKE '%'+p.lastname+'%';

    You'll see that Bob Smith occurs the most often but it requires a visual inspection to determine that.

    ------------------------

    48Bob Smith To

    136Bob Smith Save

    151Quinton Jackson.

    242Jon Jones

    Let's do some pre-cleaning on the string. This code will replace punctuation (?.!;) with spaces as well as other whitespace with spaces (things such as carriage returns, tabs, etc). Then it replaces multiple spaces with single spaces. In the end we have a bunch of words separated by only one space.. After the pre-cleaning we'll feed the string to my function....

    DECLARE @article varchar(8000) =

    'Video provides a powerful way to help you prove your point. When you click Online Video, you can paste in the embed code for the video you want to add. You can also type a keyword to search online for the video that best fits your document. Bob Smith

    To make your document look professionally produced, Word provides header, footer, cover page, and text box designs that complement each other. For example, you can add a matching cover page, header, and sidebar. Click Insert and then choose the elements you want from the different galleries.

    Themes and styles also help keep your document coordinated!!! When you click Design and choose a new Theme, the pictures, charts, and SmartArt graphics change to match your new theme. When you apply styles, your headings change to match the new theme. Bob Smith

    Save time in Word with new buttons that show up where you need them. Quinton Jackson. To change the way a picture fits in your document, click it and a button for layout options appears next to it. When you work on a table, click where you want to add a row or a column, and then click the plus sign.

    Reading is easier, too, in the new Reading view. You can collapse parts of the document and focus on the text you want. If you need to stop reading before you reach the end, Word remembers where you left off - even on another device. Jon Jones';

    SET @article =

    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(

    RTRIM(LTRIM(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(

    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(

    @article,

    CHAR(1),CHAR(32)),CHAR(9),CHAR(32)),CHAR(10),CHAR(32)),CHAR(11),CHAR(32)),

    CHAR(12),CHAR(32)),CHAR(13),CHAR(32)),CHAR(28),CHAR(32)),CHAR(29),CHAR(32)),

    CHAR(30),CHAR(32)),CHAR(31),CHAR(32)),CHAR(160),CHAR(32)),'.',CHAR(32)),

    '!',CHAR(32)),',',CHAR(32)),'?',CHAR(32)),';',CHAR(32)))),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' ')

    SELECT ng.*

    FROM dbo.nGramsAB2(@article,2) ng

    JOIN dbo.people p

    ON gram LIKE '%'+p.firstname+'%' AND gram LIKE '%'+p.lastname+'%';

    Much better, now we have this:

    48Bob Smith

    138Bob Smith

    154Quinton Jackson

    246Jon Jones

    The final step is to do some grouping like so:

    SELECT gram, Occurances = COUNT(gram)

    FROM dbo.nGramsAB2(@article,2)

    JOIN dbo.people p

    ON gram LIKE '%'+p.firstname+'%' AND gram LIKE '%'+p.lastname+'%'

    GROUP BY gram

    ORDER BY COUNT(gram) DESC;

    Which gives us:

    gram Occurances

    ---------------------------------------------------------

    Bob Smith 2

    Jon Jones 1

    Quinton Jackson 1

    And there you have it. The whole solution would look like this...

    IF OBJECT_ID('dbo.people') IS NOT NULL DROP TABLE dbo.people

    CREATE TABLE dbo.people (firstname varchar(100), lastname varchar(100));

    INSERT dbo.people VALUES ('Jon','Jones'),('Bob','Smith'),('Quinton','Jackson')

    DECLARE @article varchar(8000) =

    'Video provides a powerful way to help you prove your point. When you click Online Video, you can paste in the embed code for the video you want to add. You can also type a keyword to search online for the video that best fits your document. Bob Smith

    To make your document look professionally produced, Word provides header, footer, cover page, and text box designs that complement each other. For example, you can add a matching cover page, header, and sidebar. Click Insert and then choose the elements you want from the different galleries.

    Themes and styles also help keep your document coordinated!!! When you click Design and choose a new Theme, the pictures, charts, and SmartArt graphics change to match your new theme. When you apply styles, your headings change to match the new theme. Bob Smith

    Save time in Word with new buttons that show up where you need them. Quinton Jackson. To change the way a picture fits in your document, click it and a button for layout options appears next to it. When you work on a table, click where you want to add a row or a column, and then click the plus sign.

    Reading is easier, too, in the new Reading view. You can collapse parts of the document and focus on the text you want. If you need to stop reading before you reach the end, Word remembers where you left off - even on another device. Jon Jones';

    SET @article =

    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(

    RTRIM(LTRIM(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(

    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(

    @article,

    CHAR(1),CHAR(32)),CHAR(9),CHAR(32)),CHAR(10),CHAR(32)),CHAR(11),CHAR(32)),

    CHAR(12),CHAR(32)),CHAR(13),CHAR(32)),CHAR(28),CHAR(32)),CHAR(29),CHAR(32)),

    CHAR(30),CHAR(32)),CHAR(31),CHAR(32)),CHAR(160),CHAR(32)),'.',CHAR(32)),

    '!',CHAR(32)),',',CHAR(32)),'?',CHAR(32)),';',CHAR(32)))),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' ')

    SELECT gram, Occurances = COUNT(gram)

    FROM dbo.nGramsAB2(@article,2)

    JOIN dbo.people p

    ON gram LIKE '%'+p.firstname+'%' AND gram LIKE '%'+p.lastname+'%'

    GROUP BY gram

    ORDER BY COUNT(gram) DESC;

    A final note: Jeff's splitter has an 8K limit. You'll probably need something that handles varchar(max). If you think this is a possible solution let me know and I can include an updated version of the ngrams function that uses a hacked version of jeff's splitter I created to handles more than 8K (I'd want to get people's take on it first before posting it here though).

    Final, final note: I almost forgot to give credit for the code I use that replaces multiple spaces with one.

    It came from a guy named Michael Meierruth somewhere in the comment portion of this article (also by Jeff Moden): REPLACE Multiple Spaces with One[/url].

    Edit: Credit to Michael Meierruth

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Like I said, I'm fairly new to SQL. Still, this makes a lot more sense than I would figure it should (to me). looking over it, it could be this or something derived from it that would work. I would essentially run this on a small scale, about 90 articles and a little under 1000 people. Each person mentioned in each article TYPICALLY is mentioned at least once firstname lastname, but usually will have several mentions by just first or last name. so I would have to filter the articles first by determining who is in each article to begin with (actually already have that piece of code, simple as it is), then run a code that would scrub for number of mentions of only those people (by first, last or first and last), and then set it to count who is mentioned the most.

    So my thought was to set a temporary table or CTE with the people mentioned in each article, then run the count based on the CTE, and whoever has most mentions in each article will be flagged.

    Its still in the beginning stages, but I appreciate the help, this will go a long way.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply