June 23, 2016 at 7:02 pm
Hi,
I have an application where we gather RSS news feed from around the world into this staging table.
From that staging table, we only need "important" news articles to move into the real RSSItems table.
I have been asked to exclude RSS items where words like "Kardashian" appear in the description. What we would rather have are topics about disasters, terror, crime, politics, and scandal.
My first thought was to create a filter table with the serious words, but then I realized that even articles about reality stars may contain such words. Listing nonsense words is just as unreliable.
I am asking for just an idea of a common-sense approach.
I want to left join the staging table against the filter table to fill the main RSSItems table with as much "real" news as possible (a rough sketch is at the end of this post).
Does anyone know of a ready-to-purchase topical filter list?
Any ideas or direction will be appreciated.
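Here's the sort of thing I mean; all table and column names are placeholders, and this is the simple LIKE-based version:

-- Keep only staging rows whose description matches no filter word.
SELECT s.*
FROM dbo.RSSStaging s
LEFT JOIN dbo.FilterWords f
       ON s.Description LIKE '%' + f.Word + '%'
WHERE f.Word IS NULL; -- no match on any filter word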
June 24, 2016 at 5:02 pm
frontrunner148 (6/23/2016)
I have been asked to exclude RSS items where words like "Kardashian" appear in the description. What we would rather have are topics about disasters, terror, crime, politics, and scandal.
Odd... I thought the single word given was a synonym for the others. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
June 24, 2016 at 5:05 pm
On the serious side, how are these feeds stored in the staging table? VARCHAR, NVARCHAR, VARBINARY, XML or ???? Within that question, are they the MAX datatype? Is each feed contained entirely in a single row?
--Jeff Moden
Change is inevitable... Change for the better is not.
June 24, 2016 at 9:07 pm
Good question, I should have provided that.
Each RSSItem has a guid (varchar(512)), AuthorName (varchar(256)), Title (nvarchar(500)), Description (nvarchar(4096)), PublishDate (a UTC datetime), and a URL to the full news article (varchar(2048)).
I want to filter the description.
Each row is a separate news summary with link.
I hope this helps.
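For reference, here's a minimal sketch of the staging table as described; the table name is just a placeholder:

-- Staging table for incoming RSS items; one row per news summary.
CREATE TABLE dbo.RSSStaging
(
    guid        VARCHAR(512)   NOT NULL,  -- feed-supplied identifier
    AuthorName  VARCHAR(256)   NULL,
    Title       NVARCHAR(500)  NULL,
    Description NVARCHAR(4096) NULL,      -- the column to filter on
    PublishDate DATETIME       NOT NULL,  -- stored as UTC
    URL         VARCHAR(2048)  NULL       -- link to the full article
);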
June 24, 2016 at 9:08 pm
In most cases, that would be a correct assumption 😉
June 25, 2016 at 8:49 am
My first inclination would be to use the UNIQUEIDENTIFIER datatype for the GUID instead of VARCHAR, but that may be contrary to the formatting of the data received. If a value isn't a correctly formatted GUID, it will cause rejection of the row during import.
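If you stay with VARCHAR, you can at least vet the incoming values first. On SQL Server 2012 or later, something like this (staging table name is a placeholder) flags the rows that wouldn't survive a conversion:

-- Rows whose guid column is not a valid GUID.
SELECT s.guid
FROM dbo.RSSStaging s
WHERE TRY_CONVERT(UNIQUEIDENTIFIER, s.guid) IS NULL;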
Shifting gears to the subject at hand and with performance in mind, I'd be tempted to do a word-level split of the title and description into a table that would take the GUID and the split-out word for each word, possibly excluding such noise words as "and", "or", "a", "the", etc. Then it would be a simple matter to join that table to a "black-ball" list of words in a high-performance fashion to return a unique list of GUIDs that contained black-balled words. Think of it as an easy-to-maintain, poor man's version of "FULL TEXT" lookups that you can have some pretty exquisite control over.
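As a rough sketch of that idea (all object names here are made up):

-- One row per (item, word); populated by a splitter (see below).
CREATE TABLE dbo.RSSItemWord
(
    guid VARCHAR(512)  NOT NULL,
    Word NVARCHAR(100) NOT NULL
);
CREATE CLUSTERED INDEX IX_RSSItemWord ON dbo.RSSItemWord (Word, guid);

-- The "black-ball" word list.
CREATE TABLE dbo.BlackBallWord
(
    Word NVARCHAR(100) NOT NULL PRIMARY KEY
);

-- Unique list of items containing at least one black-balled word.
SELECT DISTINCT w.guid
FROM dbo.RSSItemWord w
JOIN dbo.BlackBallWord b
  ON b.Word = w.Word;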
You could, of course, do a LIKE '%word%' search for every black-balled word, but that will become really slow as the number of black-balled words and the number of RSS feeds continue to grow.
If you want to try a really fast, T-SQL-only splitter, look for the "DelimitedSplitN4K" splitter in the "Resources" section of the following article. The only thing that would be faster is a well-written, properly designed and operating SQLCLR to do the splits with. Sadly, many such SQLCLRs leave out many nuances as to how to provide correct returns for such splitters, so be really careful in your testing if you decide to go that route. Even the SQLCLR splitter in the article I reference has such problems.
Here's the link to the article...
http://www.sqlservercentral.com/articles/Tally+Table/72993/
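For example, assuming the splitter's usual two-column return of ItemNumber and Item, a space-delimited split of the title and description could populate the word table like this (noise-word removal left out for brevity):

-- NOTE: Description is NVARCHAR(4096), slightly over the splitter's
-- NVARCHAR(4000) input limit, so the longest descriptions would be
-- silently truncated; test for that or use a MAX-capable variant.
INSERT INTO dbo.RSSItemWord (guid, Word)
SELECT s.guid, split.Item
FROM dbo.RSSStaging s
CROSS APPLY
(
    SELECT Item FROM dbo.DelimitedSplitN4K(s.Title, N' ')
    UNION ALL
    SELECT Item FROM dbo.DelimitedSplitN4K(s.Description, N' ')
) split
WHERE split.Item > N'';  -- skip empties from doubled spaces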
--Jeff Moden
Change is inevitable... Change for the better is not.
June 27, 2016 at 12:57 am
Jeff Moden (6/25/2016)
My first inclination would be to use the UNIQUEIDENTIFIER datatype for the GUID instead of VARCHAR, but that may be contrary to the formatting of the data received. If a value isn't a correctly formatted GUID, it will cause rejection of the row during import.
Shifting gears to the subject at hand and with performance in mind, I'd be tempted to do a word-level split of the title and description into a table that would take the GUID and the split-out word for each word, possibly excluding such noise words as "and", "or", "a", "the", etc. Then it would be a simple matter to join that table to a "black-ball" list of words in a high-performance fashion to return a unique list of GUIDs that contained black-balled words. Think of it as an easy-to-maintain, poor man's version of "FULL TEXT" lookups that you can have some pretty exquisite control over.
You could, of course, do a LIKE '%word%' search for every black-balled word, but that will become really slow as the number of black-balled words and the number of RSS feeds continue to grow.
If you want to try a really fast, T-SQL-only splitter, look for the "DelimitedSplitN4K" splitter in the "Resources" section of the following article. The only thing that would be faster is a well-written, properly designed and operating SQLCLR to do the splits with. Sadly, many such SQLCLRs leave out many nuances as to how to provide correct returns for such splitters, so be really careful in your testing if you decide to go that route. Even the SQLCLR splitter in the article I reference has such problems.
Here's the link to the article...
Thanks, Jeff...
It's like indexing each row in the two columns.
Great idea... and fast...
June 27, 2016 at 8:25 am
frontrunner148 (6/27/2016)
Thanks, Jeff...
It's like indexing each row in the two columns.
Great idea... and fast...
Thanks for the feedback. Actually, though, it's more like indexing each word in the RSS feed titles and descriptions. And, as you imply, it's nasty fast.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 29, 2016 at 12:14 am
I will assume you are using a programming language that is more efficient with pattern matching and string searches to collect all these RSS feeds. Why not do your search in the application/script layer prior to inserting the records into the database? That would eliminate two database steps (the word search and one insert) from your process.
[font="Verdana"]Sal Young[/font]
[font="Verdana"]MCITP Database Administrator[/font]
June 29, 2016 at 4:07 pm
I feel sorry for the millions of other Kardashians who live in and (mainly) outside of the USA.
They do not have any chance to get on the news.
Unless they change their last name.
:hehe:
_____________
Code for TallyGenerator