Need to intelligently concatenate words

Question

Need to intelligently concatenate words

Mark Shvarts 1

SSCertifiable

Points: 5702
More actions
October 23, 2007 at 9:30 am

#179525

I converted some database from our legacy system to SQL Server 2005 and now building full-text search engine on it, particularly on tables having notes fields. These notes fields come in 4 by fixed length of 80 chars for each field. I need to concatenate all these notes fields into one, but the problem is if I simply concatenate them like notes1 + notes2 I get a result when two sepate words are concatenated like "fixedincome" which should not be like this. However if I try to add a space between them like notes1 + ' ' + notes2 I get some other cases when a word is broken, for example "cor porate" which full-text search engine will consider as two separate words.
Is there some intelligent way to concatenate them, particularly in full-text search engine ? Table has about 500k records, and cases when at least one field in a record has length of 80 is about 30k, so I have potentially about 30k cases where concatenation could be wrong.
Thanks

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

Grant Fritchey SSC Guru Points: 398886 More actions · Answer 1

I've never heard of anything like that. Best guess I can come up with (and it stinks) is to run spell check on the fields after you concatenate them...

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

Joel Ewald SSCertifiable Points: 6120 More actions · Answer 2

I would think you would only come across data such as

Old Note

__________________

blah blah blah.... COR

PORATE blah blah

when they reached the end of the first line.

so you might check against the old note field being totally filled --

len() = 80

then concatenate without a space otherwise do it with a space.

Obviously you will still have problems if they split it pre 80 or if a line is totally filled and ends with a full word.

You could also analyze some of the data and create a small dictionary table and when you see a line end with certain characters you apply one rule or another. Basically a word rule table.

You could also try to find some webservice or data dictionary you can search against. (much the same as above but more inclusive - perhaps us ms word api for spell checker)

There is really no easy way to do it that I can think of without analyzing the data for patterns.

Michael Earl-395764 SSC Guru Points: 53873 More actions · Answer 3

I think you trimmed your field by accident in your import.

You are saying:

If FieldA has 'Hot' and FieldB has 'Water', FieldA+FieldB is 'HotWater' (bad)

If FieldA has 'Hot Wa' and FieldB as 'ter', FieldA+FieldB is 'Hot Water' (good)

Make sure your original source did not really have 'Hot ' in one field and 'Water' in the next. You may have inadvertantly trimmed a trailing (or leading) space off of your data.

I assume the original system worked, and correctly concatenating what you have together is, in fact, impossible. If you have a field with 'race' in it and then another with 'car', you have two actual words that need to be put together because of context. I doubt that was really happening.

Mark Shvarts 1 SSCertifiable Points: 5702 More actions · Answer 4

I have to explain why we have such problems.

In our legacy application there were 4 lines for notes each by 80 chars, so there were 4 notes columns in our database table: notes1, notes2, etc. If they had just one column with 320 characters I would have such problem today.

So in their front-end applications users were entering data in different manner. Some entered "COR" on one line and "PORATE" on the following one (in this case I have to concatenate them without a space).

However in a case like for example "BOND TRADE" they could enter "BOND" in a first line where the word "BOND" occupies the last 4 places in note box, without space, then enter "TRADE" on the following line, without leading space. In such a case I have to concatenate them with space.

Even worse, in case of "CORPORATE" some were entering "COR-" in 1st line and "PORATE" on the 2nd one.

Now when I converted all these data to SQL Server, I have just one field with varchar(320). I am doing full-text indexing, and instead of just one CORPORATE it comes up with CORPORATE, COR, PORATE, COR-PORATE words.

Talking about using spell checker, how can I apply it so it will run programmatically ? Should I create CLR function for it or there are some available solutions?

SQLMAIN SSC Eights! Points: 936 More actions · Answer 5

FYI - Even Spell check is not going to be 100 % reliable for words like - MESSAGE - if it splits in two line like MESS being the last four on one line AGE being the first three on next line.

Actually - it this is a free form text how are you ensuring that the rest of the words are spelled correctly with in the 80 byte line?

One option can be to have your SEARCH ENGINE so built that it searches as a whole word and if not found - does a search where it splits the word into two (all combinations - like for a 6 letter word - 1 + 5 then 2 + 3 then 3 + 3 then 4 + 1 then 5 + 1) - Does not sound like the right solution - but an option to think it defferently.