March 13, 2013 at 1:01 pm
I have a list of 1 million+ products with numeric barcode values stored as VARCHAR(50) that can range in length in a table. I'm trying to find a way to do a fuzzy search query when a user enters a barcode to find a product. Let's say I have a barcode table like:
Barcodes
----------
98425103
98425104
99425103
99564312
If the user entered something like 98425102, I would expect it to return 98425103. However, if the user entered 97425103, I'd expect both 98425103 and 99425103 to be returned. I'm basically looking for the closest matches to the entered barcode, while factoring misspellings/mistypes as well. Would I implement this via RegEx CLR or is there a way to do this through the standard LIKE operator?
March 13, 2013 at 1:32 pm
the barcodes i'm used to, like at the grocery store are basically a concatenation of a vendor number plus a product number;
ie 41415 -00106 is Eggs, Large at Publix Supermarkets, if my memory is not off.
so the 41415 is the vendor, and the other half is a product code, which is assigned by the vendor and can be anything;
in your example, i can understand ignoring the last digit for similar items, but the modification of a number somewhere in the middle like your example implies to me that a different vendor with the same product? is that what your sku represents?
regardless, i think i'd consider generating a related skus table with the criteria you you can identify, insert the related items based on almost matches and use that for the query;
any string manipulation on a million row varchar is going to require a table scan every time, and performance will suffer.
but if you add another table so you know that 101 thru 109 are related to each other, it'd be possible to use indexed and speed things up.
back to my example:
41415-00106 would be cross joined against *105 (medium eggs),*107(extra large eggs) and *108 (Jumbo eggs)
then because i KNOW some other brand of Free Range Eggs are also related, i'd relate Free Range Eggs (99969-45789 )them to those as well, even though their numbers have nothing to do with each other, but their product IS related.
Lowell
March 13, 2013 at 3:13 pm
Unfortunately, the SKUs aren't all vendor specific, but for those that are I could implement narrowing down the product number to improve performance. This still leaves me with, from this pared down result, trying to match on say the last 5 numbers. And upon entry, if the user mis-entered one of the last 5 numbers and the product may not exist in the table. In this case, how would I bring back the closest matching results/products?
March 13, 2013 at 4:24 pm
I don't suppose you're going to make it easy on use and mention that all the skus are comprised of numbers, right?
If they are all numbers you can use something like this to get the closest matches.
declare @userinput varchar(50)
set @userinput = '965487312'
select barcode
from
(
select '965587312' barcode
union
select '965486312' barcode
union
select '965487322' barcode
union
select '965487311' barcode
) source
order by abs(cast(@userinput as bigint) - cast(barcode as bigint)) asc
March 14, 2013 at 7:47 am
Actually they are all numbers... and that's an interesting approach. I had always examined it as a pattern matching problem, but I could use the difference to calculate which number is closest to what the user has inputted. I'm going to test this on my data and get back to this thread later.
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply