May 24, 2006 at 8:06 am
http://www.developmentnow.com/g/113_2005_5_0_0_527840/TRANSPOSE-QUERY-fat-fingers.htm
i have been trying to figure out a way of doing this but havnt had much luck. Any ideas?
May 24, 2006 at 8:42 am
Here's the text:
"I would like a query that pulls back pairs of numbers that should be the
same, but one was mistyped.
Specifically, I am looking for numbers where two neighboring digits were
transposed in one of the numbers."
Hmmm, I think I have some ideas...
Tim Wilkinson
"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant
May 24, 2006 at 8:51 am
Well, I can tell you very rapidly if any two numbers have a pair of digits transposed. The difference will always be divisible by 9. (Old accounting trick, I can show you why it works if you want.) But, that won't tell you which digits were transposed. It's also not a guaranteed check, as two totally different numbers can also have a difference that's divisible by 9.
The Levenshtein algorithm mentioned in a reply to that post would seem to work, but, again, you can't guarantee what difference is triggering the count.
My solution, off the top of my head, would be separate the digits into an array, and run a couple of loops for comparison. It would be extremely clunky, but, properly done, would have the advantage of maximum information (it would be able to tell you which two digits were transposed), and minimum false positives.
May 24, 2006 at 9:01 am
Yes, I found that mod 9 rule empirically using excel. there had to be something useful about the similarity and 'reverseness' of the numbers.
That is definitely the way to go. I think I've nearly(!) got a reasonable solution worked out... It's making my laptop creak a bit though at the moment.
Tim Wilkinson
"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant
May 24, 2006 at 9:47 am
Right I think I have it. I was hoping not to use any string functions, but I did in the end rather than using log10() a hundred times. I've used arithmetic to limit the rows as much as I can, then the string functions to do the final check. code follows...
There is a problem that leading zeros are not ahndled, but we are talking about integer literals, I have assumed.
Tim Wilkinson
"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant
May 24, 2006 at 9:59 am
@rows_vc = cast(power(cast(count(*) as bigint),2) as varchar(20)) from someintegers
'starting query over ' + @rows_vc + ' permutations: ' + convert(varchar,getdate(),113)
s1.integer, s2.integer
someintegers s1
someintegers s2
abs(s1.integer-s2.integer)%9 = 0
abs(s1.integer-s2.integer)/9.0/power(10,floor(log10(abs(s1.integer-s2.integer)/9)))
between 1 and 8
stuff(s1.integer,cast(floor(log10(s1.integer))-floor(log10(abs(s1.integer-s2.integer)/9)) as int),2,'')
= stuff(s2.integer,cast(floor(log10(s2.integer))-floor(log10(abs(s1.integer-s2.integer)/9)) as int),2,'')
= reverse(substring(cast(s2.integer as varchar),cast(floor(log10(s2.integer))-floor(log10(abs(s1.integer-s2.integer)/9)) as int),2))
s1.magnitude > 1
s2.magnitude > 1
by s1.magnitude, s1.integer
'query ends: ' + convert(varchar,getdate(),113)
Tim Wilkinson
"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant
May 24, 2006 at 10:29 am
May 24, 2006 at 10:38 am
Tim Wilkinson
"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant
May 24, 2006 at 11:41 am
Using the hash as a computed column is very nice. That should refine your result set pretty nicely.
---
I'm not sure what the point of the magnitude column is. I can't come up with a pair of numbers which would have the same hash, but different magnitudes. Especially as you then check the same condition again, with:
and floor(log10(s1.integer)) = floor(log10(s2.integer))
It would be easy enough to replace the reference to magnitude in the WHERE clause with:
s1.integer > 10 AND s2.integer > 10
---
With this line:
and abs(s1.integer-s2.integer)/9/power(10,floor(log10(abs(s1.integer-s2.integer)))) between 1 and 8
First of all, I think you should add a set of parens. I'm not sure what the order of precedence is for those divisions. But, I do know that (a/b)/c a/(b/c). Better to be explicit.
I'm also not sure what this line is supposed to accomplish. Suppose:
s1 = 54867, s2 = 58467
s2 - s1 = 3600
(s2 - s1)/9 = 400
floor(log10(s2 - s1)) = 3
power(10,floor(log10(s2 - s1))) = 1000
So, unless I'm calculating it wrong, the value comes out to .4. Which is not between 1 and 8. And, yet, this pair of numbers clearly should be included.
Even if the problem is being off by an order of magnitude, I'm still not sure what property you are trying to test for here. What would it prove if the value came out to 8.1?
---
All in all, a very tight solution. I like it. How long did it take to run on your laptop?
May 24, 2006 at 12:54 pm
zero isn't represented in the hash value (not enough digits in a 32b int), so hash(180)=hash(18)=hash(80100).
>and floor(log10(s1.integer)) = floor(log10(s2.integer))
>It would be easy enough to replace the reference to magnitude in the WHERE clause with:
>s1.integer > 10 AND s2.integer > 10
>With this line:
>and abs(s1.integer-s2.integer)/9/power(10,floor(log10(abs(s1.integer-s2.integer)))) between 1 and 8
>First of all, I think you should add a set of parens. I'm not sure what the order of precedence is for >those divisions. But, I do know that (a/b)/c <> a/(b/c). Better to be explicit.
>I'm also not sure what this line is supposed to accomplish. Suppose:
>s1 = 54867, s2 = 58467
>s2 - s1 = 3600
>(s2 - s1)/9 = 400
>floor(log10(s2 - s1)) = 3
>power(10,floor(log10(s2 - s1))) = 1000
>So, unless I'm calculating it wrong, the value comes out to .4.
>Even if the problem is being off by an order of magnitude,
Thanks
Tim Wilkinson
"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant
May 24, 2006 at 1:07 pm
Ah. See, when I was working out the hash in Excel, 0 was represented by the first digit to the right of the decimal point (the tenths place). I didn't realize your hash was an integer.
May 24, 2006 at 3:37 pm
I thought the hash was granular enough that it wasn't worth either using a bigint or devising some fiendishly economical but unreadable hash algorithm.
BTW, I've updated the code correcting the error you spotted (which also occurred elsewhere in the code). I needed to divide the difference by 9 to ensure that there would be only 1 significant digit, so that I could reliably calculate the possible location of a switched digit pair. I must say I'm quite pleased with this one, and I might smugly get down to some idle tinkering with execution plans - cheaper than turtlewaxing a Ferrari!
Tim Wilkinson
"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant
Viewing 12 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply