Tough problem

Question

Post reply

Tough problem

Jules Bonnot

SSCertifiable

Points: 5488
More actions
May 24, 2006 at 8:06 am

#113810

http://www.developmentnow.com/g/113_2005_5_0_0_527840/TRANSPOSE-QUERY-fat-fingers.htm
i have been trying to figure out a way of doing this but havnt had much luck. Any ideas?
www.sql-library.com[/url]

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply

stax68 SSChampion Points: 11711 More actions · Answer 1

Here's the text:

"I would like a query that pulls back pairs of numbers that should be the

same, but one was mistyped.

Specifically, I am looking for numbers where two neighboring digits were

transposed in one of the numbers."

Hmmm, I think I have some ideas...

Tim Wilkinson

"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant

Marshall Smith SSCrazy Points: 2041 More actions · Answer 2

Well, I can tell you very rapidly if any two numbers have a pair of digits transposed. The difference will always be divisible by 9. (Old accounting trick, I can show you why it works if you want.) But, that won't tell you which digits were transposed. It's also not a guaranteed check, as two totally different numbers can also have a difference that's divisible by 9.

The Levenshtein algorithm mentioned in a reply to that post would seem to work, but, again, you can't guarantee what difference is triggering the count.

My solution, off the top of my head, would be separate the digits into an array, and run a couple of loops for comparison. It would be extremely clunky, but, properly done, would have the advantage of maximum information (it would be able to tell you which two digits were transposed), and minimum false positives.

stax68 SSChampion Points: 11711 More actions · Answer 3

Yes, I found that mod 9 rule empirically using excel. there had to be something useful about the similarity and 'reverseness' of the numbers.

That is definitely the way to go. I think I've nearly(!) got a reasonable solution worked out... It's making my laptop creak a bit though at the moment.

Tim Wilkinson

"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant

stax68 SSChampion Points: 11711 More actions · Answer 4

Right I think I have it. I was hoping not to use any string functions, but I did in the end rather than using log10() a hundred times. I've used arithmetic to limit the rows as much as I can, then the string functions to do the final check. code follows...

There is a problem that leading zeros are not ahndled, but we are talking about integer literals, I have assumed.

Tim Wilkinson

"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant

stax68 SSChampion Points: 11711 More actions · Answer 5

/***find pairs of numbers equal bar one transposition of adjacent digits**/

/*************************************************************************/

/*******code copyright (c) tim wilkinson (stax68) 2006********************/

/*******free for non-commercial use***************************************/

/*******compatibility: MS SQL Server v9.0 only ***************************/

/*******other versions can be done...*************************************/

/*************************************************************************/

SET

ANSI_NULLS ON

SET ANSI_PADDING ON

SET ANSI_WARNINGS ON

SET ARITHABORT ON

SET CONCAT_NULL_YIELDS_NULL ON

SET NUMERIC_ROUNDABORT OFF

SET QUOTED_IDENTIFIER ON

SET NOCOUNT ON

go

/********************DDL and population********************/

if exists(select * from sysobjects where name = 'someintegers' and xtype = 'U')

drop table someintegers

go

create

table someintegers(

integer int identity

,

char1 char(1)

,hash as power(10,(floor(integer/1000000000.0) % 10)-1)

+ power(10,(floor(integer/100000000.0) % 10)-1)

+ power(10,(floor(integer/10000000.0) % 10)-1)

+ power(10,(floor(integer/1000000.0) % 10)-1)

+ power(10,(floor(integer/100000.0) % 10)-1)

+ power(10,(floor(integer/10000.0) % 10)-1)

+ power(10,(floor(integer/1000.0) % 10)-1)

+ power(10,(floor(integer/100.0) % 10)-1)

+ power(10,(floor(integer/10.0) % 10)-1)

+ power(10,(floor(integer/1.0) % 10)-1)

persisted

,

magnitude as cast(floor(log10(integer)) + 1 as tinyint) persisted

)

go

create clustered index IXC_someintegers on someintegers(magnitude,hash,integer)

go

insert someintegers(char1) select top 50000 null from syscolumns s1 cross join syscolumns s2

go

declare @rows_vc varchar(20)

select

@rows_vc = cast(power(cast(count(*) as bigint),2) as varchar(20)) from someintegers

select

'starting query over ' + @rows_vc + ' permutations: ' + convert(varchar,getdate(),113)

START__________________________________________________________________

/********************The query:*************************/

select

s1.integer, s2.integer

from

someintegers s1

join

someintegers s2

on

s1.magnitude = s2.magnitude and s1.hash = s2.hash and

s1.integer < s2.integer

and

abs(s1.integer-s2.integer)%9 = 0

--subsequent correction to fault as reported below

and

abs(s1.integer-s2.integer)/9.0/power(10,floor(log10(abs(s1.integer-s2.integer)/9)))

between 1 and 8

and

stuff(s1.integer,cast(floor(log10(s1.integer))-floor(log10(abs(s1.integer-s2.integer)/9)) as int),2,'')

= stuff(s2.integer,cast(floor(log10(s2.integer))-floor(log10(abs(s1.integer-s2.integer)/9)) as int),2,'')

and substring(cast(s1.integer as varchar),cast(floor(log10(s1.integer))-floor(log10(abs(s1.integer-s2.integer)/9)) as int),2)

= reverse(substring(cast(s2.integer as varchar),cast(floor(log10(s2.integer))-floor(log10(abs(s1.integer-s2.integer)/9)) as int),2))

where

s1.magnitude > 1

and

s2.magnitude > 1

--and (s1.integer = 54687 and s2.integer = 54867) --test condition as below

order

by s1.magnitude, s1.integer

select

'query ends: ' + convert(varchar,getdate(),113)

END____________________________________________________________________

go

drop

table someintegers

Tim Wilkinson

"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant

Jules Bonnot SSCertifiable Points: 5488 More actions · Answer 6

OMG!

Thats ill. Ill, yet effective. Am still trying to get my head round how it works but somehow it does.

Nice one.

Thanks,

www.sql-library.com[/url]

stax68 SSChampion Points: 11711 More actions · Answer 7

sorry that it's v9 only. It's only because of the performance enhancements used, which could be removed.

Tim Wilkinson

"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant

Marshall Smith SSCrazy Points: 2041 More actions · Answer 8

Using the hash as a computed column is very nice. That should refine your result set pretty nicely.

---

I'm not sure what the point of the magnitude column is. I can't come up with a pair of numbers which would have the same hash, but different magnitudes. Especially as you then check the same condition again, with:

and floor(log10(s1.integer)) = floor(log10(s2.integer))

It would be easy enough to replace the reference to magnitude in the WHERE clause with:

s1.integer > 10 AND s2.integer > 10

---

With this line:

and abs(s1.integer-s2.integer)/9/power(10,floor(log10(abs(s1.integer-s2.integer)))) between 1 and 8

First of all, I think you should add a set of parens. I'm not sure what the order of precedence is for those divisions. But, I do know that (a/b)/c a/(b/c). Better to be explicit.

I'm also not sure what this line is supposed to accomplish. Suppose:

s1 = 54867, s2 = 58467

s2 - s1 = 3600

(s2 - s1)/9 = 400

floor(log10(s2 - s1)) = 3

power(10,floor(log10(s2 - s1))) = 1000

So, unless I'm calculating it wrong, the value comes out to .4. Which is not between 1 and 8. And, yet, this pair of numbers clearly should be included.

Even if the problem is being off by an order of magnitude, I'm still not sure what property you are trying to test for here. What would it prove if the value came out to 8.1?

---

All in all, a very tight solution. I like it. How long did it take to run on your laptop?

stax68 SSChampion Points: 11711 More actions · Answer 9

>I'm not sure what the point of the magnitude column is.
>I can't come up with a pair of numbers which would have the same hash, but different magnitudes.

zero isn't represented in the hash value (not enough digits in a 32b int), so hash(180)=hash(18)=hash(80100).

>Especially as you then check the same condition again, with: >and floor(log10(s1.integer)) = floor(log10(s2.integer))

yes - that was left over from development, could be deleted, though see below re optimiser.
>It would be easy enough to replace the reference to magnitude in the WHERE clause with: >s1.integer > 10 AND s2.integer > 10
yes, but I deliberately tried to use magnitude, e.g. where s1.magnitude > 1 and s2.magnitude > 1 as it is the leading key of the clustered index - and I (half) thought it might be useful for the optimiser to use as a coarser hashing key, along with the quite low cardinality hash value. I haven't really had time to tune it so just tried what I thought would work best.
>With this line: >and abs(s1.integer-s2.integer)/9/power(10,floor(log10(abs(s1.integer-s2.integer)))) between 1 and 8 >First of all, I think you should add a set of parens. I'm not sure what the order of precedence is for >those divisions. But, I do know that (a/b)/c <> a/(b/c). Better to be explicit.

It's left to right for identical operators (and all those of the same precedence?) - I don't like too many parenths, just as I allow SQL to use the default top-to-bottom order of parsing join clauses. Still you're probably right. >I'm also not sure what this line is supposed to accomplish. Suppose: >s1 = 54867, s2 = 58467 >s2 - s1 = 3600 >(s2 - s1)/9 = 400 >floor(log10(s2 - s1)) = 3 >power(10,floor(log10(s2 - s1))) = 1000 >So, unless I'm calculating it wrong, the value comes out to .4.
>Which is not between 1 and 8.
>And, yet, this pair of numbers clearly should be included. >Even if the problem is being off by an order of magnitude,
>I'm still not sure what property you are trying to test for here.
>What would it prove if the value came out to 8.1?
Yes, the problem is the order of magnitude. I need to use floor(log10(s2 - s1)) - 1.
This clause was intended to eliminate those pairs where although the difference is a multiple of 9, it represents, even modulo 10, 9 multiplied by more than 9. Our target numbers will have the property that the digits in the appropriate positions never represent a figure > 9*8. There is a bit of redundancy in the join predicate, as you noted - but I like to give the optimiser as many options as possible before getting rid of anything.
>All in all, a very tight solution. I like it.

Thanks

>How long did it take to run on your laptop?

14s overall, 6.5s for the query - that's with 'top 50000'.
500000 and you're talking a few minutes.

Tim Wilkinson

"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant

Marshall Smith SSCrazy Points: 2041 More actions · Answer 10

Ah. See, when I was working out the hash in Excel, 0 was represented by the first digit to the right of the decimal point (the tenths place). I didn't realize your hash was an integer.

stax68 SSChampion Points: 11711 More actions · Answer 11

I thought the hash was granular enough that it wasn't worth either using a bigint or devising some fiendishly economical but unreadable hash algorithm.

BTW, I've updated the code correcting the error you spotted (which also occurred elsewhere in the code). I needed to divide the difference by 9 to ensure that there would be only 1 significant digit, so that I could reliably calculate the possible location of a switched digit pair. I must say I'm quite pleased with this one, and I might smugly get down to some idle tinkering with execution plans - cheaper than turtlewaxing a Ferrari!

Tim Wilkinson

"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant