Calculate and rank the records based on distinct in attributes values

Question

Calculate and rank the records based on distinct in attributes values

n_a_h_k

Valued Member

Points: 74
More actions
November 18, 2011 at 1:20 am

#263149

Hi Experts,
I have encountered the following problem and i am not sure if SQL server has the capability to solve it.
The problem is that when i query the the table i found that the some records are not totally different from other on values. Some records show one value that cause row1 different than row2. Some values have two attributes values different and some more as you can see in the example. Hence i am wondering if i can use SQL to find the highest records that have highest distinct and rank it in top :hehe:
a1 a2 a3 a4
0 1 1 0
0 1 1 1
0 1 0 1
1 0 0 0
I really appreciate your advice in advance .
Regards:cool:

Viewing 10 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 1

based on what you posted...what results do you expect.

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day

Tom257 SSChampion Points: 11543 More actions · Answer 2

SQL Probably can do what you want, however its not completely clear what that is. As I understand it you want to know the number of differences to the previous row, then order the results showing the most differences first. The following code does that

-- create initial data

CREATE TABLE #Test1

( A1 INT,

A2 INT,

A3 INT,

A4 INT);

INSERT INTO #Test1

SELECT 0, 1, 1, 0 UNION ALL

SELECT 0, 1, 1, 1 UNION ALL

SELECT 0, 1, 0, 1 UNION ALL

SELECT 1, 0, 0, 0;

-- add a unique row identifier

SELECT IDENTITY(INT,1,1) AS ID , *

INTO #Test2

FROM #Test1;

-- join to the previous row and count differences

;WITH ShowDiffs AS (

SELECT T1.*,

CASE WHEN T1.A1 = TPrev.A1 THEN 0 ELSE 1 END AS A1Diff,

CASE WHEN T1.A2 = TPrev.A2 THEN 0 ELSE 1 END AS A2Diff,

CASE WHEN T1.A3 = TPrev.A3 THEN 0 ELSE 1 END AS A3Diff,

CASE WHEN T1.A4 = TPrev.A4 THEN 0 ELSE 1 END AS A4Diff

FROM #Test2 AS T1

JOIN #Test2 AS TPrev ON T1.ID -1 = TPrev.ID

), TotalDiff AS (

SELECT T.*, A1Diff + A2Diff + A3Diff + A4Diff AS TotalDiffs

FROM ShowDiffs AS T

)

-- Rank and order the results

SELECT RANK() OVER (ORDER BY TotalDiffs DESC) AS RANK,

TotalDiffs, A1, A2, A3, A4

FROM TotalDiff;

Results:

RANKTotalDiffsA1A2A3A4

131000

210111

210101

n_a_h_k Valued Member Points: 74 More actions · Answer 3

Thanks for replying to my question.

the out come should be something like this because you need to compare each record three times and sum the distinct values and reorder the records from the highest to the lowest:

a1 a2 a3 a4

0 1 1 0

0 1 1 1

0 1 0 1

1 0 0 0

the outcome should be

a1 a2 a3 a4 total distinct

1 0 0 0 10

0 1 0 1 6

0 1 1 1 6

0 1 1 0 6

because the different between

r1 and r2=1

r1 and r3= 2

r1 and r4= 3

total 6

😉

n_a_h_k Valued Member Points: 74 More actions · Answer 4

Thank you very much for your reply and your solution.

I am not quite sure because it seems that your solution compares the row to the next one which might provide different result from the following:

For example the sql code should compare each record three times and sum the distinct values and order the records from the highest to the lowest:

a1 a2 a3 a4

0 1 1 0

0 1 1 1

0 1 0 1

1 0 0 0

the outcome should be

a1 a2 a3 a4 total distinct

1 0 0 0 10

0 1 0 1 6

0 1 1 1 6

0 1 1 0 6

because the different between

r1 and r2=1

r1 and r3= 2

r1 and r4= 3

total 6

😉

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 5

will you only ever have 4 columns?

what happens when you have more than 4 rows?

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day

Tom257 SSChampion Points: 11543 More actions · Answer 6

OK, with a minor modification this creates the results you want. Instead of comparing each row to the previous row, I use a CROSS JOIN and compare each row to all the other rows (except itself). An additional step and Grouping is needed to sum the diffs.

CREATE TABLE #Test1

( A1 INT,

A2 INT,

A3 INT,

A4 INT);

INSERT INTO #Test1

SELECT 0, 1, 1, 0 UNION ALL

SELECT 0, 1, 1, 1 UNION ALL

SELECT 0, 1, 0, 1 UNION ALL

SELECT 1, 0, 0, 0;

-- first add a unique row identifier

SELECT IDENTITY(INT,1,1) AS ID , *

INTO #Test2

FROM #Test1;

-- join to the previous row and count differences

WITH ShowDiffs AS (

SELECT T1.*,

CASE WHEN T1.A1 = TPrev.A1 THEN 0 ELSE 1 END AS A1Diff,

CASE WHEN T1.A2 = TPrev.A2 THEN 0 ELSE 1 END AS A2Diff,

CASE WHEN T1.A3 = TPrev.A3 THEN 0 ELSE 1 END AS A3Diff,

CASE WHEN T1.A4 = TPrev.A4 THEN 0 ELSE 1 END AS A4Diff

FROM #Test2 AS T1

CROSS JOIN #Test2 AS TPrev

WHERE TPrev.ID <> T1.ID

), TotalDiff AS (

SELECT T.*, A1Diff + A2Diff + A3Diff + A4Diff AS TotalDiffs

FROM ShowDiffs AS T

), Grouped AS (

SELECT ID, A1, A2, A3, A4, SUM(TotalDiffs) AS TotalDiffs

FROM TotalDiff

GRoup BY ID, A1, A2, A3, A4

)

-- Rank and order the results

SELECT RANK() OVER (ORDER BY TotalDiffs DESC) AS RANK,

TotalDiffs, A1, A2, A3, A4

FROM Grouped;

With only 4 rows, its fine but if you start using this for a large number of rows, performance will get worse and worse

n_a_h_k Valued Member Points: 74 More actions · Answer 7

Thanks for your help. 🙂

As my data is large, from your experiences, is there any alternative that helps me to find the highest distinct records with out having the problem of running time. :doze:

Tom257 SSChampion Points: 11543 More actions · Answer 8

The main performance problem is in this part:

FROM #Test2 AS T1

CROSS JOIN #Test2 AS TPrev

WHERE T1.ID <> TPrev.ID

So if you have 4 rows, it creates 4 X (4-1) = 12 results

and 1000 rows gives 999000 results.

The general rule would be N(N-1).

Any graph of N Squared gets progressively steeper.

If you can cut down on the number of rows in this part. e.g. by removing duplicate rows it would help.

Other strategies would be pre-sorting the data and getting only the differences with the previous row, then summing the differences. It becomes a little more complex.

Can you give an indication of the number of rows you would be dealing with?

n_a_h_k Valued Member Points: 74 More actions · Answer 9

at least 100000.

I have to solve the problem by using group by to solve this problem (i have just post the question in this forum titled

Calculation with group by ). Because there is not time problem using group by , i decide to use it to overcome the problem of finding distinct rows. I divide the master table to two small tables but i was not able to calculate the count-row in table B/ count row in table A in away that values in table A and table B must be in master table .