Calculate and rank the records based on distinct in attributes values

  • Hi Experts,

    I have encountered the following problem and i am not sure if SQL server has the capability to solve it.

    The problem is that when i query the the table i found that the some records are not totally different from other on values. Some records show one value that cause row1 different than row2. Some values have two attributes values different and some more as you can see in the example. Hence i am wondering if i can use SQL to find the highest records that have highest distinct and rank it in top :hehe:

    a1 a2 a3 a4

    0 1 1 0

    0 1 1 1

    0 1 0 1

    1 0 0 0

    I really appreciate your advice in advance .

    Regards:cool:

  • based on what you posted...what results do you expect.

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • SQL Probably can do what you want, however its not completely clear what that is. As I understand it you want to know the number of differences to the previous row, then order the results showing the most differences first. The following code does that

    -- create initial data

    CREATE TABLE #Test1

    ( A1 INT,

    A2 INT,

    A3 INT,

    A4 INT);

    INSERT INTO #Test1

    SELECT 0, 1, 1, 0 UNION ALL

    SELECT 0, 1, 1, 1 UNION ALL

    SELECT 0, 1, 0, 1 UNION ALL

    SELECT 1, 0, 0, 0;

    -- add a unique row identifier

    SELECT IDENTITY(INT,1,1) AS ID , *

    INTO #Test2

    FROM #Test1;

    -- join to the previous row and count differences

    ;WITH ShowDiffs AS (

    SELECT T1.*,

    CASE WHEN T1.A1 = TPrev.A1 THEN 0 ELSE 1 END AS A1Diff,

    CASE WHEN T1.A2 = TPrev.A2 THEN 0 ELSE 1 END AS A2Diff,

    CASE WHEN T1.A3 = TPrev.A3 THEN 0 ELSE 1 END AS A3Diff,

    CASE WHEN T1.A4 = TPrev.A4 THEN 0 ELSE 1 END AS A4Diff

    FROM #Test2 AS T1

    JOIN #Test2 AS TPrev ON T1.ID -1 = TPrev.ID

    ), TotalDiff AS (

    SELECT T.*, A1Diff + A2Diff + A3Diff + A4Diff AS TotalDiffs

    FROM ShowDiffs AS T

    )

    -- Rank and order the results

    SELECT RANK() OVER (ORDER BY TotalDiffs DESC) AS RANK,

    TotalDiffs, A1, A2, A3, A4

    FROM TotalDiff;

    Results:

    RANKTotalDiffsA1A2A3A4

    131000

    210111

    210101

  • Thanks for replying to my question.

    the out come should be something like this because you need to compare each record three times and sum the distinct values and reorder the records from the highest to the lowest:

    a1 a2 a3 a4

    0 1 1 0

    0 1 1 1

    0 1 0 1

    1 0 0 0

    the outcome should be

    a1 a2 a3 a4 total distinct

    1 0 0 0 10

    0 1 0 1 6

    0 1 1 1 6

    0 1 1 0 6

    because the different between

    r1 and r2=1

    r1 and r3= 2

    r1 and r4= 3

    total 6

    😉

  • Thank you very much for your reply and your solution.

    I am not quite sure because it seems that your solution compares the row to the next one which might provide different result from the following:

    For example the sql code should compare each record three times and sum the distinct values and order the records from the highest to the lowest:

    a1 a2 a3 a4

    0 1 1 0

    0 1 1 1

    0 1 0 1

    1 0 0 0

    the outcome should be

    a1 a2 a3 a4 total distinct

    1 0 0 0 10

    0 1 0 1 6

    0 1 1 1 6

    0 1 1 0 6

    because the different between

    r1 and r2=1

    r1 and r3= 2

    r1 and r4= 3

    total 6

    😉

  • will you only ever have 4 columns?

    what happens when you have more than 4 rows?

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • OK, with a minor modification this creates the results you want. Instead of comparing each row to the previous row, I use a CROSS JOIN and compare each row to all the other rows (except itself). An additional step and Grouping is needed to sum the diffs.

    CREATE TABLE #Test1

    ( A1 INT,

    A2 INT,

    A3 INT,

    A4 INT);

    INSERT INTO #Test1

    SELECT 0, 1, 1, 0 UNION ALL

    SELECT 0, 1, 1, 1 UNION ALL

    SELECT 0, 1, 0, 1 UNION ALL

    SELECT 1, 0, 0, 0;

    -- first add a unique row identifier

    SELECT IDENTITY(INT,1,1) AS ID , *

    INTO #Test2

    FROM #Test1;

    -- join to the previous row and count differences

    WITH ShowDiffs AS (

    SELECT T1.*,

    CASE WHEN T1.A1 = TPrev.A1 THEN 0 ELSE 1 END AS A1Diff,

    CASE WHEN T1.A2 = TPrev.A2 THEN 0 ELSE 1 END AS A2Diff,

    CASE WHEN T1.A3 = TPrev.A3 THEN 0 ELSE 1 END AS A3Diff,

    CASE WHEN T1.A4 = TPrev.A4 THEN 0 ELSE 1 END AS A4Diff

    FROM #Test2 AS T1

    CROSS JOIN #Test2 AS TPrev

    WHERE TPrev.ID <> T1.ID

    ), TotalDiff AS (

    SELECT T.*, A1Diff + A2Diff + A3Diff + A4Diff AS TotalDiffs

    FROM ShowDiffs AS T

    ), Grouped AS (

    SELECT ID, A1, A2, A3, A4, SUM(TotalDiffs) AS TotalDiffs

    FROM TotalDiff

    GRoup BY ID, A1, A2, A3, A4

    )

    -- Rank and order the results

    SELECT RANK() OVER (ORDER BY TotalDiffs DESC) AS RANK,

    TotalDiffs, A1, A2, A3, A4

    FROM Grouped;

    With only 4 rows, its fine but if you start using this for a large number of rows, performance will get worse and worse

  • Thanks for your help. 🙂

    As my data is large, from your experiences, is there any alternative that helps me to find the highest distinct records with out having the problem of running time. :doze:

  • The main performance problem is in this part:

    FROM #Test2 AS T1

    CROSS JOIN #Test2 AS TPrev

    WHERE T1.ID <> TPrev.ID

    So if you have 4 rows, it creates 4 X (4-1) = 12 results

    and 1000 rows gives 999000 results.

    The general rule would be N(N-1).

    Any graph of N Squared gets progressively steeper.

    If you can cut down on the number of rows in this part. e.g. by removing duplicate rows it would help.

    Other strategies would be pre-sorting the data and getting only the differences with the previous row, then summing the differences. It becomes a little more complex.

    Can you give an indication of the number of rows you would be dealing with?

  • at least 100000.

    I have to solve the problem by using group by to solve this problem (i have just post the question in this forum titled

    Calculation with group by ). Because there is not time problem using group by , i decide to use it to overcome the problem of finding distinct rows. I divide the master table to two small tables but i was not able to calculate the count-row in table B/ count row in table A in away that values in table A and table B must be in master table .

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply