November 18, 2011 at 1:20 am
Hi Experts,
I have encountered the following problem and i am not sure if SQL server has the capability to solve it.
The problem is that when i query the the table i found that the some records are not totally different from other on values. Some records show one value that cause row1 different than row2. Some values have two attributes values different and some more as you can see in the example. Hence i am wondering if i can use SQL to find the highest records that have highest distinct and rank it in top :hehe:
a1 a2 a3 a4
0 1 1 0
0 1 1 1
0 1 0 1
1 0 0 0
I really appreciate your advice in advance .
Regards:cool:
November 18, 2011 at 1:49 am
based on what you posted...what results do you expect.
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
November 18, 2011 at 2:28 am
SQL Probably can do what you want, however its not completely clear what that is. As I understand it you want to know the number of differences to the previous row, then order the results showing the most differences first. The following code does that
-- create initial data
CREATE TABLE #Test1
( A1 INT,
A2 INT,
A3 INT,
A4 INT);
INSERT INTO #Test1
SELECT 0, 1, 1, 0 UNION ALL
SELECT 0, 1, 1, 1 UNION ALL
SELECT 0, 1, 0, 1 UNION ALL
SELECT 1, 0, 0, 0;
-- add a unique row identifier
SELECT IDENTITY(INT,1,1) AS ID , *
INTO #Test2
FROM #Test1;
-- join to the previous row and count differences
;WITH ShowDiffs AS (
SELECT T1.*,
CASE WHEN T1.A1 = TPrev.A1 THEN 0 ELSE 1 END AS A1Diff,
CASE WHEN T1.A2 = TPrev.A2 THEN 0 ELSE 1 END AS A2Diff,
CASE WHEN T1.A3 = TPrev.A3 THEN 0 ELSE 1 END AS A3Diff,
CASE WHEN T1.A4 = TPrev.A4 THEN 0 ELSE 1 END AS A4Diff
FROM #Test2 AS T1
JOIN #Test2 AS TPrev ON T1.ID -1 = TPrev.ID
), TotalDiff AS (
SELECT T.*, A1Diff + A2Diff + A3Diff + A4Diff AS TotalDiffs
FROM ShowDiffs AS T
)
-- Rank and order the results
SELECT RANK() OVER (ORDER BY TotalDiffs DESC) AS RANK,
TotalDiffs, A1, A2, A3, A4
FROM TotalDiff;
Results:
RANKTotalDiffsA1A2A3A4
131000
210111
210101
November 18, 2011 at 3:34 am
Thanks for replying to my question.
the out come should be something like this because you need to compare each record three times and sum the distinct values and reorder the records from the highest to the lowest:
a1 a2 a3 a4
0 1 1 0
0 1 1 1
0 1 0 1
1 0 0 0
the outcome should be
a1 a2 a3 a4 total distinct
1 0 0 0 10
0 1 0 1 6
0 1 1 1 6
0 1 1 0 6
because the different between
r1 and r2=1
r1 and r3= 2
r1 and r4= 3
total 6
😉
November 18, 2011 at 3:40 am
Thank you very much for your reply and your solution.
I am not quite sure because it seems that your solution compares the row to the next one which might provide different result from the following:
For example the sql code should compare each record three times and sum the distinct values and order the records from the highest to the lowest:
a1 a2 a3 a4
0 1 1 0
0 1 1 1
0 1 0 1
1 0 0 0
the outcome should be
a1 a2 a3 a4 total distinct
1 0 0 0 10
0 1 0 1 6
0 1 1 1 6
0 1 1 0 6
because the different between
r1 and r2=1
r1 and r3= 2
r1 and r4= 3
total 6
😉
November 18, 2011 at 3:44 am
will you only ever have 4 columns?
what happens when you have more than 4 rows?
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
November 18, 2011 at 3:55 am
OK, with a minor modification this creates the results you want. Instead of comparing each row to the previous row, I use a CROSS JOIN and compare each row to all the other rows (except itself). An additional step and Grouping is needed to sum the diffs.
CREATE TABLE #Test1
( A1 INT,
A2 INT,
A3 INT,
A4 INT);
INSERT INTO #Test1
SELECT 0, 1, 1, 0 UNION ALL
SELECT 0, 1, 1, 1 UNION ALL
SELECT 0, 1, 0, 1 UNION ALL
SELECT 1, 0, 0, 0;
-- first add a unique row identifier
SELECT IDENTITY(INT,1,1) AS ID , *
INTO #Test2
FROM #Test1;
-- join to the previous row and count differences
WITH ShowDiffs AS (
SELECT T1.*,
CASE WHEN T1.A1 = TPrev.A1 THEN 0 ELSE 1 END AS A1Diff,
CASE WHEN T1.A2 = TPrev.A2 THEN 0 ELSE 1 END AS A2Diff,
CASE WHEN T1.A3 = TPrev.A3 THEN 0 ELSE 1 END AS A3Diff,
CASE WHEN T1.A4 = TPrev.A4 THEN 0 ELSE 1 END AS A4Diff
FROM #Test2 AS T1
CROSS JOIN #Test2 AS TPrev
WHERE TPrev.ID <> T1.ID
), TotalDiff AS (
SELECT T.*, A1Diff + A2Diff + A3Diff + A4Diff AS TotalDiffs
FROM ShowDiffs AS T
), Grouped AS (
SELECT ID, A1, A2, A3, A4, SUM(TotalDiffs) AS TotalDiffs
FROM TotalDiff
GRoup BY ID, A1, A2, A3, A4
)
-- Rank and order the results
SELECT RANK() OVER (ORDER BY TotalDiffs DESC) AS RANK,
TotalDiffs, A1, A2, A3, A4
FROM Grouped;
With only 4 rows, its fine but if you start using this for a large number of rows, performance will get worse and worse
November 18, 2011 at 4:03 am
Thanks for your help. 🙂
As my data is large, from your experiences, is there any alternative that helps me to find the highest distinct records with out having the problem of running time. :doze:
November 18, 2011 at 4:25 am
The main performance problem is in this part:
FROM #Test2 AS T1
CROSS JOIN #Test2 AS TPrev
WHERE T1.ID <> TPrev.ID
So if you have 4 rows, it creates 4 X (4-1) = 12 results
and 1000 rows gives 999000 results.
The general rule would be N(N-1).
Any graph of N Squared gets progressively steeper.
If you can cut down on the number of rows in this part. e.g. by removing duplicate rows it would help.
Other strategies would be pre-sorting the data and getting only the differences with the previous row, then summing the differences. It becomes a little more complex.
Can you give an indication of the number of rows you would be dealing with?
November 18, 2011 at 4:46 am
at least 100000.
I have to solve the problem by using group by to solve this problem (i have just post the question in this forum titled
Calculation with group by ). Because there is not time problem using group by , i decide to use it to overcome the problem of finding distinct rows. I divide the master table to two small tables but i was not able to calculate the count-row in table B/ count row in table A in away that values in table A and table B must be in master table .
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply