counting common values of a field within super-fields

  • I'm trying to count common items between pairs of sets. Basically, the job is to rationalise redundant sets of items by making appropriate supersets based on sets with high degrees of commonality. In essence, my data table looks something like this:

    set1   item1

    set1   item2

    set1   item3

    set2   item4

    set3   item5

    ...      ...

    setN   itemM

    The 'items' column includes considerable redundancy. For example, the value of item4 might be identical to that of item1, meaning that this same value is contained in both set1 and set2 (and maybe a lot of other sets). N is ~ 10,000 and M is ~ 60,000.

    I want to build an N x N table in which each cell (i,j) contains the percentage of items in set_i that are also contained in set_j. This table would then serve as a lookup during the subsequent rationalisation process.

    I can write an SQL statement that calculates the number of items from set2 that are common to set1, by hard-coding the values of these fields, but I can't figure out how to do this in a general way (i.e. by rows, or in a single shot for the whole array). I am also unsure of how to populate the table using code (in MS Access).

    I suppose someone out there has done something like this, but my search keywords have brought me no joy.

    Thanks! 

  • Will this code do?

    -- prepare test data

    declare @table table (s int, item int)

    insert @table

    select 1, 1 union all

    select 1, 2 union all

    select 1, 3 union all

    select 2, 2 union all

    select 3, 5

    -- do the work

    select      t1.s s1,

                t2.s s2,

                1.0 * sum(case when t1.item = t2.item then 1 else 0 end) / count(distinct t1.item) matching

    from        @table t1

    inner join  @table t2 on t2.s <> t1.s

    group by    t1.s,

                t2.s

    order by    t1.s,

                t2.s

       

    Good luck! Be aware that this will produce a 100 million row resultset.


    N 56°04'39.16"
    E 12°55'05.25"

  • This will produce only the pairs where there are matches, which will be much smaller resultset that previous

    -- prepare test data

    declare @table table (s int, item int)

    insert @table

    select 1, 1 union all

    select 1, 2 union all

    select 1, 3 union all

    select 2, 2 union all

    select 3, 5

    -- do the work

    select      t1.s s1,

                t2.s s2,

                1.0 * sum(case when t1.item = t2.item then 1 else 0 end) / count(distinct t1.item) matching

    from        @table t1

    inner join  @table t2 on t2.s <> t1.s

    inner join  (

                    select      t1.s s1,

                                t2.s s2

                    from        @table t1

                    inner join  @table t2 on t2.s <> t1.s and t2.item = t1.item

                ) z on t1.s = s1 and t2.s = s2

    group by    t1.s,

                t2.s

    order by    t1.s,

                t2.s

     


    N 56°04'39.16"
    E 12°55'05.25"

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply