Thoughts on comparing two result sets

  • Hi,

    I have extracted the metadata from most of the reports in our reporting system for the purpose of comparing their field usage to see if any are identical or (even better), to determine a % of matching fields for those reports that match on at least one field. Does anyone have any thoughts on an effective approach to this? I know there are a ton of ways to skin this cat, so I'm trying to see if anyone has done it before. Here's a sample of how my table is structured.

    ReportID, FieldName, TotalNumberOfFields

    1.ID,5

    1,FirstName,5

    1,MiddleName,5

    1,LastName,5

    1,BirthDate,5

    2,ID,2

    2,LastNAme,2

    3,ID,1

    4,HairColor,3

    4,EyeColor,3

    4,Ethnicity,3

    You can see that 100% of the fields on reports 2 and 3 are encompassed by report 1. There is no overlap for report 4. Again, just trying to get some ideas on how I might approach this.

    Thanks in advance,

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Quick suggestion

    😎

    /* Safe place */

    USE tempdb;

    GO

    SET NOCOUNT ON;

    GO

    DECLARE @METADATA TABLE

    (

    ReportID INT NOT NULL

    ,FieldName NVARCHAR(128) NOT NULL

    ,TotalNumberOfFields INT NOT NULL

    );

    INSERT INTO @METADATA(ReportID,FieldName ,TotalNumberOfFields)

    VALUES (1 ,N'ID' ,5 )

    ,(1 ,N'FirstName' ,5 )

    ,(1 ,N'MiddleName' ,5 )

    ,(1 ,N'LastName' ,5 )

    ,(1 ,N'BirthDate' ,5 )

    ,(2 ,N'ID' ,2 )

    ,(2 ,N'LastNAme' ,2 )

    ,(3 ,N'ID' ,1 )

    ,(4 ,N'HairColor' ,3 )

    ,(4 ,N'EyeColor' ,3 )

    ,(4 ,N'Ethnicity' ,3 )

    ;

    ;WITH BASE_DATA AS

    (

    SELECT

    MF.FieldName

    ,MD.ReportID

    ,DENSE_RANK() OVER

    (

    ORDER BY MF.FieldName

    ) AS FN_DRNK

    FROM @METADATA MD

    CROSS APPLY

    (

    SELECT

    DISTINCT M.FieldName

    FROM @METADATA M

    ) AS MF(FieldName)

    WHERE MD.FieldName = MF.FieldName

    )

    ,REPORTS AS

    (

    SELECT

    DISTINCT MDR.ReportID

    ,N'REPORT ' + CONVERT(NVARCHAR(10),MDR.ReportID,0) AS R_NAME

    FROM @METADATA MDR

    )

    ,REPORT_FIELD_LIST AS

    (

    SELECT DISTINCT

    BD.ReportID

    ,RS.R_NAME

    ,(SELECT

    NCHAR(124) + CONVERT(NVARCHAR(10),SBD.FN_DRNK,0)

    FROM BASE_DATA SBD

    WHERE SBD.ReportID = BD.ReportID

    FOR XML PATH(''), TYPE).value('.[1]','NVARCHAR(1000)') AS FLD_LIST

    FROM BASE_DATA BD

    INNER JOIN REPORTS RS

    ON BD.ReportID = RS.ReportID

    )

    SELECT

    RF.R_NAME

    ,RF.FLD_LIST

    ,R2.R_NAME

    ,R2.FLD_LIST

    FROM REPORT_FIELD_LIST RF

    OUTER APPLY REPORT_FIELD_LIST R2

    WHERE CHARINDEX(R2.FLD_LIST,RF.FLD_LIST,1) > 0

    AND RF.ReportID <> R2.ReportID;

    Results

    R_NAME FLD_LIST R_NAME FLD_LIST

    --------- ----------- --------- ---------

    REPORT 1 |1|4|6|7|8 REPORT 2 |6|7

    REPORT 1 |1|4|6|7|8 REPORT 3 |6

    REPORT 2 |6|7 REPORT 3 |6

  • Thanks, Eirikur!

    I see what you're doing but I'm not sure how I can use that last column to determine the overlap of fields between reports. Perhaps I'm missing something. Can you explain?

    Thanks,

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Here's an alternate approach...

    IF OBJECT_ID('tempdb..#ReportFields') IS NOT NULL

    DROP TABLE #ReportFields;

    CREATE TABLE #ReportFields (

    ReportID INT,

    FieldName VARCHAR(20),

    TotalNumberOfFields INT

    );

    INSERT #ReportFields (ReportID, FieldName, TotalNumberOfFields) VALUES

    (1,'ID',5),

    (1,'FirstName',5),

    (1,'MiddleName',5),

    (1,'LastName',5),

    (1,'BirthDate',5),

    (2,'ID',2),

    (2,'LastNAme',2),

    (3,'ID',1),

    (4,'HairColor',3),

    (4,'EyeColor',3),

    (4,'Ethnicity',3);

    SELECT

    rf1.ReportID AS Report_1,

    rf2.ReportID AS Report_2,

    COUNT(*) AS CountOfCommonFields,

    COUNT(*) * 1.0 / MIN(rf1.TotalNumberOfFields) * 100 AS Report_1_Percent,

    COUNT(*) * 1.0 / MIN(rf2.TotalNumberOfFields) * 100 AS Report_2_Percent

    FROM

    #ReportFields rf1

    JOIN #ReportFields rf2

    ON rf1.ReportID < rf2.ReportID

    AND rf1.FieldName = rf2.FieldName

    GROUP BY

    rf1.ReportID,

    rf2.ReportID

    Results...

    Report_1 Report_2 CountOfCommonFields Report_1_Percent Report_2_Percent

    ----------- ----------- ------------------- --------------------------------------- ---------------------------------------

    1 2 2 40.000000000000 100.000000000000

    1 3 1 20.000000000000 100.000000000000

    2 3 1 50.000000000000 100.000000000000

  • This is incredibly helpful. Thank you for your input!!

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Grad to help. 🙂

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply