Identify mismatched pair sequences in two datasets

Question

Post reply

Identify mismatched pair sequences in two datasets

SQLServerQuery

Old Hand

Points: 333

December 31, 2022 at 9:32 pm

Go to Answer

#4132294

Hi All,

I’ve got two datasets which should match pairs perfectly. There are however some pairs which have their sequences swapped around. I’d like a way to identify these mismatched pairs (swapped around). CTE?

Please see the below code which replicates the issue as well as below screenshots.

Thanks in advance for the help.

SSQ

DECLARE @Dataset1 AS TABLE
(
[RowID] INT NOT NULL,
[PairID] NVARCHAR(15) NOT NULL,
[SequenceInt] INT NOT NULL,
[FullName] NVARCHAR(200) NOT NULL
)
DECLARE @Dataset2 AS TABLE
(
[RowID] INT NOT NULL,
[PairID] NVARCHAR(15) NOT NULL,
[SequenceInt] INT NOT NULL,
[FullName] NVARCHAR(200) NOT NULL
)

INSERT INTO @Dataset1
(
[RowID],
[PairID],
[SequenceInt],
[FullName]
)


VALUES
(1,'P1',1,'Chandra Omdahl'),
(2,'P1',2,'Ji-U Brun'),
(3,'P2',1,'Jeong-Hui Stankić'),
(4,'P2',2,'Ciel Samuel'),
(5,'P3',1,'Randy Shafir'),
(6,'P3',2,'Dana Mullins'),
(7,'P4',1,'Tola Sierżant'),
(8,'P4',2,'Malone Erős'),
(9,'P5',1,'Odalis Reuter'),
(10,'P5',2,'Chang Niemczyk'),
(11,'P6',1,'Lacey Willems'),
(12,'P6',2,'Haven Zanetti')

INSERT INTO @Dataset2
(
[RowID],
[PairID],
[SequenceInt],
[FullName]
)

VALUES
(1,'P1',1,'Chandra Omdahl'),
(2,'P1',2,'Ji-U Brun'),
(3,'P2',1,'Ciel Samuel'),
(4,'P2',2,'Jeong-Hui Stankić'),
(5,'P3',1,'Randy Shafir'),
(6,'P3',2,'Dana Mullins'),
(7,'P4',1,'Malone Erős'),
(8,'P4',2,'Tola Sierżant'),
(9,'P5',1,'Odalis Reuter'),
(10,'P5',2,'Chang Niemczyk'),
(11,'P6',1,'Lacey Willems'),
(12,'P6',2,'Haven Zanetti')


SELECT
[D1].[RowID] AS [D1_RowID],
[D1].[PairID] AS [D1_PairID],
[D1].[SequenceInt] AS [D1_SequenceInt],
[D1].[FullName] AS [D1_FullName],
[D2].[RowID] AS [D2_RowID],
[D2].[PairID] AS [D2_PairID],
[D2].[SequenceInt] AS [D2_SequenceInt],
[D2].[FullName] AS [D2_FullName]
FROM
@Dataset1 AS [D1]
INNER JOIN @Dataset2 AS [D2] ON [D1].[PairID] = [D2].[PairID] AND [D1].[SequenceInt] = [D2].[SequenceInt]
ORDER BY
[D1].[RowID]

Attachments:

You must be logged in to view attached files.

Jonathan AC Roberts

SSCoach

Points: 18408

January 1, 2023 at 2:37 am

Answer

#4132330

This will identify rows that don't match:

SELECT [D1].[RowID] AS [D1_RowID],
       [D1].[PairID] AS [D1_PairID],
       [D1].[SequenceInt] AS [D1_SequenceInt],
       [D1].[FullName] AS [D1_FullName],
       [D2].[RowID] AS [D2_RowID],
       [D2].[PairID] AS [D2_PairID],
       [D2].[SequenceInt] AS [D2_SequenceInt],
       [D2].[FullName] AS [D2_FullName]
  FROM @Dataset1 AS [D1]
 INNER JOIN @Dataset2 AS [D2] 
         ON [D1].[PairID] = [D2].[PairID]
        AND [D1].[SequenceInt] = [D2].[SequenceInt]
        and d1.FullName <> d2.FullName
 ORDER BY [D1].[RowID];

If you want to also check that the names are swapped then you need to use LEAD/LAG

;WITH CTE AS
(
    SELECT [D1].[RowID]       AS [D1_RowID],
           [D1].[PairID]      AS [D1_PairID],
           [D1].[SequenceInt] AS [D1_SequenceInt],
           [D1].[FullName]    AS [D1_FullName],
           [D2].[RowID]       AS [D2_RowID],
           [D2].[PairID]      AS [D2_PairID],
           [D2].[SequenceInt] AS [D2_SequenceInt],
           [D2].[FullName]    AS [D2_FullName],
           LEAD(D1.[FullName]) OVER (PARTITION BY D1.PairId ORDER BY D1.SequenceInt) AS D1_Lead_FullName,
           LEAD(D2.[FullName]) OVER (PARTITION BY D2.PairId ORDER BY D2.SequenceInt) AS D2_Lead_FullName,
           LAG(D1.[FullName])  OVER (PARTITION BY D1.PairId ORDER BY D1.SequenceInt) AS D1_Lag_FullName,
           LAG(D2.[FullName])  OVER (PARTITION BY D2.PairId ORDER BY D2.SequenceInt) AS D2_Lag_FullName
      FROM @Dataset1 AS [D1]
     INNER JOIN @Dataset2 AS [D2] 
             ON [D1].[PairID] = [D2].[PairID]
            AND [D1].[SequenceInt] = [D2].[SequenceInt]
            AND d1.FullName <> d2.FullName
)
SELECT [D1_RowID],
       [D1_PairID],
       [D1_SequenceInt],
       [D1_FullName],
       [D2_RowID],
       [D2_PairID],
       [D2_SequenceInt],
       [D2_FullName]
  FROM CTE
 WHERE (D1_FullName = D2_Lead_FullName AND D2_FullName = D1_Lead_FullName)
    OR (D1_FullName = D2_Lag_FullName  AND D2_FullName = D1_Lag_FullName);

Viewing 4 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

SQLServerQuery Old Hand Points: 333 More actions · Answer 1

Hi Jonathan AC Roberts,

Thanks for your input. Your LEAD/LAG CTE Solution was pretty much exactly what I wanted.

Thanks

Dennis Jensen SSChasing Mays Points: 606 More actions · Answer 2

Dang that Edit feature is really picky, and to whomever just because I edit the post does not mean it is spam. I just like clean posts.

Okay I was asked by Jonathan AC Roberts to post this here. Saying that in case it reiterates anything already said. But this is what we use as Unit Tests for our stored procedures to make sure the results have not changed when they are not supposed to change.

Now first to find out if you have any records that exist in one table that do not exist in another you need to do the following.

SELECT [d1].[RowId]
      ,[d1].[PairId]
--      ,[d1].[SequenceInt]
      ,[d1].[FullName]
  FROM #DataSet1 AS [d1]
EXCEPT
SELECT [d2].[RowId]
      ,[d2].[PairId]
--      ,[d2].[SequenceInt]
      ,[d2].[FullName]
  FROM #DataSet2 AS [d2]

SELECT [d2].[RowId]
      ,[d2].[PairId]
--      ,[d2].[SequenceInt]
      ,[d2].[FullName]
  FROM #DataSet1 AS [d2]
EXCEPT
SELECT [d1].[RowId]
      ,[d1].[PairId]
--      ,[d1].[SequenceInt]
      ,[d1].[FullName]
  FROM #DataSet2 AS [d1]

If you get any results then not all are duplicated and those should be excluded going forward. However, if because one of these Columns is like an Identity value which means it will most likely be different without actually being so just comment out that row from the EXCEPT queries as demonstrated. Once you understand what you need to exclude perhaps putting the EXCEPT results into a temporary table or whatever means you want to use then you can progress. What this process tells you is what fields to look at when making your comparisons for instance in your example [SequenceInt] is the out of sync field and the rest have repeated data then we just do the following:

SELECT [d2].[RowId]
      ,[d2].[PairId]
      ,[d1].[SequenceInt] AS [SequenceInt_1]
      ,[d2].[SequenceInt] AS [SequenceInt_2]
      ,[d2].[FullName]
  FROM #DataSet1      AS [d1]
 INNER JOIN #DataSet2 AS [d2]
    ON [d2].[RowId]    = [d1].[RowId]
   AND [d2].[PairId]   = [d1].[PairId]
   AND [d2].[FullName] = [d1].[FullName]

This thus shows what rows in DataSet1 match what rows in DataSet2 and what the out of sync values are. This can then be used to correct the out of sync data if that is what is desired. However, if you do this second query without establishing the first query, then you may get a lot of false positives

The key is learning to know your data and what you are dealing with prior to trying to deal with it.