Comparing Stored Procedures, Part 6
This picks up from an earlier article on trying to quantify the comparison of two stored procedures.
For a more complete explanation, please go to my blog entry here. This excerpt from my blog helps to explain what this does:
"Initially using an iterative approach to this problem made a lot of sense, since we need to control the order of the subsequences checked (longer before shorter). There didn't seem to be an obvious way to do this with a CTE. The real power of the CTE comes from its feature of recursion - the ability to have a set of data in the CTE refer to itself (please go here for an excellent article on this). I developed a hybrid approach that would combine iterative and recursive code. The recursive CTE in it will select off the longest matching subsequence between the two sequences it can find whose values are all unmatched, save it in a temp table, and then update the two sequences to show the matching values. It would then continue doing this until no more matching subsequences can be found."
Written by Jesse McLain
jesse@jessemclain.com
www.jessemclain.com
http://jessesql.blogspot.com
/*************************************************************************************************
** Proc: "spd_SequenceCompare_Version3"
** Desc: This stored procedure will compare the items in two sequences. The sequences should
** be stored in two tables, Seq1 and Seq2, which should have the following structure
** (the data type for CodeLineTxt can be anything, but the others should not change):
**
** CREATE TABLE Seq1 (
** CodeLineTxt varchar(max), /* stores the original line of code */** CodeLineNum int not null identity(1,1), /* stores the line number */** MatchLineNum int /* stores the matching line of code from spd #2 */** )
**
** This spd will compare smaller and smaller subsequences in the tables, and when it
** finds a match, it will note it in the column MatchLineNum.
**
**
**
** Return values: none
**
** Called by:
**
** Parameters:
** Input
** ----------
** none
**
** Output
** -----------
** none
**
** Auth: Jesse McLain
** Email: jesse@jessemclain.com
** Web: www.jessemclain.com
** Blog: http://jessesql.blogspot.com/2009/02/comparing-stored-procedures-part-6.html
**
** Date: 02/16/2008
**
***************************************************************************************************
** Change History
***************************************************************************************************
** Date: Author: Description:
** -------- -------- -------------------------------------------
** 20080216 Jesse McLain Created script
** 20080223 Jesse McLain Added CTE code to better determine @SubSeqLen value
** 20080225 Jesse McLain Rewrote sequence matching to use recursive CTE code
** 20080226 Jesse McLain Added MAXRECURSION option to recursive CTE code.
** Chgd UPDATEs to set MatchLineNum to CodeLineNum of first
** matching value in other sequence (to show matching blocks)
**
***************************************************************************************************
** Notes
***************************************************************************************************
** 2/25/09 - This excellent article helped with this development:
** http://www.sqlservercentral.com/articles/Development/recursivequeriesinsqlserver2005/1760/
**************************************************************************************************/
CREATE PROCEDURE spd_SequenceCompare_Version3 AS
SET NOCOUNT ON
/* variable @SubSeqLen will store the length of the most recent subsequence matched in the loop */DECLARE @SubSeqLen int
/* set @SubSeqLen to a "dummy" value to ensure that we enter loop the first time */SET @SubSeqLen = -1
WHILE @SubSeqLen <> 0
BEGIN
/* In order to match the subsequences from longest first to shortest last, we're going
to wrap this CTE in a loop. The CTE will pull off the first subsequence it finds,
ordered by length and start position. The reason why we have to pull off one at a
time is because we might have duplicate matches; let's say that we're matching the
strings (sequences of letters) "ABCD" and "ABXYAB". Obviously the longest matching
substrings are "AB", of length 2. If we pulled off all matching substrings of length 2,
then we would get two matches of "AB" to "AB". We only want one, and we arbitrarily
take the first one found. The CTE works such that it will omit any values found to
be a part of a prior match, and this prevents the 2nd "AB" in "ABXYAB" from falsely
matching to anything. */
;WITH Matches (MatchingValue, Seq1Start, Seq1End, Seq2Start, Seq2End, SeqLen) AS (
SELECT
Seq1.CodeLineTxt
,Seq1.CodeLineNum
,Seq1.CodeLineNum
,Seq2.CodeLineNum
,Seq2.CodeLineNum
,1
FROM Seq1
JOIN Seq2
ON Seq1.CodeLineTxt = Seq2.CodeLineTxt
WHERE Seq1.MatchLineNum = 0
AND Seq2.MatchLineNum = 0
UNION ALL
SELECT
Seq1.CodeLineTxt
,Matches.Seq1Start
,Seq1.CodeLineNum
,Matches.Seq2Start
,Seq2.CodeLineNum
,Matches.SeqLen + 1
FROM Seq1
JOIN Seq2
ON Seq1.CodeLineTxt = Seq2.CodeLineTxt
JOIN Matches
ON Seq1.CodeLineNum = Matches.Seq1End + 1
AND Seq2.CodeLineNum = Matches.Seq2End + 1
WHERE Seq1.MatchLineNum = 0
AND Seq2.MatchLineNum = 0
)
SELECT TOP 1 *
INTO #Matched
FROM Matches
ORDER BY SeqLen DESC, Seq1Start, Seq2Start
OPTION (MAXRECURSION 0)
/* after we process the last subsequence, table #Matched will be empty, and
so we assign @SubSeqLen to 0 to indicate this, and our loop will terminate */ SET @SubSeqLen = ISNULL((SELECT SeqLen FROM #Matched), 0)
/* update the sequences to note any matches - if we're out of matches, this
will do nothing, and we'll proceed out of the loop */ UPDATE S
SET MatchLineNum = M.Seq2Start
FROM Seq1 S
JOIN #Matched M
ON S.CodeLineNum BETWEEN M.Seq1Start AND M.Seq1End
UPDATE S
SET MatchLineNum = M.Seq1Start
FROM Seq2 S
JOIN #Matched M
ON S.CodeLineNum BETWEEN M.Seq2Start AND M.Seq2End
DROP TABLE #Matched
END
RETURN