Character matching into patterns plus counts

Question

Post reply

Character matching into patterns plus counts

TCcool

Valued Member

Points: 52
More actions
June 26, 2015 at 11:07 am

#150918

In my case I am trying to see if there has been anyone who has devised a way to obtain an output that would traverse through all records for a given column and find all the patterns (whether it returns REGEX or the VALUES), then count how many times that exists. I know a lot of 1 counts would come back, so I more than likely would include a HAVING > 5 for instance. I do realize I need to have this formatted a certain way and have failed to do so, but I'm under the gun and promise the next time I will format correctly (or come back and edit this).
Table name = dbo.PEOPLE
Columns =
LogID [int] IDENTITY(1,1) NOT NULL
,SourceData [varchar] (50) NOT NULL --which could include values that have numeric, alpha, and special characters
,Account [nvarchar] (50) NULL
EXAMPLE DATA SET of table (pipe delimited)
1|123X-456|56789
2|456/123|A345
3|ABC 345|X567
4|456 HELP-123|YCHB
5|456/987|2345
6|CDE 345|XX987
7|345X-456|YY6767
So what I would like my output to look like is two columns - PATTERN|COUNT
%X-456|2 --pulling in LogID rows 1 and 7, finding the like pattern and omitting everything else replacing with wildcard
4%|3 --pulling in LogID rows 2,4 & 5
45%|3 --pulling in LogID rows 2,4 & 5
456%|3 --pulling in LogID rows 2,4 & 5
456/%|2 --pulling in LogID rows 2 and 5
% 345|2 --pulling in LogID rows 3 and 6
%-4%|2 --pulling in LogID rows 1 and 7
%-45%|2 --pulling in LogID rows 1 and 7
%-456|2 --pulling in LogID rows 1 and 7
Now, if someone has something that returns REGEX pattern vs. actual pattern values, then I would work with that.
Essentially I figured I would need to traverse through each character of a field, after reading the LEN of the field, and starting with position/char 1 reading each character until the end and writing each character to a TEMP with a cursor/CTE. Then coming back through and seeing how many times I can match that pattern where it is, for instance > 5 as stated at first. I know the execution plan will be robust and consume a lot of my SQL server resources, but it has to be easier than hiring people to find patterns.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply

Y.B. SSChampion Points: 11649 More actions · Answer 1

Since your using 2005...

DECLARE @people TABLE (Logid INT IDENTITY(1,1), SourceData NVARCHAR(50), Account NVARCHAR(50))

INSERT INTO @people (SourceData, Account)

SELECT '123X-456','56789' UNION ALL

SELECT '456/123','A345' UNION ALL

SELECT 'ABC 345','X567' UNION ALL

SELECT '456 HELP-123','YCHB' UNION ALL

SELECT '456/987','2345' UNION ALL

SELECT 'CDE 345','XX987' UNION ALL

SELECT '345X-456','YY6767'

SELECT * FROM @people

You're welcome. 😉

SELECT quote FROM brain WHERE original = 1
0 rows returned

Y.B. SSChampion Points: 11649 More actions · Answer 2

Here's what I came up with. You'll have to populate a table with your search patterns.

DECLARE @people TABLE (Logid INT IDENTITY(1,1), SourceData NVARCHAR(50), Account NVARCHAR(50))

DECLARE @check TABLE (Pattern NVARCHAR(10))

INSERT INTO @check

SELECT '%X-456' UNION ALL

SELECT '4%' UNION ALL

SELECT '45%' UNION ALL

SELECT '456%' UNION ALL

SELECT '456/%' UNION ALL

SELECT '% 345' UNION ALL

SELECT '%-4%' UNION ALL

SELECT '%-45%' UNION ALL

SELECT '%-456'

INSERT INTO @people (SourceData, Account)

SELECT '123X-456','56789' UNION ALL

SELECT '456/123','A345' UNION ALL

SELECT 'ABC 345','X567' UNION ALL

SELECT '456 HELP-123','YCHB' UNION ALL

SELECT '456/987','2345' UNION ALL

SELECT 'CDE 345','XX987' UNION ALL

SELECT '345X-456','YY6767'

--SELECT * FROM @people

--SELECT * FROM @check

SELECT

c.pattern,

COUNT(p.LogID) AS [Count]

FROM

@people p

LEFT JOIN @check c ON p.SourceData LIKE c.pattern

GROUP BY

c.pattern

Cheers,

SELECT quote FROM brain WHERE original = 1
0 rows returned

TCcool Valued Member Points: 52 More actions · Answer 3

Thank you for posting the code formatted correctly - do you, by chance, have a link to the formatting code for this forum regarding "How-To", so that I can do just as you did?

TCcool Valued Member Points: 52 More actions · Answer 4

To create a table that houses all the patterns would take much too long. The rows in the actual tables are in the millions and I am looking for the logic that would do the work and traverse through finding all the characters and building all the patterns. Sort of like the Monte-Carlo simulation. http://westclintech.com/Blog/tabid/132/EntryId/28/Monte-Carlo-Simulation-in-SQL-Server.aspx

Y.B. SSChampion Points: 11649 More actions · Answer 5

TCcool (6/26/2015)
Thank you for posting the code formatted correctly - do you, by chance, have a link to the formatting code for this forum regarding "How-To", so that I can do just as you did?

Not a problem...I really should add it to my signature...so I did. 😀

SELECT quote FROM brain WHERE original = 1
0 rows returned

Y.B. SSChampion Points: 11649 More actions · Answer 6

TCcool (6/26/2015)
To create a table that houses all the patterns would take much too long. The rows in the actual tables are in the millions and I am looking for the logic that would do the work and traverse through finding all the characters and building all the patterns. Sort of like the Monte-Carlo simulation. http://westclintech.com/Blog/tabid/132/EntryId/28/Monte-Carlo-Simulation-in-SQL-Server.aspx

There is no good way I know of doing this and certainly not very efficiently. You want to extract every possible pattern derived from your column. Now multiply that by every possible way you could extract a pattern from a single string. So let's take a simple 3 letter string 'ABC', some of your patterns could look like.

ABC, _BC, A_C, AB_, A__, _B_, __C, ___, A%, AB%, A%C, %C, %BC, %B% ***Remember even though AB_ & AB% would return the same for three letter words you want to compare them to all other strings with varying lengths so now AB_ & AB% would yield different results.

So with that in mind ONE string up to 50 characters long (in your case) could generate hundreds of possible patterns. Multiply that out by the millions of rows you indicated and you end up with the mother of all cursors that would take forever to finish. That's before you even eliminated your duplicate patterns and then compared them back to the original strings so you could see how many each of those appear.

That is if I understand you correctly...

SELECT quote FROM brain WHERE original = 1
0 rows returned

TCcool Valued Member Points: 52 More actions · Answer 7

You understood me correctly - now, I can say that the percentage of varying matches, where I would put a having clause > X to reduce what is returned, would be about 10% of all records - which are the distinct values where there would be a solid pattern match occurring multiple times. I am in the same boat as many others are, which is why I have posed this question to many forums, with no good answer. If I had a good starting point though I could work off that.