June 26, 2015 at 11:07 am
In my case I am trying to see if there has been anyone who has devised a way to obtain an output that would traverse through all records for a given column and find all the patterns (whether it returns REGEX or the VALUES), then count how many times that exists. I know a lot of 1 counts would come back, so I more than likely would include a HAVING > 5 for instance. I do realize I need to have this formatted a certain way and have failed to do so, but I'm under the gun and promise the next time I will format correctly (or come back and edit this).
Table name = dbo.PEOPLE
Columns =
LogID [int] IDENTITY(1,1) NOT NULL
,SourceData [varchar] (50) NOT NULL --which could include values that have numeric, alpha, and special characters
,Account [nvarchar] (50) NULL
EXAMPLE DATA SET of table (pipe delimited)
1|123X-456|56789
2|456/123|A345
3|ABC 345|X567
4|456 HELP-123|YCHB
5|456/987|2345
6|CDE 345|XX987
7|345X-456|YY6767
So what I would like my output to look like is two columns - PATTERN|COUNT
%X-456|2 --pulling in LogID rows 1 and 7, finding the like pattern and omitting everything else replacing with wildcard
4%|3 --pulling in LogID rows 2,4 & 5
45%|3 --pulling in LogID rows 2,4 & 5
456%|3 --pulling in LogID rows 2,4 & 5
456/%|2 --pulling in LogID rows 2 and 5
% 345|2 --pulling in LogID rows 3 and 6
%-4%|2 --pulling in LogID rows 1 and 7
%-45%|2 --pulling in LogID rows 1 and 7
%-456|2 --pulling in LogID rows 1 and 7
Now, if someone has something that returns REGEX pattern vs. actual pattern values, then I would work with that.
Essentially I figured I would need to traverse through each character of a field, after reading the LEN of the field, and starting with position/char 1 reading each character until the end and writing each character to a TEMP with a cursor/CTE. Then coming back through and seeing how many times I can match that pattern where it is, for instance > 5 as stated at first. I know the execution plan will be robust and consume a lot of my SQL server resources, but it has to be easier than hiring people to find patterns.
June 26, 2015 at 11:35 am
Since your using 2005...
DECLARE @people TABLE (Logid INT IDENTITY(1,1), SourceData NVARCHAR(50), Account NVARCHAR(50))
INSERT INTO @people (SourceData, Account)
SELECT '123X-456','56789' UNION ALL
SELECT '456/123','A345' UNION ALL
SELECT 'ABC 345','X567' UNION ALL
SELECT '456 HELP-123','YCHB' UNION ALL
SELECT '456/987','2345' UNION ALL
SELECT 'CDE 345','XX987' UNION ALL
SELECT '345X-456','YY6767'
SELECT * FROM @people
You're welcome. 😉
June 26, 2015 at 11:50 am
Here's what I came up with. You'll have to populate a table with your search patterns.
DECLARE @people TABLE (Logid INT IDENTITY(1,1), SourceData NVARCHAR(50), Account NVARCHAR(50))
DECLARE @check TABLE (Pattern NVARCHAR(10))
INSERT INTO @check
SELECT '%X-456' UNION ALL
SELECT '4%' UNION ALL
SELECT '45%' UNION ALL
SELECT '456%' UNION ALL
SELECT '456/%' UNION ALL
SELECT '% 345' UNION ALL
SELECT '%-4%' UNION ALL
SELECT '%-45%' UNION ALL
SELECT '%-456'
INSERT INTO @people (SourceData, Account)
SELECT '123X-456','56789' UNION ALL
SELECT '456/123','A345' UNION ALL
SELECT 'ABC 345','X567' UNION ALL
SELECT '456 HELP-123','YCHB' UNION ALL
SELECT '456/987','2345' UNION ALL
SELECT 'CDE 345','XX987' UNION ALL
SELECT '345X-456','YY6767'
--SELECT * FROM @people
--SELECT * FROM @check
SELECT
c.pattern,
COUNT(p.LogID) AS [Count]
FROM
@people p
LEFT JOIN @check c ON p.SourceData LIKE c.pattern
GROUP BY
c.pattern
Cheers,
June 26, 2015 at 11:51 am
Thank you for posting the code formatted correctly - do you, by chance, have a link to the formatting code for this forum regarding "How-To", so that I can do just as you did?
June 26, 2015 at 11:53 am
To create a table that houses all the patterns would take much too long. The rows in the actual tables are in the millions and I am looking for the logic that would do the work and traverse through finding all the characters and building all the patterns. Sort of like the Monte-Carlo simulation. http://westclintech.com/Blog/tabid/132/EntryId/28/Monte-Carlo-Simulation-in-SQL-Server.aspx
June 26, 2015 at 11:59 am
TCcool (6/26/2015)
Thank you for posting the code formatted correctly - do you, by chance, have a link to the formatting code for this forum regarding "How-To", so that I can do just as you did?
Not a problem...I really should add it to my signature...so I did. 😀
June 26, 2015 at 1:37 pm
TCcool (6/26/2015)
To create a table that houses all the patterns would take much too long. The rows in the actual tables are in the millions and I am looking for the logic that would do the work and traverse through finding all the characters and building all the patterns. Sort of like the Monte-Carlo simulation. http://westclintech.com/Blog/tabid/132/EntryId/28/Monte-Carlo-Simulation-in-SQL-Server.aspx
There is no good way I know of doing this and certainly not very efficiently. You want to extract every possible pattern derived from your column. Now multiply that by every possible way you could extract a pattern from a single string. So let's take a simple 3 letter string 'ABC', some of your patterns could look like.
ABC, _BC, A_C, AB_, A__, _B_, __C, ___, A%, AB%, A%C, %C, %BC, %B% ***Remember even though AB_ & AB% would return the same for three letter words you want to compare them to all other strings with varying lengths so now AB_ & AB% would yield different results.
So with that in mind ONE string up to 50 characters long (in your case) could generate hundreds of possible patterns. Multiply that out by the millions of rows you indicated and you end up with the mother of all cursors that would take forever to finish. That's before you even eliminated your duplicate patterns and then compared them back to the original strings so you could see how many each of those appear.
That is if I understand you correctly...
June 26, 2015 at 2:01 pm
You understood me correctly - now, I can say that the percentage of varying matches, where I would put a having clause > X to reduce what is returned, would be about 10% of all records - which are the distinct values where there would be a solid pattern match occurring multiple times. I am in the same boat as many others are, which is why I have posed this question to many forums, with no good answer. If I had a good starting point though I could work off that.
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply