December 1, 2016 at 10:41 am
i have below sample data, i try to extract valid credit card data. But i was not able to do so ,since data is not in good format. Can any one help?
below sample data have a transaction code which is also 16 numbers or more which should be considered as credit card number
DECLARE @test-2 TABLE (memo varchar(400) );
INSERT INTO @test-2
SELECT 'abc this account used visa :1234567890123456' UNION ALL
SELECT 'this account visa 1234567890123456 request for receipt' UNION ALL
SELECT 'this account visa 123456789012345612/13 exp' UNION ALL
SELECT 'invoice card used 4123 4567 8901 2345 some text' UNION ALL
SELECT 'paid using 4123 4567 8901 2345 thank you' UNION ALL
SELECT 'Transaction code 123456454636667445 copy invoice'UNION ALL
SELECT 'mastercard cc#5987-6543-2109-8765. this order will be processed' UNION ALL
SELECT 'mastercard 5987-6543-2109-8765, thank you' union all
SELECT ' period 1999-10-18 - -2000-10-17 '; ;
SELECT *
FROM @test-2
WHERE --master or visa 16
patindex( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', memo) <>0
or patindex( '%[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9]%', memo) <>0
or patindex( '%[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]%', memo) <>0
--visa 13
or patindex( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', memo) <>0
-- amex 15
or patindex( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', memo) <>0
or patindex( '%[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9][0-9][0-9] [0-9][0-9] [0-9][0-9][0-9]%', memo) <>0
or patindex( '%[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9]%', memo) <>0
December 1, 2016 at 11:06 am
PATINDEX gives you the position of the start of the match. So just SUBSTRING them out. The sneaky part here is doing it in a single pass. Once again my favorite TSQL thingy comes into play - CASE.
SELECT CASE WHEN PATINDEX( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', memo) <> 0 THEN
SUBSTRING(memo, PATINDEX( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', memo), 16)
WHEN patindex( '%[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9]%', memo) <> 0 THEN
SUBSTRING(memo, patindex( '%[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9]%', memo), 19)
--the rest are left to the inquisitive reader :)
END AS CCNum
FROM @test-2
That code isn't quite complete but I will let you flesh it out. You will need to test the character after the last digit in the sequence (only) to see if it is a number in order to determine if it is a 16 VISA/MC or 15 AMEX. I don't know of a VISA that is just 13 but same thing applies there.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
December 1, 2016 at 11:11 am
This might give invalid card numbers, but it's worth a try.
WITH ctePatterns AS(
SELECT *
FROM (VALUES
( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', 16)
,( '%[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9]%', 19)
,( '%[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]%', 19)
--visa 13
,( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', 13)
-- amex 15
,( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', 15)
,( '%[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9][0-9][0-9] [0-9][0-9] [0-9][0-9][0-9]%', 18)
,( '%[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9]%', 18) )p(Pattern, Length)
)
SELECT t.memo, SUBSTRING( t.memo, PATINDEX( p.Pattern, t.memo), p.Length)
FROM @test-2 t
CROSS APPLY (SELECT TOP 1 *
FROM ctePatterns p
WHERE t.memo LIKE p.Pattern
ORDER BY Length DESC)p;
December 1, 2016 at 11:17 am
Hey thanks for your reply, another issue i have is i dont want to transaction ID which is not a credit card number it has more than 16 numbers.
December 1, 2016 at 11:34 am
How can you differentiate it from the visa card that has the expiration date together?
Here's another option that works with your sample data. The function is posted and explained in here: http://www.sqlservercentral.com/articles/String+Manipulation/94365/
SELECT memo,
CASE WHEN Item LIKE '%/__' THEN LEFT( Item, 16)
ELSE Item END AS ccard
FROM @test-2
CROSS APPLY dbo.PatternSplitCM( REPLACE( REPLACE( memo, ' ', ''), '-', ''), '%[0-9/]%')x
WHERE Matched = 1
AND( LEN(Item) <= 16
OR Item LIKE '%/__')
December 1, 2016 at 12:42 pm
thanks luis for your reply i was testing your solution and found there are more scenarios in my bad data. the system i am working on SQL 2005 so change function as below using a tally table.
DECLARE @StartTime DATETIME --Timer to measure total duration
SET @StartTime = GETDATE() --Start the timer
--=============================================================================
-- Create and populate a Tally table
--=============================================================================
--===== Conditionally drop
IF OBJECT_ID('dbo.Tally') IS NOT NULL
DROP TABLE dbo.Tally
--===== Create and populate the Tally table on the fly
SELECT TOP 11000000 --equates to more than 30 years of dates
IDENTITY(INT,1,1) AS N
INTO dbo.Tally
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2
--===== Add a Primary Key to maximize performance
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
--===== Let the public use it
GRANT SELECT, REFERENCES ON dbo.Tally TO PUBLIC
--===== Display the total duration
SELECT STR(DATEDIFF(ms,@StartTime,GETDATE())) + ' Milliseconds duration'
create FUNCTION [dbo].[PatternSplitCM]
(
@List VARCHAR(8000) = NULL
,@Pattern VARCHAR(50)
) RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
WITH numbers AS (
SELECT TOP(ISNULL(DATALENGTH(@List), 0))
n
FROM dbo.Tally t
WHERE N <= LEN(@List)+1)
SELECT
ItemNumber = ROW_NUMBER() OVER(ORDER BY MIN(n)),
Item = SUBSTRING(@List,MIN(n),1+MAX(n)-MIN(n)),
[Matched]
FROM (
SELECT n, y.[Matched], Grouper = n - ROW_NUMBER() OVER(ORDER BY y.[Matched],n)
FROM numbers
CROSS APPLY (
SELECT [Matched] = CASE WHEN SUBSTRING(@List,n,1) LIKE @Pattern THEN 1 ELSE 0 END
) y
) d
GROUP BY [Matched], Grouper
As you were how do you know whether the number is a true credit card or not, i was reading one of the article in the forum where it is mentioned all visa cards will start with number sequence 4(http://www.sqlservercentral.com/Forums/Topic1437053-391-1.aspx). i was think adding this additional filter will retrieve correct results.
If you look at the test data i added just before posting this thread where a dob is showing in the result because it matches with the pattern.
December 1, 2016 at 1:06 pm
You could have also changed it like this and keep it as a function with zero reads.
CREATE FUNCTION [dbo].[PatternSplitCM](@List [varchar](8000) = NULL, @Pattern [varchar](50))
RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
WITH cterows AS(
SELECT 0 n UNION ALL SELECT 0 n UNION ALL SELECT 0 n UNION ALL SELECT 0 n UNION ALL SELECT 0 n UNION ALL
SELECT 0 n UNION ALL SELECT 0 n UNION ALL SELECT 0 n UNION ALL SELECT 0 n UNION ALL SELECT 0 n
),
numbers AS (
SELECT TOP(ISNULL(DATALENGTH(@List), 0))
n = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
FROM cterows d, cterows e, cterows f, cterows g)
SELECT
ItemNumber = ROW_NUMBER() OVER(ORDER BY MIN(n)),
Item = SUBSTRING(@List,MIN(n),1+MAX(n)-MIN(n)),
[Matched]
FROM (
SELECT n, y.[Matched], Grouper = n - ROW_NUMBER() OVER(ORDER BY y.[Matched],n)
FROM numbers
CROSS APPLY (
SELECT [Matched] = CASE WHEN SUBSTRING(@List,n,1) LIKE @Pattern COLLATE Latin1_General_Bin THEN 1 ELSE 0 END
) y
) d
GROUP BY [Matched], Grouper;
It depends on what resource you want to use (RAM or disk)
December 1, 2016 at 1:09 pm
mxy (12/1/2016)
i have below sample data, i try to extract valid credit card data. But i was not able to do so ,since data is not in good format. Can any one help?below sample data have a transaction code which is also 16 numbers or more which should be considered as credit card number
DECLARE @test-2 TABLE (memo varchar(400) );
INSERT INTO @test-2
SELECT 'abc this account used visa :1234567890123456' UNION ALL
SELECT 'this account visa 1234567890123456 request for receipt' UNION ALL
SELECT 'this account visa 123456789012345612/13 exp' UNION ALL
SELECT 'invoice card used 4123 4567 8901 2345 some text' UNION ALL
SELECT 'paid using 4123 4567 8901 2345 thank you' UNION ALL
SELECT 'Transaction code 123456454636667445 copy invoice'UNION ALL
SELECT 'mastercard cc#5987-6543-2109-8765. this order will be processed' UNION ALL
SELECT 'mastercard 5987-6543-2109-8765, thank you' union all
SELECT ' period 1999-10-18 - -2000-10-17 '; ;
SELECT *
FROM @test-2
WHERE --master or visa 16
patindex( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', memo) <>0
or patindex( '%[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9]%', memo) <>0
or patindex( '%[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]%', memo) <>0
--visa 13
or patindex( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', memo) <>0
-- amex 15
or patindex( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', memo) <>0
or patindex( '%[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9][0-9][0-9] [0-9][0-9] [0-9][0-9][0-9]%', memo) <>0
or patindex( '%[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9]%', memo) <>0
You might have a different problem all together... Last I knew, financial regulations ABSOLUTELY REQUIRE that credit card numbers are stored encrypted at ALL TIMES. Hopefully, the numbers posted are entirely made up samples, and the field in question is stored encrypted.
As to the problem field, that's just a really incredibly bad idea, especially given current financial regulations. Having to try and parse that kind of data from some kind of memo field that might not get the appropriate protection is probably gross negligence, at the very least. At best, it's encrypted, but it's still a bad idea. Having to parse out that kind of information with no consistency to formatting is going to be prone to failure, because there just can't be an automated pattern that can't be fooled by bad data.
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
December 1, 2016 at 1:24 pm
this is a legacy system i am working , trying to mask all the credit card data that is in plain text. But the challenge itself is the way data is stored i am not able to find a way to display only credit card data.
December 1, 2016 at 1:29 pm
SQL might not be the best tool for this.
You should look at regular expressions processing in another language, they are usually a lot more robust that what Patindex allows, and you can find a lot of samples online.
December 1, 2016 at 8:53 pm
Manic Star (12/1/2016)
SQL might not be the best tool for this.You should look at regular expressions processing in another language, they are usually a lot more robust that what Patindex allows, and you can find a lot of samples online.
I'd be surprised if regex would accurately help much here.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 1, 2016 at 9:37 pm
mxy (12/1/2016)
this is a legacy system i am working , trying to mask all the credit card data that is in plain text. But the challenge itself is the way data is stored i am not able to find a way to display only credit card data.
Pattern matching for 15 or 16 digits is just the start here (not to mention that it's also an incorrect assumption) and the pattern matching can be made simpler by changing all dashes and spaces to empty strings before you make the extraction attempt. Once you've extracted such digits from the fray of the rest of the text, you then have to test for several things to ensure that it's actually a credit card number. Those steps would be as follows...
1. Extract the credit card type from the leading digits of the number you've extracted.
2. Ensure that the credit card type is a currently active credit card type.
3. Knowing the credit card type, validate the length of the number against all the possible length that credit card type can have.
4. Perform a LUHN 10 Checksum validation and compare it against the check-digit of the credit card number, which is the last digit of the credit card type.
Although unlikely, it will NOT guarantee that the transaction number will fail the credit card test above. You could use other hints in the text to do that. For example, if the credit card type digits indicate a Master Card and the word "Master" is in the text, that's a pretty strong indication that it's actually a credit card number that you've extracted. The appearance of "CC#" would be another strong indication of it being a credit card number.
Yes, you could probably do all of this with RegEx but I suspect it will be a whole lot slower than what you might be able to do in T-SQL. Why do I say that? Because here's a list of the credit card type and length rules.
https://en.wikipedia.org/wiki/Payment_card_number
Here's an explanation of the LUHN 10 Check Sum.
https://en.wikipedia.org/wiki/Luhn_algorithm
A very important part of calculating LUHN 10 Checksums is the understanding that padding with leading zeros does NOT change the outcome. If you always left zero pad out to 20 digits, it will make calculating the LUHN 10 Check Digit a whole lot easier than doing a REVERSE and trying to calculate from the right. It will also guaranteed that all credit card numbers are handled exactly the same way regardless of length, which will help in areas such as parameter sniffing and single execution plan reuse.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 1, 2016 at 11:55 pm
SQL CLR Regex would be the way I would go honestly. IIRC someone actually already built a regular expression system for SQL Server using that.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
December 2, 2016 at 6:59 am
TheSQLGuru (12/1/2016)
SQL CLR Regex would be the way I would go honestly. IIRC someone actually already built a regular expression system for SQL Server using that.
A lot of folks say that but it's been my experience in many posts that measured performance of different methods that if it can be done using T-SQL, it will beat SQLCLR REGEX. As with any experience, there are always exceptions. I'll see if I can find a couple of the threads where such testing occurred. There was one really big one that I have in mind but don't have the URL for it handy.
This is a DR/BCP weekend for me so I might not be able to get to this thread this weekend but I'll try.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 2, 2016 at 7:36 am
SQLsharp has a validation for credit cards. It will only validate a number and won't extract it from a string. If you want to try it, feel free to do it.
Viewing 15 posts - 1 through 15 (of 30 total)
You must be logged in to reply to this topic. Login to reply