Is there a way to extract only credit card numbers from text

Question

Is there a way to extract only credit card numbers from text

Viewing 15 posts - 16 through 30 (of 30 total)

You must be logged in to reply to this topic. Login to reply

Alan Burstein SSC Guru Points: 61141 More actions · Answer 1

I'm a little late here but actually started a solution on Friday that I just now had time to finish.

First and foremost: unless someone has an example of how to do this another with Regex I'm going to emphatically aver that good ol' T-SQL will be as elegant and will outperform any CLR/Regex solution.

What I came up with is a way to extract all possible credit card numbers from a string. Say N is a number, by "possible" credit card number I'm talking about:

1) NNNNNNNNNNNNNNNN

2) NNNN-NNNN-NNNN-NNNN

3) NNNN NNNN NNNN NNNN

Let's say that X is any of these patterns. We're looking for one of these scenarios:

1) X surrounded by nothing; e.g. the value = '1234123412341234', '1234-1234-1234-1234' or '1234 1234 1234 1234'

2) X surrounded by non-numeric; e.g. 'blah 1234 1234 1234 1234 blah, blah...'

Note: 17 consecutive digits != CC number

3) A string beginning with X followed by non-numeric; e.g. '1234-1234-1234-1234 is the number....'

4) A string that ends with non-numeric followed by X; e.g. 'blah blah cc:1234123412341234'

With all this in mind....

Here's a pattern to identify 16 consecutive digits:

SELECT p = REPLICATE('[0-9]',16);

Now let's identify the following patterns: NNNNNNNNNNNNNNNN, NNNN-NNNN-NNNN-NNNN & NNNN NNNN NNNN NNNN

SELECT p = STUFF(STUFF(STUFF(REPLICATE('[0-9]',16), 61,0,x), 41,0,x), 21,0,x)

FROM (VALUES ('[ ]'),('[-]'),('')) nx(x);

These patterns represent X from earlier - let's create all of the aforementioned scenarios...

SELECT p

FROM

( -- "x" is: 4 sets of 4 numbers separated by (1) a space, (2) a hyphen, (3) nothing

SELECT STUFF(STUFF(STUFF(REPLICATE('[0-9]',16), 61,0,x), 41,0,x), 21,0,x)

FROM (VALUES ('[ ]'),('[-]'),('')) nx(x)

) nx(x)

CROSS APPLY (VALUES

(x), -- x = 16 numbers, or 4 sets of 4 numbers separated by a space or a hyphen

('%[^0-9]'+x+'[^0-9]%'), -- string contains x surrounded by non-numeric characters

('%[^0-9]'+x), -- The string ends a non-numeric character followed by x

(x+'[^0-9]%') -- the string begins with x followed by a non-numeric character

) p(p);

Results:

p

-------------------------------------------------------------------------------------------------------

%[^0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][^0-9]%

%[^0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]

[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][^0-9]%

[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]

%[^0-9][0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9][^0-9]%

%[^0-9][0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9]

[0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9][^0-9]%

[0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9]

%[^0-9][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][^0-9]%

%[^0-9][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9]

[0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][^0-9]%

[0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9]

Now that we have that, we can easily identify the location of every match in the string along with what pattern that string matched using the following query:

DECLARE @test-2 TABLE (memoID int identity, memo varchar(400) );

INSERT INTO @test-2(memo)

SELECT '1234567890123456 abc this account used visa :1234567890123456' UNION ALL --1x

SELECT 'this account visa 1234567890123456 request for receipt' UNION ALL --2x

SELECT 'this account visa 123456789012345612/13 exp' UNION ALL --3

SELECT 'invoice card used 4123 4567 8901 2345 some text' UNION ALL --4x

SELECT 'paid using 4123 4567 8901 2345 thank you' UNION ALL --5x

SELECT 'Transaction code 123456454636667445 copy invoice' UNION ALL --6

SELECT 'mastercard cc#5987-6543-2109-8765. this order will be processed' UNION ALL --7x

SELECT 'mastercard 5987-6543-2109-8765, thank you' UNION ALL --8x

SELECT ' period 1999-10-18 - -2000-10-17 ' UNION ALL --9

SELECT '1234567890123456 is the preferred card' UNION ALL --10x

SELECT '9876543210123456' UNION ALL --11x

SELECT '65432101234567890' UNION ALL --12

SELECT '9876 5432 1012 3456' UNION ALL --13x

SELECT '9876-5432-10123456'; --14

-- GET memoID, memo text, the pattern that was matched and the location of the match

WITH patterns(pattern, xx) AS

(

SELECT p, delim

FROM

(

SELECT STUFF(STUFF(STUFF(REPLICATE('[0-9]',16), 61,0,x), 41,0,x), 21,0,x), x

FROM (VALUES ('[ ]'),('[-]'),('')) nx(x)

) nx(x, delim)

CROSS APPLY (VALUES (x), ('%[^0-9]'+x+'[^0-9]%'), ('%[^0-9]'+x), (x+'[^0-9]%')) p(p)

)

SELECT

memoID,

location = PATINDEX(pattern, memo),

pattern,

[matched] = SUBSTRING

(

memo,

PATINDEX(pattern, memo)+PATINDEX('[%]%',pattern),

16 + IIF(LEN(xx)=0,0,3)

)

FROM @test-2

CROSS JOIN patterns

WHERE memo LIKE pattern;

Results:

memoID location pattern matched

----------- ----------- ------------------------------------------------------------------------------------------------------- -------------------

13 1 [0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9] 9876 5432 1012 3456

4 18 %[^0-9][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][^0-9]% 4123 4567 8901 2345

5 11 %[^0-9][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][ ][0-9][0-9][0-9][0-9][^0-9]% 4123 4567 8901 2345

7 14 %[^0-9][0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9][^0-9]% 5987-6543-2109-8765

8 11 %[^0-9][0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9][-][0-9][0-9][0-9][0-9][^0-9]% 5987-6543-2109-8765

11 1 [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] 9876543210123456

2 18 %[^0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][^0-9]% 1234567890123456

1 45 %[^0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] 1234567890123456

1 1 [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][^0-9]% 1234567890123456

10 1 [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][^0-9]% 1234567890123456

Now that you use this in a subquery which can further analyze each possible CC number (e.g. the LUHN 10 Check Sum and some of the other things Jeff was talking about like so:

WITH patterns(pattern, xx) AS

(

SELECT p, delim

FROM

(

SELECT STUFF(STUFF(STUFF(REPLICATE('[0-9]',16), 61,0,x), 41,0,x), 21,0,x), x

FROM (VALUES ('[ ]'),('[-]'),('')) nx(x)

) nx(x, delim)

CROSS APPLY (VALUES (x), ('%[^0-9]'+x+'[^0-9]%'), ('%[^0-9]'+x), (x+'[^0-9]%')) p(p)

),

matches AS

(

SELECT memoID, [matched] =

SUBSTRING

(

memo,

PATINDEX(pattern, memo)+PATINDEX('[%]%',pattern),

16 + IIF(LEN(xx)=0,0,3)

)

FROM @test-2

CROSS JOIN patterns

WHERE memo LIKE pattern

)

SELECT m.MemoID, m.[matched], ma.IsValidCC

FROM matches m

CROSS APPLY dbo.CCAnalysis(m.[matched]) ma; -- another function that analyzes each possible match

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

sgmunson SSC Guru Points: 110640 More actions · Answer 2

Alan,

Nice work. I did get the impression that the original poster had indicated that the data isn't necessarily that consistent or conforming, so I don't know if it's going to solve every problem they ran into, but it sure looks like a pretty good start.

Steve (aka sgmunson)
Rent Servers for Income (picks and shovels strategy)

mxy SSCarpal Tunnel Points: 4036 More actions · Answer 3

thanks alan for your time and providing a solution i will test it on my data and make necessary changes if needed based on my real data

Jeff Moden SSC Guru Points: 1004556 More actions · Answer 4

mxy (12/5/2016)
thanks alan for your time and providing a solution i will test it on my data and make necessary changes if needed based on my real data

Careful, now... Not all credit card numbers contain 16 digits.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 5

Lots of what looks like painstaking and tedious work there Alan. Kudos for that. It does seem that it only processes 16 digit numbers, which as Jeff pointed out is insufficient.

I will bet that properly crafted CLR and RegEx will be faster than TSQL for this data processing need, quite possibly significantly so.

Here are my favorite two resources to provide discussion and demonstrations and code for just how complex this space is:

https://www.schkerke.com/wps/2015/08/c-finding-credit-cards-within-free-form-text/

https://www.codeproject.com/Articles/20271/Ultimate-NET-Credit-Card-Utility-Class

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

Joe Celko SSCertifiable Points: 5749 More actions · Answer 6

i try to extract valid credit card data. But i was not able to do so ,since data is not in good format.

SQL was never meant to be used as a text processing language or for doing data scrubbing. This order processing should be done outside the database layer in your tiered architecture and not in the database at all. Do you have a budget to get a package for doing this? If so, spend the money and save yourself lots of time and trouble doing something that somebody else has already done for you. This same philosophy applies to scrubbing mailing data.

Books in Celko Series for Morgan-Kaufmann Publishing
Analytics and OLAP in SQL
Data and Databases: Concepts in Practice
Data, Measurements and Standards in SQL
SQL for Smarties
SQL Programming Style
SQL Puzzles and Answers
Thinking in Sets
Trees and Hierarchies in SQL

mxy SSCarpal Tunnel Points: 4036 More actions · Answer 7

this is a legacy system not in use currently, we are trying to find true credit card numbers so that we can mask with xxxx characters.

other thing application used to accept only master, visa and amex.

ManicStar SSCoach Points: 17992 More actions · Answer 8

mxy (12/6/2016)
this is a legacy system not in use currently, we are trying to find true credit card numbers so that we can mask with xxxx characters.
other thing application used to accept only master, visa and amex.

I had to do the exact same thing before. You won't catch them all. You will get about maybe 80% of them, especially if they are in free-flowing note fields.

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 9

Jeff Moden (12/2/2016)
TheSQLGuru (12/1/2016)
SQL CLR Regex would be the way I would go honestly. IIRC someone actually already built a regular expression system for SQL Server using that.
A lot of folks say that but it's been my experience in many posts that measured performance of different methods that if it can be done using T-SQL, it will beat SQLCLR REGEX. As with any experience, there are always exceptions. I'll see if I can find a couple of the threads where such testing occurred. There was one really big one that I have in mind but don't have the URL for it handy.
This is a DR/BCP weekend for me so I might not be able to get to this thread this weekend but I'll try.

Jeff - on that note: if you find it, it may be worth fully redoing the comparison using 2014 or 2016. Based on some of the recent revalidation I did for a recent thread, I am not sure the previous findings will hold (check your PMs - I sent you the details a while back).

Of course - that still doesn't get around some of the other objections (i.e. external items running within the core, secondary toolsets/skillsets etc...), but the perf answer may not be the same.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Solomon Rutzky SSCoach Points: 16430 More actions · Answer 10

As a few people have mentioned, this is fairly easy to accomplish with a Regular Expression. The following pattern identifies the variations found in your data with respect to being constrained to just Visa, MasterCard, and AMEX (the frowny face is a colon followed by a left-parenthesis):

\b(?:(?:\d{4}(?:[ -]?\d{4}){3}|\d{4}[ -]?\d{6}[ -]?\d{5})(?=\d{2}/\d{2}|[^/0-9]|$)|\d{16,}(?=[^/0-9]|$))

What that pattern looks for is:

MATCH starts on a word boundary (i.e. no leading digits so it won't capture the last 16 digits of a 20 digit sequence!)

[h2]-- AND --[/h2]

(handle CC numbers)

MATCH contains:
4 digits followed by three sets of "0 or 1 instances of a space or dash followed by 4 digits" (this covers all 16-digit variations)
OR
4 digits, followed by 0 or 1 instances of a space or dash, followed by 6 digits, followed by 0 or 1 instances of a space or dash, followed by 5 digits (this covers all 15-digit variations)
PATTERN requires but does not include in MATCH:
MATCH followed by one of these three patterns:
* 2 digits followed by "/" followed by 2 digits
* any character that is not a "/" or digit
* end of string

[h3]-- OR -- [/h3]

(handle transaction codes)

MATCH contains 16 or more digits
PATTERN requires but does not include in MATCH:
MATCH followed by one of these two patterns:
* any character that is not a "/" or digit
* end of string

Hence, this approach does actually extract the CC number, regardless of formatting, regardless of there being an expiration date following the CC number.

This pattern also prevents matching invalid sequences such as an AMEX number that is missing a digit (hence 14 digits) that is followed by an expiration date, which starts with 2 digits for a total of 16 digits, and hence can be a false-positive for any of the three card types assuming no expiration date.

The example below shows this behavior. The example uses a RegEx function, RegEx_MatchSimple4k, that is available in the free version of SQL#[/url], a library of SQLCLR functions that I wrote.

DECLARE @test-2 TABLE ([memo] VARCHAR(400) );

INSERT INTO @test-2 ([memo])

SELECT 'abc this account used visa :1234567890123456' UNION ALL

SELECT 'this account visa 1234567890123456 request for receipt' UNION ALL

SELECT 'this account visa 123456789012345612/13 exp' UNION ALL

SELECT 'invoice card used 4123 4567 8901 2345 some text' UNION ALL

SELECT 'paid using 4123 4567 8901 2345 thank you' UNION ALL

SELECT 'Transaction code 123456454636667445 copy invoice'UNION ALL -- this is 18 digits, not 16!

SELECT 'mastercard cc#5987-6543-2109-8765. this order will be processed' UNION ALL

SELECT 'mastercard 5987-6543-2109-8765, thank you' union all

SELECT 'AMEX code 123454636667445 copy invoice'UNION ALL -- AMEX with no spacers

SELECT 'AMEX two 1234 546366 67445 copy invoice'UNION ALL -- AMEX with spacers

SELECT 'AMEX with exp date 12345463666744599/12 copy invoice'UNION ALL -- AMEX with no spacers + exp

SELECT 'another AMEX with exp date 1234 546366 6744599/12 copy invoice'UNION ALL -- AMEX with spacers + exp

SELECT 'AMEX missing 1 digit 1234567890123412/13 exp' UNION ALL -- INVALID, but 16 digits before the "/"

SELECT 'tran code 1234567890123412234234213' UNION ALL

SELECT 'any card 123456789012341223423421345/45 exp' UNION ALL -- INVALID, too long

SELECT ' period 1999-10-18 - -2000-10-17 ';

DECLARE @Pattern NVARCHAR(120);

SET @Pattern = N'\b(?:(?:\d{4}(?:[ -]?\d{4}){3}|\d{4}[ -]?\d{6}[ -]?\d{5})(?=\d{2}/\d{2}|[^/0-9]|$)|\d{16,}(?=[^/0-9]|$))';

SELECT [memo], SQL#.RegEx_MatchSimple4k([memo], @Pattern, 1, NULL) AS [CC]

FROM @test-2;

SQL# — https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 11

Very slick! Sadly I seem to be unable to get to the SQLSharp.com website via multiple browsers.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

Solomon Rutzky SSCoach Points: 16430 More actions · Answer 12

TheSQLGuru (12/10/2016)
Very slick! Sadly I seem to be unable to get to the SQLSharp.com website via multiple browsers.

D'oh! My hosting provider did a server migration recently and I thought the DNS servers were set to the correct ones, but I just checked and they were still pointing to the old DNS servers. So, I just updated the domain record to point to the correct servers and that change should propagate throughout the day. Thanks for pointing that out!

Please try again at: http://SQLsharp.com/

Take care, Solomon...

SQL# — https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 13

All-better-fixed!

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

Solomon Rutzky SSCoach Points: 16430 More actions · Answer 14

I am replying to both of the following posts together as there is a lot of overlap between them:

Luis Cazares (12/2/2016)
SQLsharp has a validation for credit cards. It will only validate a number and won't extract it from a string. If you want to try it, feel free to do it.

and:

TheSQLGuru (12/6/2016)
I will bet that properly crafted CLR and RegEx will be faster than TSQL for this data processing need, quite possibly significantly so.
Here are my favorite two resources to provide discussion and demonstrations and code for just how complex this space is:
https://www.schkerke.com/wps/2015/08/c-finding-credit-cards-within-free-form-text/
https://www.codeproject.com/Articles/20271/Ultimate-NET-Credit-Card-Utility-Class

Hi Luis and Kevin. Yes, SQL#[/url] does have a function, Util_IsValidCC, that validates credit card numbers using regular expressions (and yes, it is in the Free version ). The validation covers both the BIN / IIN prefixes as well as the Luhn calculation. No, it does not extract the CC numbers from a string. It can handle dashes in the numbers, but not spaces.

I took a look at the two resources posted by Kevin and then reviewed my code to see if I was missing anything. I found that Util_IsValidCC was both missing some definitions and had some room for improvement regarding performance. But, I also found that between the two resources posted by Kevin, and numerous others that I checked, absolutely nobody has an accurate set of RegEx patterns to describe the various IIN / BIN prefixes. And, nobody (not even Wikipedia) seems to even have an accurate description of what those prefix ranges are. Discover (which covers 5 card types) has some nuances that nobody seems to be aware of, and only some are aware of the new 2 series of Mastercard numbers, coming out in a couple of weeks ( https://www.mastercard.us/en-us/issuers/get-support/2-series-bin-expansion.html ).

So, I have now completed updating Util_IsValidCC (for the next release of SQL#) to not only be much faster and handle dashes, but also to handle the following card types: JCB, Carte Blanche, PayPal, Union Pay, MIR, UATP, Dankort, InterPayment, and Maestro (the range that doesn't overlap with other cards). I also added functions to breakout the separate validations to do either the prefixes-only or the Luhn calculation only. The next release will be published within the next few weeks.

And if I ever get the time, I will post the basic .NET CC validation code and RegEx expressions on GitHub :).

Take care,

Solomon..

SQL# — https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR

Solomon Rutzky SSCoach Points: 16430 More actions · Answer 15

And I just thought of another variation of what I proposed above, taking into account that the goal is to replace the offending card numbers and transaction IDs with "x"s. The following example is mostly the same as before, but shows using RegEx_Replace in two ways to facilitate an UPDATE statement: the first way, ReplaceExtractedCC, is a simple replace (using RegEx_Replace instead of RegEx_MatchSimple) that replaces whatever value is matched with a static string of 16 "x"s; the second way, ReplaceEachDigit, is a two-step approach that first uses RegEx_MatchSimple to extract the offending value, and then runs that through RegEx_Replace to replace each digit with an "x". The first approach is obviously simpler, but it can't make the "masked" value match the format of the original value. The second approach, by replacing each individual digit in the original number, matches the original format in terms of both number of digits and embedded spaces and/or dashes, if there are any. Enjoy...

(frowny face in the RegEx expression is a colon followed by a left-parenthesis)