Extract data from a string

Question

Extract data from a string

knakka99

Say Hey Kid

Points: 671
More actions
March 23, 2012 at 11:33 am

#270721

Can any assist me with how to extract data from column2. This has a number of .pdf as well as .txt and .xls and the next issue is the files names have spaces so not sure how to use charindex or patindex or substring.. or if there is anyother way to solve this.
column1 column2
1 wo 99808944|F|Q:\Links\WO 99808944.PDF 99825525|F|Q:\LINKS\99825525.PDF
2 |F|HRMC\UNABLE TO LOCATE REPORT FOR JANUARY 2010.TXT not located memo|F|HRMC\NOTLOCLTHRMC JAN 10 (2).DOC not located list|F|HRMC\COPY OF HRMC 1-10 NOT LOCATED.XLS
Indeed to get the files from these sample columns
Column1 Column2
1 WO 99808944.PDF
1 99825525.PDF
2 UNABLE TO LOCATE REPORT FOR JANUARY 2010.TXT
2 NOTLOCLTHRMC JAN 10 (2).DOC
2 COPY OF HRMC 1-10 NOT LOCATED.XLS

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply

ColdCoffee SSC-Dedicated Points: 39972 More actions · Answer 1

Without a predefined pattern, it is tough to find out the file names.

It is possible though, if we have the list of file extensions, but still, the whole string should be subjected to multiple parsing for each file extension.

So, question to u, do you have the list of file possible file extensions that may be present in a row?

knakka99 Say Hey Kid Points: 671 More actions · Answer 2

yes the extensions present in the column are

.tif, .jpg,.txt,.pdf,.xls and .doc

knakka99 Say Hey Kid Points: 671 More actions · Answer 3

I have noticed that the file name starts after '\' or |F| and ends with an extension ..

Seeing this I think I need to first figure out how to seperate files + extension then arrange them with a line break or sequence

Lowell SSC Guru Points: 323518 More actions · Answer 4

the spaces in the file names makes it hard to do also;

i tried splitting the data on the slahses, and i'm not getting all teh possible results so far.

with mySampleData(id,val)

AS

(

SELECT '1','wo 99808944|F|Q:\Links\WO 99808944.PDF 99825525|F|Q:\LINKS\99825525.PDF' UNION ALL

SELECT '2','|F|HRMC\UNABLE TO LOCATE REPORT FOR JANUARY 2010.TXT not located memo|F|HRMC\NOTLOCLTHRMC JAN 10 (2).DOC not located list|F|HRMC\COPY OF HRMC 1-10 NOT LOCATED.XLS'

)

select

* from mySampleData

cross apply dbo.DelimitedSplit8K(val,'\') myf

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

knakka99 Say Hey Kid Points: 671 More actions · Answer 5

knakka99

Say Hey Kid

Points: 671

March 23, 2012 at 12:28 pm

#1463679

yes with the spaces it makes it even harder

ColdCoffee SSC-Dedicated Points: 39972 More actions · Answer 6

How about this?

DECLARE @tab TABLE

(

column1 INT

,column2 VARCHAR(2000)

)

INSERT INTO @tab

SELECT 1, 'wo 99808944|F|Q:\Links\WO 99808944.PDF 99825525|F|Q:\LINKS\99825525.PDF'

UNION ALL SELECT 2, '|F|HRMC\UNABLE TO LOCATE REPORT FOR JANUARY 2010.TXT not located memo|F|HRMC\NOTLOCLTHRMC JAN 10 (2).DOC not located list|F|HRMC\COPY OF HRMC 1-10 NOT LOCATED.XLS'

; WITH CTE AS

(

SELECT T.column1

,T.column2

,RvrsStr = REVERSE(T.column2)

FROM @tab T

)

, Idxs AS

(

SELECT T.RvrsStr

,PDFIdx = CHARINDEX('FDP.', T.RvrsStr)

,TXTIdx = CHARINDEX('TXT.', T.RvrsStr)

,JPGIdx = CHARINDEX('GPJ.', T.RvrsStr)

,TIFIdx = CHARINDEX('FIT.', T.RvrsStr)

,XLSIdx = CHARINDEX('SLX.', T.RvrsStr)

,DOCIdx = CHARINDEX('COD.', T.RvrsStr)

FROM CTE T

)

, SlshIdx AS

(

SELECT

I.RvrsStr

, I.PDFIdx ,NxtSlshAfterPDF = CHARINDEX('\',I.RvrsStr,I.PDFIdx)

, I.TXTIdx ,NxtSlshAfterTXT = CHARINDEX('\',I.RvrsStr,I.TXTIdx)

, I.JPGIdx ,NxtSlshAfterJPG = CHARINDEX('\',I.RvrsStr,I.JPGIdx)

, I.TIFIdx ,NxtSlshAfterTIF = CHARINDEX('\',I.RvrsStr,I.TIFIdx)

, I.XLSIdx ,NxtSlshAfterXLS = CHARINDEX('\',I.RvrsStr,I.XLSIdx)

, I.DOCIdx ,NxtSlshAfterDOC = CHARINDEX('\',I.RvrsStr,I.DOCIdx)

FROM Idxs I

)

SELECT REVERSE ( S.RvrsStr )

, REVERSE ( SUBSTRING(S.RvrsStr , S.PDFIdx , S.NxtSlshAfterPDF - S.PDFIdx) )

, REVERSE ( SUBSTRING(S.RvrsStr , S.TXTIdx , S.NxtSlshAfterTXT - S.TXTIdx) )

, REVERSE ( SUBSTRING(S.RvrsStr , S.JPGIdx , S.NxtSlshAfterJPG - S.JPGIdx) )

, REVERSE ( SUBSTRING(S.RvrsStr , S.TIFIdx , S.NxtSlshAfterTIF - S.TIFIdx) )

, REVERSE ( SUBSTRING(S.RvrsStr , S.XLSIdx , S.NxtSlshAfterXLS - S.XLSIdx) )

, REVERSE ( SUBSTRING(S.RvrsStr , S.DOCIdx , S.NxtSlshAfterDOC - S.DOCIdx) )

FROM SlshIdx S

knakka99 Say Hey Kid Points: 671 More actions · Answer 7

This is almost close, except for one document is missing 99808944.PDF from the first column and also duplicates...

Thank you for the quick response

ColdCoffee SSC-Dedicated Points: 39972 More actions · Answer 8

knakka 14475 (3/23/2012)
This is almost close, except for one document is missing 99808944.PDF from the first column and also duplicates...
Thank you for the quick response

That is because the 99808944.PDF did not have .PDF in the sample you gave. Now this makes me tell you the rules

1. If there are multiple documents with the same extension within a row, then we need to do some looping (set-based looping) to get the indexes of each extension and the parse the string.

2. This relies on the presence of "\" before every file name

3. The file name does not contain the reverse of file extensions within any other part of the query

4. if there are files with extensions that are not in the list then this query will not find them

5. If run over a million plus table, this query will bring the server down to its knees.

ColdCoffee SSC-Dedicated Points: 39972 More actions · Answer 9

Slightly improved version

DECLARE @tab TABLE

(

column1 INT

,column2 VARCHAR(2000)

)

INSERT INTO @tab

SELECT 1, 'wo 99808944.PDF|F|Q:\Links\WO 99808944.PDF 99825525|F|Q:\LINKS\99825525.PDF |F|d:\APPS\Myphoto.JPG'

UNION ALL SELECT 2, '|F|HRMC\UNABLE TO LOCATE REPORT FOR JANUARY 2010.TXT not located memo|F|HRMC\NOTLOCLTHRMC JAN 10 (2).DOC not located list|F|HRMC\COPY OF HRMC 1-10 NOT LOCATED.XLS'

; WITH CTE AS

(

SELECT T.column1

,T.column2

,RvrsStr = REVERSE(T.column2)

FROM @tab T

)

, Idxs AS

(

SELECT T.RvrsStr

,PDFIdx = CHARINDEX('FDP.', T.RvrsStr)

,TXTIdx = CHARINDEX('TXT.', T.RvrsStr)

,JPGIdx = CHARINDEX('GPJ.', T.RvrsStr)

,TIFIdx = CHARINDEX('FIT.', T.RvrsStr)

,XLSIdx = CHARINDEX('SLX.', T.RvrsStr)

,DOCIdx = CHARINDEX('COD.', T.RvrsStr)

FROM CTE T

)

, SlshIdx AS

(

SELECT

I.RvrsStr

, I.PDFIdx ,NxtSlshAfterPDF = CASE WHEN I.PDFIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.PDFIdx) ELSE 0 END

, I.TXTIdx ,NxtSlshAfterTXT = CASE WHEN I.TXTIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.TXTIdx) ELSE 0 END

, I.JPGIdx ,NxtSlshAfterJPG = CASE WHEN I.JPGIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.JPGIdx) ELSE 0 END

, I.TIFIdx ,NxtSlshAfterTIF = CASE WHEN I.TIFIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.TIFIdx) ELSE 0 END

, I.XLSIdx ,NxtSlshAfterXLS = CASE WHEN I.XLSIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.XLSIdx) ELSE 0 END

, I.DOCIdx ,NxtSlshAfterDOC = CASE WHEN I.DOCIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.DOCIdx) ELSE 0 END

FROM Idxs I

)

--select * from SlshIdx

SELECT REVERSE ( S.RvrsStr )

, REVERSE ( SUBSTRING(S.RvrsStr , S.PDFIdx , S.NxtSlshAfterPDF - S.PDFIdx) ) AS PDF

, REVERSE ( SUBSTRING(S.RvrsStr , S.TXTIdx , S.NxtSlshAfterTXT - S.TXTIdx) ) AS TXT

, REVERSE ( SUBSTRING(S.RvrsStr , S.JPGIdx , S.NxtSlshAfterJPG - S.JPGIdx) ) AS JPG

, REVERSE ( SUBSTRING(S.RvrsStr , S.TIFIdx , S.NxtSlshAfterTIF - S.TIFIdx) ) AS TIF

, REVERSE ( SUBSTRING(S.RvrsStr , S.XLSIdx , S.NxtSlshAfterXLS - S.XLSIdx) ) AS XLS

, REVERSE ( SUBSTRING(S.RvrsStr , S.DOCIdx , S.NxtSlshAfterDOC - S.DOCIdx) ) AS DOC

FROM SlshIdx S

knakka99 Say Hey Kid Points: 671 More actions · Answer 10

knakka99

Say Hey Kid

Points: 671

March 23, 2012 at 2:44 pm

#1463808

Thank you.. that was really nice...

knakka99 Say Hey Kid Points: 671 More actions · Answer 11

knakka99

Say Hey Kid

Points: 671

March 23, 2012 at 2:45 pm

#1463809

Thank you.. That was really nice..