Extract data from a string

  • Can any assist me with how to extract data from column2. This has a number of .pdf as well as .txt and .xls and the next issue is the files names have spaces so not sure how to use charindex or patindex or substring.. or if there is anyother way to solve this.

    column1 column2

    1 wo 99808944|F|Q:\Links\WO 99808944.PDF 99825525|F|Q:\LINKS\99825525.PDF

    2 |F|HRMC\UNABLE TO LOCATE REPORT FOR JANUARY 2010.TXT not located memo|F|HRMC\NOTLOCLTHRMC JAN 10 (2).DOC not located list|F|HRMC\COPY OF HRMC 1-10 NOT LOCATED.XLS

    Indeed to get the files from these sample columns

    Column1 Column2

    1 WO 99808944.PDF

    1 99825525.PDF

    2 UNABLE TO LOCATE REPORT FOR JANUARY 2010.TXT

    2 NOTLOCLTHRMC JAN 10 (2).DOC

    2 COPY OF HRMC 1-10 NOT LOCATED.XLS

  • Without a predefined pattern, it is tough to find out the file names.

    It is possible though, if we have the list of file extensions, but still, the whole string should be subjected to multiple parsing for each file extension.

    So, question to u, do you have the list of file possible file extensions that may be present in a row?

  • yes the extensions present in the column are

    .tif, .jpg,.txt,.pdf,.xls and .doc

  • I have noticed that the file name starts after '\' or |F| and ends with an extension ..

    Seeing this I think I need to first figure out how to seperate files + extension then arrange them with a line break or sequence

  • the spaces in the file names makes it hard to do also;

    i tried splitting the data on the slahses, and i'm not getting all teh possible results so far.

    with mySampleData(id,val)

    AS

    (

    SELECT '1','wo 99808944|F|Q:\Links\WO 99808944.PDF 99825525|F|Q:\LINKS\99825525.PDF' UNION ALL

    SELECT '2','|F|HRMC\UNABLE TO LOCATE REPORT FOR JANUARY 2010.TXT not located memo|F|HRMC\NOTLOCLTHRMC JAN 10 (2).DOC not located list|F|HRMC\COPY OF HRMC 1-10 NOT LOCATED.XLS'

    )

    select

    * from mySampleData

    cross apply dbo.DelimitedSplit8K(val,'\') myf

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • yes with the spaces it makes it even harder

  • How about this?

    DECLARE @tab TABLE

    (

    column1 INT

    ,column2 VARCHAR(2000)

    )

    INSERT INTO @tab

    SELECT 1, 'wo 99808944|F|Q:\Links\WO 99808944.PDF 99825525|F|Q:\LINKS\99825525.PDF'

    UNION ALL SELECT 2, '|F|HRMC\UNABLE TO LOCATE REPORT FOR JANUARY 2010.TXT not located memo|F|HRMC\NOTLOCLTHRMC JAN 10 (2).DOC not located list|F|HRMC\COPY OF HRMC 1-10 NOT LOCATED.XLS'

    ; WITH CTE AS

    (

    SELECT T.column1

    ,T.column2

    ,RvrsStr = REVERSE(T.column2)

    FROM @tab T

    )

    , Idxs AS

    (

    SELECT T.RvrsStr

    ,PDFIdx = CHARINDEX('FDP.', T.RvrsStr)

    ,TXTIdx = CHARINDEX('TXT.', T.RvrsStr)

    ,JPGIdx = CHARINDEX('GPJ.', T.RvrsStr)

    ,TIFIdx = CHARINDEX('FIT.', T.RvrsStr)

    ,XLSIdx = CHARINDEX('SLX.', T.RvrsStr)

    ,DOCIdx = CHARINDEX('COD.', T.RvrsStr)

    FROM CTE T

    )

    , SlshIdx AS

    (

    SELECT

    I.RvrsStr

    , I.PDFIdx ,NxtSlshAfterPDF = CHARINDEX('\',I.RvrsStr,I.PDFIdx)

    , I.TXTIdx ,NxtSlshAfterTXT = CHARINDEX('\',I.RvrsStr,I.TXTIdx)

    , I.JPGIdx ,NxtSlshAfterJPG = CHARINDEX('\',I.RvrsStr,I.JPGIdx)

    , I.TIFIdx ,NxtSlshAfterTIF = CHARINDEX('\',I.RvrsStr,I.TIFIdx)

    , I.XLSIdx ,NxtSlshAfterXLS = CHARINDEX('\',I.RvrsStr,I.XLSIdx)

    , I.DOCIdx ,NxtSlshAfterDOC = CHARINDEX('\',I.RvrsStr,I.DOCIdx)

    FROM Idxs I

    )

    SELECT REVERSE ( S.RvrsStr )

    , REVERSE ( SUBSTRING(S.RvrsStr , S.PDFIdx , S.NxtSlshAfterPDF - S.PDFIdx) )

    , REVERSE ( SUBSTRING(S.RvrsStr , S.TXTIdx , S.NxtSlshAfterTXT - S.TXTIdx) )

    , REVERSE ( SUBSTRING(S.RvrsStr , S.JPGIdx , S.NxtSlshAfterJPG - S.JPGIdx) )

    , REVERSE ( SUBSTRING(S.RvrsStr , S.TIFIdx , S.NxtSlshAfterTIF - S.TIFIdx) )

    , REVERSE ( SUBSTRING(S.RvrsStr , S.XLSIdx , S.NxtSlshAfterXLS - S.XLSIdx) )

    , REVERSE ( SUBSTRING(S.RvrsStr , S.DOCIdx , S.NxtSlshAfterDOC - S.DOCIdx) )

    FROM SlshIdx S

  • This is almost close, except for one document is missing 99808944.PDF from the first column and also duplicates...

    Thank you for the quick response

  • knakka 14475 (3/23/2012)


    This is almost close, except for one document is missing 99808944.PDF from the first column and also duplicates...

    Thank you for the quick response

    That is because the 99808944.PDF did not have .PDF in the sample you gave. Now this makes me tell you the rules

    1. If there are multiple documents with the same extension within a row, then we need to do some looping (set-based looping) to get the indexes of each extension and the parse the string.

    2. This relies on the presence of "\" before every file name

    3. The file name does not contain the reverse of file extensions within any other part of the query

    4. if there are files with extensions that are not in the list then this query will not find them

    5. If run over a million plus table, this query will bring the server down to its knees.

  • Slightly improved version

    DECLARE @tab TABLE

    (

    column1 INT

    ,column2 VARCHAR(2000)

    )

    INSERT INTO @tab

    SELECT 1, 'wo 99808944.PDF|F|Q:\Links\WO 99808944.PDF 99825525|F|Q:\LINKS\99825525.PDF |F|d:\APPS\Myphoto.JPG'

    UNION ALL SELECT 2, '|F|HRMC\UNABLE TO LOCATE REPORT FOR JANUARY 2010.TXT not located memo|F|HRMC\NOTLOCLTHRMC JAN 10 (2).DOC not located list|F|HRMC\COPY OF HRMC 1-10 NOT LOCATED.XLS'

    ; WITH CTE AS

    (

    SELECT T.column1

    ,T.column2

    ,RvrsStr = REVERSE(T.column2)

    FROM @tab T

    )

    , Idxs AS

    (

    SELECT T.RvrsStr

    ,PDFIdx = CHARINDEX('FDP.', T.RvrsStr)

    ,TXTIdx = CHARINDEX('TXT.', T.RvrsStr)

    ,JPGIdx = CHARINDEX('GPJ.', T.RvrsStr)

    ,TIFIdx = CHARINDEX('FIT.', T.RvrsStr)

    ,XLSIdx = CHARINDEX('SLX.', T.RvrsStr)

    ,DOCIdx = CHARINDEX('COD.', T.RvrsStr)

    FROM CTE T

    )

    , SlshIdx AS

    (

    SELECT

    I.RvrsStr

    , I.PDFIdx ,NxtSlshAfterPDF = CASE WHEN I.PDFIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.PDFIdx) ELSE 0 END

    , I.TXTIdx ,NxtSlshAfterTXT = CASE WHEN I.TXTIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.TXTIdx) ELSE 0 END

    , I.JPGIdx ,NxtSlshAfterJPG = CASE WHEN I.JPGIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.JPGIdx) ELSE 0 END

    , I.TIFIdx ,NxtSlshAfterTIF = CASE WHEN I.TIFIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.TIFIdx) ELSE 0 END

    , I.XLSIdx ,NxtSlshAfterXLS = CASE WHEN I.XLSIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.XLSIdx) ELSE 0 END

    , I.DOCIdx ,NxtSlshAfterDOC = CASE WHEN I.DOCIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.DOCIdx) ELSE 0 END

    FROM Idxs I

    )

    --select * from SlshIdx

    SELECT REVERSE ( S.RvrsStr )

    , REVERSE ( SUBSTRING(S.RvrsStr , S.PDFIdx , S.NxtSlshAfterPDF - S.PDFIdx) ) AS PDF

    , REVERSE ( SUBSTRING(S.RvrsStr , S.TXTIdx , S.NxtSlshAfterTXT - S.TXTIdx) ) AS TXT

    , REVERSE ( SUBSTRING(S.RvrsStr , S.JPGIdx , S.NxtSlshAfterJPG - S.JPGIdx) ) AS JPG

    , REVERSE ( SUBSTRING(S.RvrsStr , S.TIFIdx , S.NxtSlshAfterTIF - S.TIFIdx) ) AS TIF

    , REVERSE ( SUBSTRING(S.RvrsStr , S.XLSIdx , S.NxtSlshAfterXLS - S.XLSIdx) ) AS XLS

    , REVERSE ( SUBSTRING(S.RvrsStr , S.DOCIdx , S.NxtSlshAfterDOC - S.DOCIdx) ) AS DOC

    FROM SlshIdx S

  • Thank you.. that was really nice...

  • Thank you.. That was really nice..

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply