March 23, 2012 at 11:33 am
Can any assist me with how to extract data from column2. This has a number of .pdf as well as .txt and .xls and the next issue is the files names have spaces so not sure how to use charindex or patindex or substring.. or if there is anyother way to solve this.
column1 column2
1 wo 99808944|F|Q:\Links\WO 99808944.PDF 99825525|F|Q:\LINKS\99825525.PDF
2 |F|HRMC\UNABLE TO LOCATE REPORT FOR JANUARY 2010.TXT not located memo|F|HRMC\NOTLOCLTHRMC JAN 10 (2).DOC not located list|F|HRMC\COPY OF HRMC 1-10 NOT LOCATED.XLS
Indeed to get the files from these sample columns
Column1 Column2
1 WO 99808944.PDF
1 99825525.PDF
2 UNABLE TO LOCATE REPORT FOR JANUARY 2010.TXT
2 NOTLOCLTHRMC JAN 10 (2).DOC
2 COPY OF HRMC 1-10 NOT LOCATED.XLS
March 23, 2012 at 11:48 am
Without a predefined pattern, it is tough to find out the file names.
It is possible though, if we have the list of file extensions, but still, the whole string should be subjected to multiple parsing for each file extension.
So, question to u, do you have the list of file possible file extensions that may be present in a row?
March 23, 2012 at 12:13 pm
yes the extensions present in the column are
.tif, .jpg,.txt,.pdf,.xls and .doc
March 23, 2012 at 12:19 pm
I have noticed that the file name starts after '\' or |F| and ends with an extension ..
Seeing this I think I need to first figure out how to seperate files + extension then arrange them with a line break or sequence
March 23, 2012 at 12:25 pm
the spaces in the file names makes it hard to do also;
i tried splitting the data on the slahses, and i'm not getting all teh possible results so far.
with mySampleData(id,val)
AS
(
SELECT '1','wo 99808944|F|Q:\Links\WO 99808944.PDF 99825525|F|Q:\LINKS\99825525.PDF' UNION ALL
SELECT '2','|F|HRMC\UNABLE TO LOCATE REPORT FOR JANUARY 2010.TXT not located memo|F|HRMC\NOTLOCLTHRMC JAN 10 (2).DOC not located list|F|HRMC\COPY OF HRMC 1-10 NOT LOCATED.XLS'
)
select
* from mySampleData
cross apply dbo.DelimitedSplit8K(val,'\') myf
Lowell
March 23, 2012 at 12:28 pm
yes with the spaces it makes it even harder
March 23, 2012 at 12:44 pm
How about this?
DECLARE @tab TABLE
(
column1 INT
,column2 VARCHAR(2000)
)
INSERT INTO @tab
SELECT 1, 'wo 99808944|F|Q:\Links\WO 99808944.PDF 99825525|F|Q:\LINKS\99825525.PDF'
UNION ALL SELECT 2, '|F|HRMC\UNABLE TO LOCATE REPORT FOR JANUARY 2010.TXT not located memo|F|HRMC\NOTLOCLTHRMC JAN 10 (2).DOC not located list|F|HRMC\COPY OF HRMC 1-10 NOT LOCATED.XLS'
; WITH CTE AS
(
SELECT T.column1
,T.column2
,RvrsStr = REVERSE(T.column2)
FROM @tab T
)
, Idxs AS
(
SELECT T.RvrsStr
,PDFIdx = CHARINDEX('FDP.', T.RvrsStr)
,TXTIdx = CHARINDEX('TXT.', T.RvrsStr)
,JPGIdx = CHARINDEX('GPJ.', T.RvrsStr)
,TIFIdx = CHARINDEX('FIT.', T.RvrsStr)
,XLSIdx = CHARINDEX('SLX.', T.RvrsStr)
,DOCIdx = CHARINDEX('COD.', T.RvrsStr)
FROM CTE T
)
, SlshIdx AS
(
SELECT
I.RvrsStr
, I.PDFIdx ,NxtSlshAfterPDF = CHARINDEX('\',I.RvrsStr,I.PDFIdx)
, I.TXTIdx ,NxtSlshAfterTXT = CHARINDEX('\',I.RvrsStr,I.TXTIdx)
, I.JPGIdx ,NxtSlshAfterJPG = CHARINDEX('\',I.RvrsStr,I.JPGIdx)
, I.TIFIdx ,NxtSlshAfterTIF = CHARINDEX('\',I.RvrsStr,I.TIFIdx)
, I.XLSIdx ,NxtSlshAfterXLS = CHARINDEX('\',I.RvrsStr,I.XLSIdx)
, I.DOCIdx ,NxtSlshAfterDOC = CHARINDEX('\',I.RvrsStr,I.DOCIdx)
FROM Idxs I
)
SELECT REVERSE ( S.RvrsStr )
, REVERSE ( SUBSTRING(S.RvrsStr , S.PDFIdx , S.NxtSlshAfterPDF - S.PDFIdx) )
, REVERSE ( SUBSTRING(S.RvrsStr , S.TXTIdx , S.NxtSlshAfterTXT - S.TXTIdx) )
, REVERSE ( SUBSTRING(S.RvrsStr , S.JPGIdx , S.NxtSlshAfterJPG - S.JPGIdx) )
, REVERSE ( SUBSTRING(S.RvrsStr , S.TIFIdx , S.NxtSlshAfterTIF - S.TIFIdx) )
, REVERSE ( SUBSTRING(S.RvrsStr , S.XLSIdx , S.NxtSlshAfterXLS - S.XLSIdx) )
, REVERSE ( SUBSTRING(S.RvrsStr , S.DOCIdx , S.NxtSlshAfterDOC - S.DOCIdx) )
FROM SlshIdx S
March 23, 2012 at 12:51 pm
This is almost close, except for one document is missing 99808944.PDF from the first column and also duplicates...
Thank you for the quick response
March 23, 2012 at 1:53 pm
knakka 14475 (3/23/2012)
This is almost close, except for one document is missing 99808944.PDF from the first column and also duplicates...Thank you for the quick response
That is because the 99808944.PDF did not have .PDF in the sample you gave. Now this makes me tell you the rules
1. If there are multiple documents with the same extension within a row, then we need to do some looping (set-based looping) to get the indexes of each extension and the parse the string.
2. This relies on the presence of "\" before every file name
3. The file name does not contain the reverse of file extensions within any other part of the query
4. if there are files with extensions that are not in the list then this query will not find them
5. If run over a million plus table, this query will bring the server down to its knees.
March 23, 2012 at 1:59 pm
Slightly improved version
DECLARE @tab TABLE
(
column1 INT
,column2 VARCHAR(2000)
)
INSERT INTO @tab
SELECT 1, 'wo 99808944.PDF|F|Q:\Links\WO 99808944.PDF 99825525|F|Q:\LINKS\99825525.PDF |F|d:\APPS\Myphoto.JPG'
UNION ALL SELECT 2, '|F|HRMC\UNABLE TO LOCATE REPORT FOR JANUARY 2010.TXT not located memo|F|HRMC\NOTLOCLTHRMC JAN 10 (2).DOC not located list|F|HRMC\COPY OF HRMC 1-10 NOT LOCATED.XLS'
; WITH CTE AS
(
SELECT T.column1
,T.column2
,RvrsStr = REVERSE(T.column2)
FROM @tab T
)
, Idxs AS
(
SELECT T.RvrsStr
,PDFIdx = CHARINDEX('FDP.', T.RvrsStr)
,TXTIdx = CHARINDEX('TXT.', T.RvrsStr)
,JPGIdx = CHARINDEX('GPJ.', T.RvrsStr)
,TIFIdx = CHARINDEX('FIT.', T.RvrsStr)
,XLSIdx = CHARINDEX('SLX.', T.RvrsStr)
,DOCIdx = CHARINDEX('COD.', T.RvrsStr)
FROM CTE T
)
, SlshIdx AS
(
SELECT
I.RvrsStr
, I.PDFIdx ,NxtSlshAfterPDF = CASE WHEN I.PDFIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.PDFIdx) ELSE 0 END
, I.TXTIdx ,NxtSlshAfterTXT = CASE WHEN I.TXTIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.TXTIdx) ELSE 0 END
, I.JPGIdx ,NxtSlshAfterJPG = CASE WHEN I.JPGIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.JPGIdx) ELSE 0 END
, I.TIFIdx ,NxtSlshAfterTIF = CASE WHEN I.TIFIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.TIFIdx) ELSE 0 END
, I.XLSIdx ,NxtSlshAfterXLS = CASE WHEN I.XLSIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.XLSIdx) ELSE 0 END
, I.DOCIdx ,NxtSlshAfterDOC = CASE WHEN I.DOCIdx <> 0 THEN CHARINDEX('\',I.RvrsStr,I.DOCIdx) ELSE 0 END
FROM Idxs I
)
--select * from SlshIdx
SELECT REVERSE ( S.RvrsStr )
, REVERSE ( SUBSTRING(S.RvrsStr , S.PDFIdx , S.NxtSlshAfterPDF - S.PDFIdx) ) AS PDF
, REVERSE ( SUBSTRING(S.RvrsStr , S.TXTIdx , S.NxtSlshAfterTXT - S.TXTIdx) ) AS TXT
, REVERSE ( SUBSTRING(S.RvrsStr , S.JPGIdx , S.NxtSlshAfterJPG - S.JPGIdx) ) AS JPG
, REVERSE ( SUBSTRING(S.RvrsStr , S.TIFIdx , S.NxtSlshAfterTIF - S.TIFIdx) ) AS TIF
, REVERSE ( SUBSTRING(S.RvrsStr , S.XLSIdx , S.NxtSlshAfterXLS - S.XLSIdx) ) AS XLS
, REVERSE ( SUBSTRING(S.RvrsStr , S.DOCIdx , S.NxtSlshAfterDOC - S.DOCIdx) ) AS DOC
FROM SlshIdx S
March 23, 2012 at 2:44 pm
Thank you.. that was really nice...
March 23, 2012 at 2:45 pm
Thank you.. That was really nice..
Viewing 12 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply