Trouble with getting FileName from .pdf file

  • Hello All,

    This is what i have to do:

    1. Read about a few thousand .pdf files

    2. Store their FileName in a SQL Table

    3. Run a few queries to look them against a database

    4. Move them in the end to a different folder.

    I have limited experience using SSIS, however the last 2 steps are fairly easy.

    I tried implementing according to what the author mentioned here in this article:

    http://www.sqlservercentral.com/articles/Integration+Services+%28SSIS%29/67871/

    However it fails because of the number of files, i am only successful in reading one file.

    This is what i have done:

    1. Foreach Loop to look into the folder.

    a. Retrieve file name using Name and extension

    b. Variable(MyFileValue) Value - *.pdf

    2. Use a script in the foreach loop to get the file name

    Public Sub Main()

    Dts.Variables("MyFileValue").Value = System.IO.Path.GetFileName(Dts.Variables("MyFileValue").Value.ToString())

    Dts.TaskResult = Dts.Results.Success

    End Sub

    3. Use a Dataflow task - And this is where i am confused, how do i add a variable as an input.

    I am using a Derived Column to capture the information and OLE DB Destination thereafter.

    As you can tell i am confused, any help would be appreciated.

  • Are you using a flat file source?

    You need to go into the Expressions section for the flat file source and set the file name/path to your file-path variable using an Expression. Then your loop should work - depending on what is in your dataflow. If there is anything in there which is file-format dependent and all of the files are not in this format, you will get errors.

    At this point you should probably also set DelayValidation to true for the flat file source.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • If I'm not mistaken, you just need to store the filenames of the PDF file into a table, not the actual data.

    In your for each loop, you can set it to a For Each File Enumerator. Point it to the folder and set *.pdf to files.

    In the variable mappings, you can map a variable to the currently processed filename.

    Select a variable and put 0 for the index.

    Now you have a variable that contains the filename of the currently processed PDF file.

    Add an Execute SQL Task in the For Each Loop and type the following query:

    INSERT INTO dbo.myPDFFiles(filename) VALUES(?)

    The question mark is a placeholder for a parameter. Map the variable in the parameter mapping pane of the Execute SQL Task (parameter name should be 0).

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Hello Koen,

    Wonderful option, worked like a charm. I guess i was over thinking into the problem.

    Now the other confusing part.

    So this is what i am doing:

    1. Read file name and write the file name in a table. (7000+ rows)

    2. Run a query to see if the file exists in the system. ( 10 + minutes)

    3. If it does i need to archive this file in a different folder.

    The 3rd step is confusing me, how should i tie this filename back to the original file in the folder?

    Also is there a way to reduce the execution time?

    These are the queries i am using:

    Query 1 :

    INSERT INTO dbo.FileName(FILE_NAME) VALUES(?)

    Query 2 :

    select

    REPLACE(substring(FILE_NAME,25,10) ,'.PDF','') as FILENAME,LOAD_DATE

    from dbo.FileName f

    WHERE

    REPLACE(substring(FILE_NAME,25,10) ,'.PDF','') IN

    (SELECT BatchNo FROM dbo.vwBatchNumber)

  • So this what i have changed again

    1. Read file name and write the file name in a table. (7000+ rows)

    2. Run a query to see if the file exists in the system. ( 10 seconds)

    3. Write the successful files into a table and append the file name with the archive folder path

    Now how do i pass this row to a variable to move this file from the actual destination to the archive folder?

    Thanks!!

  • I'll look into your problem after the weekend. Just so that you know I haven't forgot you 🙂

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • What exactly are you doing in step 2?

    When I look at your query, I guess you are selecting the filenames from the table populated in step 1, but only those who also have a record in another table. Am I right?

    To speed up this query, you can try the following:

    1. Create a computed column to store the filename. It has the following expression: REPLACE(substring(FILE_NAME,25,10) ,'.PDF',''). Let's call it filename_short.

    2. Create a non-clustered index on that computed column. (optional)

    3. Rewrite the query:

    SELECT

    [Filename_Short],[LOAD_DATE]

    FROM dbo.[FileName] f

    WHERE EXISTS (SELECT 1 FROM dbo.vwBatchNumber bn WHERE f.[Filename_Short] = bn.BatchNo);

    Normally, this should speed things up a little.

    Step 3:

    Store the results of the above query into a full resultset. Store this resultset into an object variable.

    Loop over this object variable with a For Each Loop, using the ADO.NET collection. You can map the value of the current filename to a variable in the variable mappings pane of the For Each Loop. Inside the loop, place a FileSystem Task to archive the file. You can use the variable to create dynamic expressions.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Thanks for the update Koen, i will work on this solution right away.

    Thanks!

    Abhi

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply