May 7, 2013 at 10:29 pm
OCR means Optical character recognition, it is the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. Some PDFs are scans, so OCR recongnition[/url] would be required, PDF format is well-documented, PDF have multiple columns and the extraction of pdf text needs to use a mature and structure pdf reading app.
May 8, 2013 at 1:11 am
dawnbrown243 (5/7/2013)
OCR means Optical character recognition, it is the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. Some PDFs are scans, so OCR recongnition[/url] would be required, PDF format is well-documented, PDF have multiple columns and the extraction of pdf text needs to use a mature and structure pdf reading app.
Even though this thread is getting old, it was never fully resolved and remains interesting.
Are you able to suggest how to "use a mature and structure (sic.) pdf reading app" in SSIS to solve this problem?
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
May 9, 2013 at 9:51 am
Phil Parkin (5/8/2013)
dawnbrown243 (5/7/2013)
OCR means Optical character recognition, it is the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. Some PDFs are scans, so OCR recongnition[/url] would be required, PDF format is well-documented, PDF have multiple columns and the extraction of pdf text needs to use a mature and structure pdf reading app.Even though this thread is getting old, it was never fully resolved and remains interesting.
Are you able to suggest how to "use a mature and structure (sic.) pdf reading app" in SSIS to solve this problem?
The first thing I would try is invoking the application that extracts the text from the PDF file from a command line (I would expect a "mature" PDF processing application to have a command line interface), probably with an Execute Process task. It would likely be easiest to have the app write the text to a flat file, then use the appropriate connection managers, etc. to ETL those files.
Jason Wolfkill
Viewing 3 posts - 16 through 17 (of 17 total)
You must be logged in to reply to this topic. Login to reply