How to convert PDF data to database table in SQL server

Question

How to convert PDF data to database table in SQL server

keddula

SSC Veteran

Points: 243
More actions
April 28, 2018 at 12:47 am

#410465

Hi,
Can any one suggest on conveting PDF file data to SQL table in SQL server.
Thanks.

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 1

keddula - Saturday, April 28, 2018 12:47 AM
Hi,
Can any one suggest on conveting PDF file data to SQL table in SQL server.
Thanks.

It depends, the PDF can have different file formats, there are several versions and implementations of the PDF standards and the data can be structured in many different ways. We will need a lot more information to answer this question.
😎
The worst case would be if the data is an embedded image such as scan of a printed page.

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 2

keddula - Saturday, April 28, 2018 12:47 AM
Hi,
Can any one suggest on conveting PDF file data to SQL table in SQL server.
Thanks.

You need to identify what you mean by "conveting [sic] PDF file data to SQL table in SQL server".

Are you talking about importing the entire document as a single blob or are you talking about somehow extracting pieces of the document?

If the former, use an OPENROWSET to import the file as a BLOB and store it in a VARBINARY(MAX) column.

If the latter, don't try it in SQL Server. It's just not worth the time. Buy some 3rd party software that will parse the PDF according to a "form" layout and import the results of that as a CSV or TSV.

The real key here is... why not just leave the file on the file system and have SQL Server store the path to it?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 3

Jeff Moden - Sunday, April 29, 2018 7:29 AM
keddula - Saturday, April 28, 2018 12:47 AM
Hi,
Can any one suggest on conveting PDF file data to SQL table in SQL server.
Thanks.
You need to identify what you mean by "conveting [sic] PDF file data to SQL table in SQL server".
Are you talking about importing the entire document as a single blob or are you talking about somehow extracting pieces of the document?
If the former, use an OPENROWSET to import the file as a BLOB and store it in a VARBINARY(MAX) column.
If the latter, don't try it in SQL Server. It's just not worth the time. Buy some 3rd party software that will parse the PDF according to a "form" layout and import the results of that as a CSV or TSV.
The real key here is... why not just leave the file on the file system and have SQL Server store the path to it?

Good points Jeff, I made the mistake of assuming that the data was a tabular set, my bad :Whistling:
😎
I have written a parser or few for digesting PDF in SQL and also PDF generators, nothing too complex when one understands the format.

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 4

Jeff Moden - Sunday, April 29, 2018 7:29 AM
keddula - Saturday, April 28, 2018 12:47 AM
Hi,
Can any one suggest on conveting PDF file data to SQL table in SQL server.
Thanks.
You need to identify what you mean by "conveting [sic] PDF file data to SQL table in SQL server".
Are you talking about importing the entire document as a single blob or are you talking about somehow extracting pieces of the document?
If the former, use an OPENROWSET to import the file as a BLOB and store it in a VARBINARY(MAX) column.
If the latter, don't try it in SQL Server. It's just not worth the time. Buy some 3rd party software that will parse the PDF according to a "form" layout and import the results of that as a CSV or TSV.
The real key here is... why not just leave the file on the file system and have SQL Server store the path to it?

Good points Jeff, I made the mistake of assuming that the data was a tabular set, my bad :Whistling:
😎
I have written a parser or few for digesting PDF in SQL and also PDF generators, nothing too complex when one understands the format.

keddula SSC Veteran Points: 243 More actions · Answer 5

Eirikur Eiriksson - Sunday, April 29, 2018 7:39 AM
Jeff Moden - Sunday, April 29, 2018 7:29 AM
keddula - Saturday, April 28, 2018 12:47 AM
Hi,
Can any one suggest on conveting PDF file data to SQL table in SQL server.
Thanks.
You need to identify what you mean by "conveting [sic] PDF file data to SQL table in SQL server".
Are you talking about importing the entire document as a single blob or are you talking about somehow extracting pieces of the document?
If the former, use an OPENROWSET to import the file as a BLOB and store it in a VARBINARY(MAX) column.
If the latter, don't try it in SQL Server. It's just not worth the time. Buy some 3rd party software that will parse the PDF according to a "form" layout and import the results of that as a CSV or TSV.
The real key here is... why not just leave the file on the file system and have SQL Server store the path to it?
Good points Jeff, I made the mistake of assuming that the data was a tabular set, my bad :Whistling:
😎
I have written a parser or few for digesting PDF in SQL and also PDF generators, nothing too complex when one understands the format.

Thanks for your reply..ðŸ˜€

Requirement is to import the entire PDF (it has columns) into database as a table. I will try using OPENROWSET to import.

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 6

keddula - Sunday, April 29, 2018 8:23 AM
Eirikur Eiriksson - Sunday, April 29, 2018 7:39 AM
Jeff Moden - Sunday, April 29, 2018 7:29 AM
keddula - Saturday, April 28, 2018 12:47 AM
Hi,
Can any one suggest on conveting PDF file data to SQL table in SQL server.
Thanks.
You need to identify what you mean by "conveting [sic] PDF file data to SQL table in SQL server".
Are you talking about importing the entire document as a single blob or are you talking about somehow extracting pieces of the document?
If the former, use an OPENROWSET to import the file as a BLOB and store it in a VARBINARY(MAX) column.
If the latter, don't try it in SQL Server. It's just not worth the time. Buy some 3rd party software that will parse the PDF according to a "form" layout and import the results of that as a CSV or TSV.
The real key here is... why not just leave the file on the file system and have SQL Server store the path to it?
Good points Jeff, I made the mistake of assuming that the data was a tabular set, my bad :Whistling:
😎
I have written a parser or few for digesting PDF in SQL and also PDF generators, nothing too complex when one understands the format.
Thanks for your reply..ðŸ˜€
Requirement is to import the entire PDF (it has columns) into database as a table. I will try using OPENROWSET to import.

Can you provide more information please?
😎
The openrowset will produce a binary blob, not tabular data.

keddula SSC Veteran Points: 243 More actions · Answer 7

Eirikur Eiriksson - Sunday, April 29, 2018 8:39 AM
keddula - Sunday, April 29, 2018 8:23 AM
Eirikur Eiriksson - Sunday, April 29, 2018 7:39 AM
Jeff Moden - Sunday, April 29, 2018 7:29 AM
keddula - Saturday, April 28, 2018 12:47 AM
Hi,
Can any one suggest on conveting PDF file data to SQL table in SQL server.
Thanks.
You need to identify what you mean by "conveting [sic] PDF file data to SQL table in SQL server".
Are you talking about importing the entire document as a single blob or are you talking about somehow extracting pieces of the document?
If the former, use an OPENROWSET to import the file as a BLOB and store it in a VARBINARY(MAX) column.
If the latter, don't try it in SQL Server. It's just not worth the time. Buy some 3rd party software that will parse the PDF according to a "form" layout and import the results of that as a CSV or TSV.
The real key here is... why not just leave the file on the file system and have SQL Server store the path to it?
Good points Jeff, I made the mistake of assuming that the data was a tabular set, my bad :Whistling:
😎
I have written a parser or few for digesting PDF in SQL and also PDF generators, nothing too complex when one understands the format.
Thanks for your reply..ðŸ˜€
Requirement is to import the entire PDF (it has columns) into database as a table. I will try using OPENROWSET to import.
Can you provide more information please?
😎
The openrowset will produce a binary blob, not tabular data.

PDF has data in table format (headers and columns). Want to import this to DB, like we do for CSV files using SQL loader(in oracle).
Need to import the entire data of PDF into SQL table in tabular format, which will be having entire data of PDF including with its columns headers details.

Thank You.

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 8

keddula - Sunday, April 29, 2018 8:59 AM
Eirikur Eiriksson - Sunday, April 29, 2018 8:39 AM
keddula - Sunday, April 29, 2018 8:23 AM
Eirikur Eiriksson - Sunday, April 29, 2018 7:39 AM
Jeff Moden - Sunday, April 29, 2018 7:29 AM
keddula - Saturday, April 28, 2018 12:47 AM
Hi,
Can any one suggest on conveting PDF file data to SQL table in SQL server.
Thanks.
You need to identify what you mean by "conveting [sic] PDF file data to SQL table in SQL server".
Are you talking about importing the entire document as a single blob or are you talking about somehow extracting pieces of the document?
If the former, use an OPENROWSET to import the file as a BLOB and store it in a VARBINARY(MAX) column.
If the latter, don't try it in SQL Server. It's just not worth the time. Buy some 3rd party software that will parse the PDF according to a "form" layout and import the results of that as a CSV or TSV.
The real key here is... why not just leave the file on the file system and have SQL Server store the path to it?
Good points Jeff, I made the mistake of assuming that the data was a tabular set, my bad :Whistling:
😎
I have written a parser or few for digesting PDF in SQL and also PDF generators, nothing too complex when one understands the format.
Thanks for your reply..ðŸ˜€
Requirement is to import the entire PDF (it has columns) into database as a table. I will try using OPENROWSET to import.
Can you provide more information please?
😎
The openrowset will produce a binary blob, not tabular data.
PDF has data in table format (headers and columns). Want to import this to DB, like we do for CSV files using SQL loader(in oracle).
Need to import the entire data of PDF into SQL table in tabular format, which will be having entire data of PDF including with its columns headers details.
Thank You.

Do you know Ptb from PDF?
😎

Jeff can tell you. ...

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 9

keddula - Sunday, April 29, 2018 8:59 AM
Eirikur Eiriksson - Sunday, April 29, 2018 8:39 AM
keddula - Sunday, April 29, 2018 8:23 AM
Eirikur Eiriksson - Sunday, April 29, 2018 7:39 AM
Jeff Moden - Sunday, April 29, 2018 7:29 AM
keddula - Saturday, April 28, 2018 12:47 AM
Hi,
Can any one suggest on conveting PDF file data to SQL table in SQL server.
Thanks.
You need to identify what you mean by "conveting [sic] PDF file data to SQL table in SQL server".
Are you talking about importing the entire document as a single blob or are you talking about somehow extracting pieces of the document?
If the former, use an OPENROWSET to import the file as a BLOB and store it in a VARBINARY(MAX) column.
If the latter, don't try it in SQL Server. It's just not worth the time. Buy some 3rd party software that will parse the PDF according to a "form" layout and import the results of that as a CSV or TSV.
The real key here is... why not just leave the file on the file system and have SQL Server store the path to it?
Good points Jeff, I made the mistake of assuming that the data was a tabular set, my bad :Whistling:
😎
I have written a parser or few for digesting PDF in SQL and also PDF generators, nothing too complex when one understands the format.
Thanks for your reply..ðŸ˜€
Requirement is to import the entire PDF (it has columns) into database as a table. I will try using OPENROWSET to import.
Can you provide more information please?
😎
The openrowset will produce a binary blob, not tabular data.
PDF has data in table format (headers and columns). Want to import this to DB, like we do for CSV files using SQL loader(in oracle).
Need to import the entire data of PDF into SQL table in tabular format, which will be having entire data of PDF including with its columns headers details.
Thank You.

If you mean that you want a grid like structure that resembles a table in the PDF to actually be converted to a table with rows and column in SQL Server, OPENROWSET won't do it for you. You'll need separate software specifically designed to render out such information into a CSV and then import the CSV.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

B_Oakes SSC-Addicted Points: 404 More actions · Answer 10

Sounds to me that you are searching for something like Tabula. This will allow you to extract the data into CSV format (as long as it's a text based document, not a scan) and then you can import the data from the CSV.

Eric M Russell SSC Guru Points: 125623 More actions · Answer 11

keddula - Sunday, April 29, 2018 8:59 AM
PDF has data in table format (headers and columns). Want to import this to DB, like we do for CSV files using SQL loader(in oracle).
Need to import the entire data of PDF into SQL table in tabular format, which will be having entire data of PDF including with its columns headers details.
Thank You.

There are tool, some of them open source, that can extract data from PDF. But, perhaps you're approaching this from the wrong angle. One problem with extracting from something like a binary document is that the data can get corrupted in the process, and another problem is that the format of the document can change without notice, breaking your ETL process. This can happen even with something like Excel but even more so with PDF. If the tabular data contained in the PDF documents originates from a database, then consider extracting the data from the source database.

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

skeleton567 SSCertifiable Points: 5929 More actions · Answer 12

I ran across this post while searching how to do just this task. I have a 776-page hardbound book with ancestry data that I want to import and parse into a SQL database. This was actually found by accident by a cousin exploring the Library of Congress.

So far, I have been fortunate to find, besides the work published in 1940, a very good quality PDF version that was re-published in 2000. I also located a freebie utility that read the file and created a somewhat nicely formatted .TXT version, but with some formatting issues. Then I also discovered another little application that scanned the .pdf file and created a text file for each page - a total of 776 text files! (Not sure the value of THAT yet.)

The large .txt file does contain some issues, as Eric mentions, such as invalid character recognition of some hyphenated words, lots of unwanted carriage returns and special characters. There are many, many indented detail paragraphs scattered among the standard GEDCOM notations. I'm going to use my aged vesion of Visual Studio 2010 Shell to create an import process for the text data version into a main table and attempt to use my also aged SQL capabilities to scan, clean, and parse the data into a custom relational database design. I expect this is going to take lots of visual work comparing the data to the printed version, but it will be worth it. I have NO front-end development skills any more but I used to be pretty good at parsing text data with SQL so I hope I can get this done...if I live long enough. I did, however, contract the publisher and found they had five remaining copies, so I bought all five.

Even in hardback structure, this data has immense value to my family because it has documentation beginning in 1303 in Switzerland and continues down to it's 1940 publication date to include my father's generation. But then we can easily bring the data up to date from where it ends for our one-of-four branches.

This reply was modified 2 years, 3 months ago by skeleton567. Reason: accuracy

Rick
If you do a half-assed job of things, folks will ask 'why did this ass only do half the job?'

pietlinden SSC Guru Points: 63640 More actions · Answer 13

Not meaning to hijack this thread, but I can read PDF files and parse the stuff I need out of them using PowerQuery. Can I call a PowerQuery "script" in whatever SSIS is these days? I guess I can play with it in Azure Data Factory.

Emperor100 SSCrazy Points: 2222 More actions · Answer 14

First convert the PDF to excel format. Remove any unwanted lines, rows or any characters. Verify the data is correctly formatted, then save it as .csv file. Use import/export wizard to import the data to SQL Server table.

=======================================================================