June 28, 2006 at 5:57 pm
I currently have a large number of PDF files on a web server and references to them in my database. I'm moving the files into the database to better control what's there and to make the links easier to manage. Enough on my position on that question.
I want to import the files into a table of their own; assume it has PDFID, PDFName, and PDFImage as the columns. Thanks to an earlier thread, I know how to import a file. I know how to copy a directory listing to an Excel file, clean it up to just the names, and then import it into SQL. Is the following approach good, or is there a better way of doing it?
1. Create a text directory listing
2. Import to Excel and clean up
3. Import to SQL, then loop through all the names to import the files
- This gives me the name reference for the PDFName column
- This guarantees I import all files from the reference directory
This would all be done in VB code, since I can already do every part of it. I haven't managed anything beyond a basic import with the DTS methods, so I'm not sure whether these steps can be done through that utility or not. A skeleton of the loop is sketched below.
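As a minimal sketch, assuming the PDFs sit in one folder (the path is a hypothetical stand-in), VB's Dir function could even stand in for the text listing in steps 1 and 2:

    Dim sName As String
    sName = Dir("C:\PDFs\*.pdf")    ' hypothetical source folder
    Do While Len(sName) > 0
        ' step 3 body goes here: read the file and insert
        ' sName as PDFName and the file contents as PDFImage
        sName = Dir                 ' next file in the folder
    Loop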
June 30, 2006 at 4:05 am
I would change step 2, "Import to Excel and clean up," to:
2. Import to a temp table and clean up
Why, you might say? Excel is really easy to use and has great interactive features.
When you find that Excel adjusts the data type cell by cell, you will see why your step 3 fails for most rows, and for different reasons each time:
Telephone, Zip+4, Social Security Number, Date, and Time columns are changed to arrays of numbers but shown as strings. This is very hard to detect, and it is impossible to get more than the first array member when importing from Excel. The workaround is to export from Excel to a tab-delimited text format before importing into SQL Server.
It is very easy to pick up leading or trailing spaces and non-printable characters like CHAR(160), which you cannot see in Excel and will have to deal with at some point; you may end up with " A" in your States column in SQL instead of "AK", or CHAR(160)+"AK" (see the cleanup sketch after this list).
Excel lets you type anything into any column; try importing "N/A" into a money Amount column in SQL Server.
Excel treats anything it thinks is a number as a number, so Zip = 07004 becomes 7004 in Excel and will have to be cleaned up in SQL; it is very hard to tell bad data entry from Excel being too smart.
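As a sketch of that temp-table cleanup, assuming an open ADO Connection named cn and a staging table named Staging with a PDFName column (both names are assumptions), one statement handles the stray spaces and CHAR(160)s:

    cn.Execute "UPDATE Staging SET PDFName = " & _
               "LTRIM(RTRIM(REPLACE(PDFName, CHAR(160), ' ')))"

REPLACE turns the non-breaking spaces into ordinary ones so LTRIM and RTRIM can strip them from the ends.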
When you get to the part about importing the BLOB, look at the ADODB.Stream object; it works much better than trying to parse the binary file into chunks.
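A minimal sketch of that, assuming an open Connection named cn, a reference to the ADO library, and the table from the first post (the table name PDFs and the file path are assumptions):

    Dim stm As ADODB.Stream
    Dim rs As ADODB.Recordset

    Set stm = New ADODB.Stream
    stm.Type = adTypeBinary                    ' raw bytes, no text conversion
    stm.Open
    stm.LoadFromFile "C:\PDFs\sample.pdf"      ' hypothetical file

    Set rs = New ADODB.Recordset
    rs.Open "SELECT PDFName, PDFImage FROM PDFs WHERE 1 = 0", _
            cn, adOpenKeyset, adLockOptimistic ' empty but updatable
    rs.AddNew
    rs("PDFName").Value = "sample.pdf"
    rs("PDFImage").Value = stm.Read            ' whole file in one call
    rs.Update

    rs.Close
    stm.Close

The WHERE 1 = 0 just opens an empty recordset over the right columns so AddNew can do the insert.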
OK, off the soap box now...
Andy
June 30, 2006 at 10:11 am
Thanks for the reply. When I create a listing of the directory contents, it is a text file with a header, timestamps, and a few other things I don't want. I clean it up by deleting those items, leaving just the file names, and I add an incrementing column to serve as the index/ID column of a temp table. My code determines the number of entries in the table and sets the range for the loop. Then it's just a matter of looping through, using the name as the PDFName value and the contents of the file as the PDFImage value. Within the loop I can test whether the PDF file is already in the table and skip it, eliminating the duplicate issue/error; a sketch of that test is below.
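A sketch of that test, again assuming a Connection named cn and a table named PDFs, with sName holding the current file name (quotes doubled so an apostrophe in a name doesn't break the SQL):

    Dim rsDup As ADODB.Recordset
    Set rsDup = cn.Execute("SELECT COUNT(*) AS Cnt FROM PDFs " & _
        "WHERE PDFName = '" & Replace(sName, "'", "''") & "'")
    If rsDup("Cnt").Value = 0 Then
        ' not there yet: load the file and insert it
    End If
    rsDup.Close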
As I said, I do this in VB code. I'm not very good with DTS, but I wondered whether there was a way to set something up in SQL to do this. Initially I have to import 1,000 files, then a much smaller number each week. I can always use the same directory structure, and the files will always go into the same table. That's why I posted this in the strategy section.
July 7, 2006 at 2:03 am
Whoops, I really missed that one; sorry about the Excel bashing.
Here is a discussion that may help.
http://www.sqlservercentral.com/forums/shwmessage.aspx?forumid=9&messageid=205450
Andy