May 10, 2007 at 9:45 am
Hi,
I have a huge number of scanned images (TIF) that I need to make searchable.
How can I bulk upload and store these image files in SQL Server?
Once stored, how can I search the text within these images? What functionality within SQL Server will enable me to do this kind of search?
Any suggestions as to how this can be done?
Any help is much appreciated.
Many Thanks,
Sathya
May 11, 2007 at 12:29 am
Some questions first.
1.) What is a huge number?
2.) What is the average physical (storage) size of the images?
3.) I don't understand what you mean by "text within the images".
4.) What platform are you on (OS version, SQL Server version, hardware)?
May 11, 2007 at 2:50 am
Hi, Thanks for replying.
1. A few thousand (more than 10,000).
2. Each image is around 300 KB - 400 KB.
3. The images are the result of scanning paper-based documents. They are in TIF format and have a number as their file name. But I want to search for a particular image based on the contents within the image. For example, a document (image) has the name of a person within it, and I want to search using that name. There seems to be no metadata associated with these files, so I am really stumped as to how I can search for the document that relates to a particular person.
4. SQL Server 2000 running on a new server / hardware (but I don't know the exact details). You could also advise me on this.
Cheers!
May 11, 2007 at 5:45 am
Adding them to a table is not terribly difficult. I would suggest a VB or C# application that just inserts them in a loop. As for searching the text, you need some kind of application to extract the text from the TIFF images. I would suggest you look on the internet for OCR software to do this for you.
I have found some freeware ones in the past for a project or two.
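The loop itself is only a few lines of VB. Something along these lines would do (an untested sketch; the folder path and the UploadDocument routine are placeholders for whatever per-file insert logic you end up writing):
Dim strFolder As String
Dim strFileName As String
strFolder = "C:\ScannedDocs\"           'assumed source folder
strFileName = Dir(strFolder & "*.tif")  'first matching file
Do While Len(strFileName) > 0
    'Derive the DocumentID from the numeric file name, then hand off to the insert routine
    UploadDocument strFolder & strFileName, Left$(strFileName, Len(strFileName) - 4)
    strFileName = Dir                   'next file
Loop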
May 12, 2007 at 1:35 pm
As has already been pointed out, you will need to run the images through an OCR package. The best imaging tools I have found are from one of the industry leaders, LEAD TOOLS. They have complete imaging, document scanning, and image management suites that are fantastic. The only problem I see is that OCR, though very good, is not perfect and is very dependent on the quality of the scan and the fonts of the scanned documents. This is NOT a small project, and you may be better off using an off-the-shelf application designed for this purpose; there are many, ranging from small office to large enterprise, and 10,000 documents falls into the small-business multi-user category. You may even want to look at applications from office automation companies like Canon, Sharp, Toshiba, Minolta, and Ricoh.
That said....
(assumptions: You know VB, you are on an MS platform, you have a decent budget)
It is best to read the images from the file system using the ADO Stream object, as it is well suited to BLOB manipulation in SQL Server.
Dim objImgStream As ADODB.Stream
Set objImgStream = New ADODB.Stream
'Set the stream type to binary
objImgStream.Type = adTypeBinary
'Open the stream and load the image file from disk
objImgStream.Open
objImgStream.LoadFromFile strFileName
Once you have the binary, you can easily move it between objects, such as feeding it into the OCR engine and into an ADO recordset for saving to the database. This method is much faster than the old GetChunk/AppendChunk approach. So you would do something like this for each image:
Dim objRS As New ADODB.Recordset
Dim strSQL As String
'Build the SQL to get Blob for update/insert
strSQL = "Select * " _
& "From Documents " _
& "Where DocumentID = '" & strDocumentID & "' "
>> This assumes the ID can be derived from the file name or the document itself. That is important to eliminate the possibility of duplicates and to positively link the file-based image to the database record, since once the database save is complete you will delete the file-based image.
'Rewind the stream (always remember to rewind the stream)
objImgStream.Position = 0
objRS.CursorType = adOpenKeyset
objRS.LockType = adLockOptimistic
objRS.Open strSQL, gobjConn, , , adCmdText
If Not objRS.EOF Then
'Just update the existing Image
objRS("DocumentText") = strOCRTextDocument '>> From the OCR SDK
objRS("DocumentImage") = objImgStream.Read
objRS("EditTS") = now() '>> or could be set using an update trigger on the table
Else
'Add a new record
objRS.AddNew
objRS("DocumentID") = strDocumentID
'>> again assumes this is given from the filename (your choice)
objRS("DocumentText") = strOCRTextDocument '>> from the OCR SDK
objRS("DocumentImage") = objImgStream.Read
End If
'Call the ADO Update method (note: this is the fastest way to add BLOB data to SQL Server)
On Error GoTo Err_handler
objRS.Update
On Error GoTo 0
objRS.Close
Set objRS = Nothing
DoEvents
Exit Sub
Err_handler:
Err.Raise -1, "Upload", "Upload | " & Err.Description
Just spin through the documents one at a time, feed them to the OCR object, and save them to the database.
All you have to do now is write the whole management and viewing system (which is why you may want to find a package that is already developed).
Hope this helps.....
May 14, 2007 at 9:11 am
You can save your TIFF files to an image (BLOB) column, but making them searchable could be another story. You are basically saving a huge binary stream that is mostly non-printable characters. Microsoft makes some filters that would allow you to put a full-text index on these if they were Office docs, HTML, or text, but I don't think they support the TIFF format. See BOL for an overview of full-text catalogs and searches. You could do a quick experiment with the text filter to see if it would give you what you need. If memory serves, there are multiple versions of TIFF, and you may have to write a DLL in C or C# to parse these files byte by byte to get to any text that is stored; kind of a pain.
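If you do end up keeping the OCR'd text in a character column (like the DocumentText column in the code above), the SQL Server 2000 full-text setup is only a handful of system procedures. Roughly this, as a sketch (it assumes the Documents table has a unique index named PK_Documents on DocumentID, and it reuses the gobjConn connection from the earlier code):
'One-time full-text setup for the OCR'd text column
gobjConn.Execute "EXEC sp_fulltext_database 'enable'"
gobjConn.Execute "EXEC sp_fulltext_catalog 'DocCatalog', 'create'"
gobjConn.Execute "EXEC sp_fulltext_table 'Documents', 'create', 'DocCatalog', 'PK_Documents'"
gobjConn.Execute "EXEC sp_fulltext_column 'Documents', 'DocumentText', 'add'"
gobjConn.Execute "EXEC sp_fulltext_table 'Documents', 'start_full'"
'Searching the OCR'd text for a name then looks like this
Dim objSearchRS As New ADODB.Recordset
objSearchRS.Open "SELECT DocumentID FROM Documents " & _
                 "WHERE CONTAINS(DocumentText, '""John Smith""')", _
                 gobjConn, adOpenForwardOnly, adLockReadOnly, adCmdText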
May 15, 2007 at 6:17 am
Thanks for some useful suggestions, especially Donald for the piece of code.
Unfortunately, the application or tool that I am developing is just an interface to view archived scanned documents. It will not be a frequently used application, and spending money on an out-of-the-box solution is out of the question at the moment. It will only be used by me to search for and find a document 3 or 4 times a week.
As you have all explained, I could use Donald's code to upload the TIF files into SQL Server. But how do I make them searchable when there is no metadata associated with them? All these files probably need to be OCRed to be searchable.
I have been looking for free OCR software and came across Microsoft Office Document Imaging (MODI), a free program that comes with Office XP and later versions. It opens a scanned document (in this case a TIF file) and has an option to 'recognize text using OCR'. This OCRs the file, and I can save the result in another location. MODI also has a search function that you can use to find a file based on the text within it. I could just use this program as my application to look up these files, but I need to OCR each one of them first.
Now that's great, but as I mentioned, I have 10,000+ TIF files. I cannot import and OCR them one by one. Can anyone suggest how I can overcome this?
Thanks for all your replies.
May 15, 2007 at 7:31 am
Since you're storing the OCR'd text, you could just go with plan A: when someone requests the record, return the stored OCR text if it exists, or actually run the OCR right then, store the result, and return it. You could supplement that with a background job that just works through the records doing the OCR. It might take days or weeks, but as long as it's low intensity, who cares?
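A rough sketch of that background pass, reusing the table and column names from the code above (OCRImage here is a placeholder for whatever OCR call you settle on):
Dim objRS As New ADODB.Recordset
objRS.CursorType = adOpenKeyset
objRS.LockType = adLockOptimistic
'Pick up only the documents that have not been OCR'd yet
objRS.Open "SELECT DocumentID, DocumentImage, DocumentText " & _
           "FROM Documents WHERE DocumentText IS NULL", _
           gobjConn, , , adCmdText
Do While Not objRS.EOF
    objRS("DocumentText") = OCRImage(objRS("DocumentImage").Value)  'placeholder OCR call
    objRS.Update
    objRS.MoveNext
    DoEvents  'keep the app responsive and the load low
Loop
objRS.Close
Set objRS = Nothing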
May 15, 2007 at 7:37 pm
Unfortunately, as I said above, the OCR is the issue, which is why I included the OCR object in the process. If you look at the code, you will see that the TIFF gets OCR'd as part of uploading the raw TIFF to the database. Saving the TIFF image along with the OCR'd text gives you a reference in case the OCR was not perfect. Then put a full-text index on that baby... and pow. The OCR can be done without any user intervention, depending on the quality and fonts of the TIFF images.
If you need some examples of how this works go to
http://www.leadtools.com/SDK/Document/Document-Addon-OCR.htm
You might be able to find a less expensive OCR SDK if you look, or the MODI object model may be extensible and usable from VB or .NET.
We have done this on a butt-load more images than 10,000, and it goes pretty fast (again, as long as you keep user intervention to a minimum).
Look at the code again and you will see where to plug in the OCR routine.
Good luck... If I find any cheaper alternative to the LEAD Tools I will let you know.
May 15, 2007 at 8:55 pm
There is one more option:
You can get a program that bulk-converts the files and then upload the images and text using the code above. Some of these programs will output the bulk results into comma-delimited files that you could load into Excel and search with Find; each row contains the text and the name of the original TIFF.
http://reviews.cnet.com/Presto_OCR_Pro_4_0/4540-3523_7-6759728-4.html?tag=sub
http://www.abbyy.com/finereader_ocr/
http://www.lucion.com/filecenterpro-overview.html
http://www.irislink.com/c2-480/Readiris-Pro-11-OCR-software.aspx?gclid=CMjJooDTkYwCFRs_ggodoTKlAQ
http://www.digitalriver.com/v2.0-img/operations/scansoft/site/html/omnipage/omnipage_pro15_mac.html
http://www.simpleocr.com/ (free ocr sdk)
http://www.transym.com/Sales.html
http://www.ilixis.com/developer/recog_shareware.html !!! Looks like the MODI may be extensible
http://www.sharetool.com/Utilities/Scanner_and_OCR/CuneiForm_OCR_9982.html
OK, I got a little link-happy, but... you get the idea.
Cheers
May 15, 2007 at 9:00 pm
Use this as your OCR object, pop it into the code I posted, slap a full-text index on the results, and you are done.
http://www.ilixis.com/developer/modi.html
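For what it's worth, the MODI object model is scriptable from VB once you add a reference to the Microsoft Office Document Imaging type library, so the OCR step could look roughly like this (a sketch only; strFileName and strOCRTextDocument are the variables from the upload code):
Dim objMiDoc As MODI.Document
Dim i As Integer
Set objMiDoc = New MODI.Document
objMiDoc.Create strFileName              'load the TIFF from disk
objMiDoc.OCR miLANG_ENGLISH, True, True  'recognize text with auto-rotate/straighten
strOCRTextDocument = ""
For i = 0 To objMiDoc.Images.Count - 1   'multi-page TIFFs have one image per page
    strOCRTextDocument = strOCRTextDocument & objMiDoc.Images(i).Layout.Text & vbCrLf
Next i
objMiDoc.Close False                     'close without saving changes
Set objMiDoc = Nothing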
May 15, 2007 at 11:51 pm
I strongly recommend that you do NOT store images in your database... store the file names in the database and let the app load them.
See my next post below for why I struck this comment out.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 20, 2007 at 1:10 pm
In my opinion, having worked with large image databases for many years now, storing the images in the database is far superior to storing them in what is now an old-fashioned way.
When you store an image in the file system (which, by the way, is merely a file-oriented database), you need to first look up the path and then get the image. That is TWO queries, and if the path and image ever get out of sync... nightmare.
Further, when you store a lot of images in the file system (in our case over 5 million and growing daily), you have to split the images across multiple folders or OS performance will suffer. God forbid you ever have to move that myriad of folders and images. With a database, depending on how it's partitioned, it's one file.
If you review the Microsoft TerraServer ("terabyte database") project, designed in the late '90s to test the performance and feasibility of storing binary satellite image data in terabyte-scale databases, you will see that storing images in the database as BLOBs is even a Microsoft best practice. Further, they have optimized many of the tools for managing images in the database, which gives a lower total cost of ownership and management, as well as improved performance.
We have been storing millions of images in the database, and it is fantastic. I can say that if we had tried to store and manage this many images across a file system, it would have been far more work and cost, with a reduction in flexibility and functionality.
We ran speed tests retrieving images out of the database. It is more than two times faster when measured over a large number of retrievals (e.g. 10,000).
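Getting an image back out for viewing is the same Stream trick in reverse. Roughly this, as a sketch (column names follow the upload code; the temp-file path is just an example):
Dim objRS As New ADODB.Recordset
Dim objOutStream As New ADODB.Stream
objRS.Open "SELECT DocumentImage FROM Documents " & _
           "WHERE DocumentID = '" & strDocumentID & "'", _
           gobjConn, adOpenForwardOnly, adLockReadOnly, adCmdText
If Not objRS.EOF Then
    objOutStream.Type = adTypeBinary
    objOutStream.Open
    objOutStream.Write objRS("DocumentImage").Value  'push the BLOB into the stream
    objOutStream.SaveToFile "C:\Temp\" & strDocumentID & ".tif", adSaveCreateOverWrite
    objOutStream.Close
End If
objRS.Close
Set objRS = Nothing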
I am mystified as to why anyone would manage images the old way with the tools available today.
When working with a large volume of images, I very strongly recommend that you do NOT store them in the file system, and DO store them in a database.
November 17, 2015 at 5:02 am
Donald Burr (5/15/2007)
Use this as your OCR object, pop it into the code I posted, slap a full text index on the results and you are done.
http://www.ilixis.com/developer/modi.html
This helped me a lot :-D
November 17, 2015 at 8:28 am
After some recent experiences, I have to strongly agree with what Donald wrote above.
I have a phone system where the original programmers stored the recorded calls (similar to storing images) in the database. They also stored the path to the original file. Of course, I was mortified that folks had stored the binary call recordings in the database and immediately set out to correct that problem.
However, when I did the due diligence to make sure that the files at the provided paths were actually viable, I found that more than 10% of the files were missing and another 10% were corrupt.
The bottom line is that no one takes care of data better than a good DBA and so I must totally reverse the opinion that I previously stated. The data has to be stored somewhere... it might as well be under the wing of someone dedicated to protecting it and that would be in the database.
As Donald also pointed out, there are other benefits, as well. Yes, it does make for a huge database (mine is almost 400GB and growing) and most of that is in a single table. Learn to deal with it because it's worth it.
--Jeff Moden
Change is inevitable... Change for the better is not.