August 19, 2009 at 1:31 am
Hi All,
I have a varbinary column in which I have stored files. How can I search for words within those files? Please help us.
Thanks In Advance
Jana
August 19, 2009 at 6:53 am
Can you explain why you are using varbinary column to store plain text?
In any case, here is a small demo that shows you what to do when the text is stored as binary data. I recommend that you consider using one of the string data types (char, varchar, nchar or nvarchar) instead of the varbinary data type.
create table Demo (vb varbinary(100))
go
--Inserting 3 records into the table
insert into Demo (vb)
select convert(varbinary(100),'Why would you use a varbinary column?')
union
select convert(varbinary(100),'When you store plain text,')
union
select convert(varbinary(100),'you should use one of the string data types')
go
--Getting all the values as binary values
select vb from Demo
--Getting all the values as string by using convert function
select convert(varchar(100), vb) from Demo
--Getting only one of the records by using convert and like operator
--Notice that this will prevent the use of an index, and if the table
--is big, you might have performance problems.
select convert(varchar(100), vb) from Demo where convert(varchar(100),vb) like '%store%'
go
drop table Demo
Adi
--------------------------------------------------------------
To know how to ask questions and increase the chances of getting answers:
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
August 20, 2009 at 3:15 am
We are not storing plain text. We are storing the file in binary format in a varbinary column, and it is that binary content we need to search for text.
Thanks In Advance
August 20, 2009 at 9:43 am
Create a full-text index and use CONTAINS to search the contents of a binary file. However, you'll need to have a separate column that contains the file type (e.g., doc, ppt, pdf, txt, etc.) of the binary document.
This column is specified when creating the full-text index. It is used by the full-text indexer to determine which iFilter to load in order to "read" through the binary and tokenize the terms (words) found in it.
See post 768830 for a set of links that will get you started with full-text.
http://www.sqlservercentral.com/Forums/FindPost768830.aspx
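A minimal sketch of that setup might look like the following. The table, column, and catalog names here are hypothetical, and it assumes the appropriate iFilter for each file type is installed on the server:

```sql
--Hypothetical table: binary documents plus a companion file-type column
create table dbo.Docs (
    DocId int not null constraint PK_Docs primary key,
    DocType varchar(8) not null,        --e.g. 'doc', 'ppt', 'pdf', 'txt'
    DocContent varbinary(max) not null
)
go

create fulltext catalog DocsCatalog
go

--TYPE COLUMN tells the full-text indexer which iFilter to load per row
create fulltext index on dbo.Docs (DocContent type column DocType)
    key index PK_Docs on DocsCatalog
go

--Search the tokenized contents of the binary documents
select DocId from dbo.Docs where contains(DocContent, 'store')
```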
August 27, 2009 at 3:46 pm
I have a reason why I might want to do this.
My company specializes in storing and full-text indexing PDF files. I am, however, a little bit suspect of Microsoft's use of Adobe's iFilter. (It is so encapsulated that there is little to do when something goes wrong). If, however, my program were to use PDFBox to extract the text, that would be great. In that case, I could simply add the file to the database in a varbinary(max) FILESTREAM column and add the text in a varchar(max) field or something.
Another issue raises itself, however, and that is database size. I currently have a customer with a single database of scanned PDF files that exceeds 100 gigabytes. I know that FILESTREAM data doesn't count against the 4 GB "Express Edition" database size limitation, but a varchar(max) field does, so for this database to be viable under SS2008EE, I would somehow have to avoid the 4 GB of space taken up by the textual contents of the PDF files. If SQL Server were able to take text converted to binary and index it, then I could create a full-text index on that varbinary field.
So, from BLOB --> varbinary(max) (actual file) --> text contents --> varbinary(max) (text contents)
Is there a way to make this work?
Thanks!
August 28, 2009 at 7:37 am
Database size issues aside, as that is a business / environmental issue, I would still give the Adobe iFilter a try.
We are a SaaS-based company, hosting databases for our customers. They have the ability to upload and store PDF documents, along with Microsoft Word and PowerPoint files. Our application also provides the ability to search over the content contained in these documents.
We're currently using 32-bit SQL Server 2005 with the Adobe iFilter that is provided with the Adobe Reader 9.1.
I am, however, a little bit suspect of Microsoft's use of Adobe's iFilter. (It is so encapsulated that there is little to do when something goes wrong).
The issue is generic, regardless of the iFilter provider. I.e., Microsoft's Office iFilter has the same limitations.
August 28, 2009 at 8:10 am
I understand this to be a limitation of the iFilter setup in general. My assertion is, though, that if I move AWAY from the iFilter setup (or at least use the plain text iFilter, which has been much faster and more accurate in my testing), and use, say, PDFBox for text extraction from a PDF, then that process becomes much more granular and controlled by me. The problem with that solution is that it would require the insertion of that text output into the database for indexing. If, however, I can get that text into a binary format, yet still be able to run the text-only iFilter against it, then I could yet again circumvent the SS2008EE database size limitations.
Am I being clear with my question?
Current setup:
PDF Files (on filesystem) --> Database BLOBs (as varbinary(max) FILESTREAM) --> Adobe iFilter --> gives me a database size of roughly (Total Size of PDF files / 20)
Theoretical setup:
PDF Files (on filesystem) --> extract text using PDFBox --> Now have related PDF file and text contents --> Database BLOB (1 for each) as varbinary(max) FILESTREAM --> Text iFilter (on TEXT varbinary(max), NOT PDF varbinary(max)) --> results in unknown database size
Basically, I'd like to get around using Adobe's iFilter (I don't mind Microsoft's text iFilter) and still maximize the number of records I can store in a database while still coming in under the 4 GB size limit.
Thanks!
August 28, 2009 at 8:37 am
You should be able to full-text index varbinary(max) data that is pure "text". Just specify a value of "txt" for the companion file type column.
I don't know whether the documents you are trying to index are all of a single language/culture (e.g., "en-US") or multi-language. When you extract the "text" from a document and then index it as plain text, you lose any language-specific tokenization, because the language "meta" tag is lost. In other words, you cannot have a table containing rows of plain text in mixed languages indexed under the language-specific tokenization rules for each row's language. This is a very unfortunate SQL Server limitation. BTW, in Oracle you can, because you specify a column that contains the 3-character ISO language code. Granted, in SQL Server you could have a separate column for each language, but that is an ugly mess.
As for space savings, why not also store the extracted text column of varbinary(max) as FILESTREAM?
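As a sketch of that idea (all names hypothetical; assumes the database has a FILESTREAM filegroup, FILESTREAM is enabled, and a default full-text catalog exists), the extracted text could be stored and indexed like this:

```sql
--Hypothetical table: original PDF plus extracted text, both as FILESTREAM
--so neither counts against the Express Edition size limit
create table dbo.PdfDocs (
    DocId uniqueidentifier rowguidcol not null
        constraint PK_PdfDocs primary key default newid(),
    PdfContent varbinary(max) filestream null,   --the original PDF file
    TextContent varbinary(max) filestream null,  --plain text extracted by PDFBox
    TextType varchar(8) not null default 'txt'   --selects the plain-text iFilter
)
go

--Full-text index only the extracted text, not the PDF binary
create fulltext index on dbo.PdfDocs (TextContent type column TextType)
    key index PK_PdfDocs
go
```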