December 6, 2024 at 6:40 am
Hi Experts,
I am trying to find duplicate documents stored in our FILESTREAM database. I have come across multiple documents with the same size, the same filename, and created timestamps within minutes or seconds of each other.
Is there a better way to identify the duplicate files?
Regards
December 6, 2024 at 1:04 pm
SQL Server is not going to be able to validate duplicate documents. It just doesn't work like that. Maybe with SQL Server 2025 and the new AI functionality, but even then. Your best bet would be to move this data to some kind of search-engine-focused storage that has the ability to look at the docs themselves. Otherwise, you're looking at what you've got: size, name, create date. SQL Server CAN store documents, but SQL Server isn't good at storing documents.
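As a rough illustration of the two options mentioned here (metadata matching versus looking at the documents themselves), the sketch below first groups documents by name, size, and create-time proximity, then compares a SHA-256 hash of the bytes within each group. It assumes the documents have already been read out of the database into memory; the `Doc` type, the function names, and the five-minute window are hypothetical and not taken from any system discussed in this thread.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timedelta
from typing import NamedTuple


class Doc(NamedTuple):
    """Hypothetical in-memory view of one stored document."""
    doc_id: int
    name: str
    size: int
    created: datetime
    content: bytes


def candidate_groups(docs, window=timedelta(minutes=5)):
    """Group docs that share name and size and whose create times
    fall within `window` of the previous doc in the same run."""
    by_key = defaultdict(list)
    for d in docs:
        by_key[(d.name, d.size)].append(d)

    groups = []
    for same in by_key.values():
        same.sort(key=lambda d: d.created)
        run = [same[0]]
        for d in same[1:]:
            if d.created - run[-1].created <= window:
                run.append(d)
            else:
                if len(run) > 1:
                    groups.append(run)
                run = [d]
        if len(run) > 1:
            groups.append(run)
    return groups


def confirmed_duplicates(docs, window=timedelta(minutes=5)):
    """Within each metadata-based candidate group, return lists of
    doc_ids whose content hashes are identical."""
    result = []
    for group in candidate_groups(docs, window):
        by_hash = defaultdict(list)
        for d in group:
            by_hash[hashlib.sha256(d.content).hexdigest()].append(d.doc_id)
        result.extend(ids for ids in by_hash.values() if len(ids) > 1)
    return result
```

The metadata pass only narrows the search. Two files with the same name, size, and near-identical timestamps can still differ in content, which is why the hash comparison is the step that actually confirms a duplicate.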
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 12, 2024 at 5:04 am
Thank you Grant.
Multiple applications are storing documents, and there is no metadata to identify which one is storing what. I am planning to propose that each application supply some identification when inserting a document, preferably a client ID and secret or an application ID, something unique to each module. Do you have any suggestions or advice?
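One possible shape for that proposal, as a minimal sketch only: each application builds a small metadata record at insert time and stores it alongside the document. The `DocumentMetadata` type, its field names, and the `describe_document` helper are all hypothetical. The content hash is an extra assumption beyond the proposal above, and the record holds only a non-secret application ID rather than any client secret.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class DocumentMetadata:
    """Hypothetical record saved next to each stored document."""
    application_id: str   # assumed identifier, unique to each module
    file_name: str
    size_bytes: int
    sha256: str           # content hash, useful for later duplicate checks
    inserted_at: datetime


def describe_document(application_id: str, file_name: str, content: bytes,
                      now: Optional[datetime] = None) -> DocumentMetadata:
    """Build the metadata record; reject inserts with no application ID."""
    if not application_id:
        raise ValueError("application_id is required")
    return DocumentMetadata(
        application_id=application_id,
        file_name=file_name,
        size_bytes=len(content),
        sha256=hashlib.sha256(content).hexdigest(),
        inserted_at=now or datetime.now(timezone.utc),
    )
```

For example, `describe_document("billing", "invoice.pdf", data)` would yield a record that says which module inserted the file and carries a hash that can be compared against later inserts.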
December 13, 2024 at 12:58 pm
Suggestions? Not as such. Metadata is really going to be the only thing you can do, so it's a good place to focus.
December 17, 2024 at 8:41 pm
What type of files are you doing this for? Plain text, CSV, TSV, PDF, what?
--Jeff Moden
Change is inevitable... Change for the better is not.