Duplicate Documents

  • Hi Experts,

    I am trying to find duplicate documents stored in our Filestream database. I have come across multiple documents with the same size, the same filename, and created timestamps within minutes or seconds of each other.

    Is there a better way to identify the duplicate files?

     

    Regards

  • SQL Server is not going to be able to validate duplicate documents. It just doesn't work like that. Maybe with SQL Server 2025 and the new AI functionality, but even then. Your best bet would be to move this data to some kind of search engine focused storage that has the ability to look at the docs themselves. Otherwise, you're looking at what you've got. Size, name, create date. SQL Server CAN store documents, but SQL Server isn't good at storing documents.
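
    To illustrate what "looking at the docs themselves" could mean outside the database, here is a minimal sketch, not something from this thread. It assumes the document bytes have already been exported as (doc_id, content) pairs; the function name `find_duplicates` and that input shape are hypothetical. It uses size as a cheap first filter, then compares SHA-256 digests of the content.

```python
import hashlib
from collections import defaultdict


def find_duplicates(documents):
    """Group documents whose bytes are identical.

    `documents` is an iterable of (doc_id, content_bytes) pairs, assumed to
    have been exported from the database beforehand (hypothetical input).
    """
    # Cheap first pass: only documents of equal size can be byte-identical.
    by_size = defaultdict(list)
    for doc_id, content in documents:
        by_size[len(content)].append((doc_id, content))

    # Second pass: hash only documents that share a size with another one.
    by_hash = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        for doc_id, content in same_size:
            digest = hashlib.sha256(content).hexdigest()
            by_hash[digest].append(doc_id)

    # Keep only the groups that contain more than one document.
    return [ids for ids in by_hash.values() if len(ids) > 1]
```

    This only catches byte-for-byte copies. Two files with the same visible content but different embedded metadata (a re-saved PDF, for example) will hash differently, so name, size, and create date remain useful as supporting evidence.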

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey wrote:

    SQL Server is not going to be able to validate duplicate documents. It just doesn't work like that. Maybe with SQL Server 2025 and the new AI functionality, but even then. Your best bet would be to move this data to some kind of search engine focused storage that has the ability to look at the docs themselves. Otherwise, you're looking at what you've got. Size, name, create date. SQL Server CAN store documents, but SQL Server isn't good at storing documents.

    Thank you Grant.

    Multiple applications are storing documents, and there is no metadata to identify which application stored which document. I am planning to propose that each document be tagged with some identifier at insert time, preferably a client ID and secret or an application ID, something unique to each module. Do you have any suggestions or advice?
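
    A rough sketch of what that tagging might look like, using SQLite from the Python standard library as a stand-in for the real database. The table `document_meta`, its columns, and the `register_document` helper are all hypothetical names. The content hash column is an extra assumption on top of the application ID, added so that a repeat insert from the same application can be rejected.

```python
import hashlib
import sqlite3


def create_schema(conn):
    # Hypothetical metadata table; table and column names are illustrative only.
    conn.execute(
        """
        CREATE TABLE document_meta (
            doc_id INTEGER PRIMARY KEY,
            application_id TEXT NOT NULL,
            file_name TEXT NOT NULL,
            content_sha256 TEXT NOT NULL,
            UNIQUE (application_id, content_sha256)
        )
        """
    )


def register_document(conn, application_id, file_name, content):
    """Record a document's metadata.

    Returns False when the same application has already stored identical bytes.
    """
    digest = hashlib.sha256(content).hexdigest()
    try:
        # The connection context manager commits on success, rolls back on error.
        with conn:
            conn.execute(
                "INSERT INTO document_meta "
                "(application_id, file_name, content_sha256) VALUES (?, ?, ?)",
                (application_id, file_name, digest),
            )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint fired: same application, same content.
        return False
```

    Scoping the uniqueness rule to the application ID is a design choice in this sketch: two different modules may legitimately store the same file, while the same module storing it twice is flagged.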

  • Suggestions? Not as such. Metadata is really going to be the only thing you can do, so it's a good place to focus.

  • What type of files are you doing this for? Plain text, CSV, TSV, PDF, what?

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


