Processing millions of documents in batch with different applications

  • I am working on an eDiscovery-related database. I have different applications for processing a single document, and I have to insert/update document properties in the database. The first application inserts a record for each document, and then the other applications update that same document record with other properties. All applications run in parallel.

    We want to use SQL Server 2005. What we are planning is that each application user will maintain a local database in SQL Server Express, and after processing completes for a given batch of documents, the user will push the updates to the master database on the main SQL Server. Every user will follow the same process for each application on the main server.

    Please advise on setting up the structure for SQL Server.

  • You've asked an incredibly open-ended question here. If you aren't sure how to set this up, you really need to pay for consulting; this is beyond the scope of free help.

    A couple of things for you to consider/answer.

    Why use local databases? SQL Server on a server can handle the load of concurrent users itself. A well designed schema would not necessarily block users from updating the same data in a document.

    Is the document one large item? Or are you breaking a "document" into separate fields of data that can be edited?

    If you use multiple databases, you have to sync things up at some point. That can be tricky when it comes to handling conflicts, especially in a batch mode. What happens if two people have edited the same information?

  • Alright, I started to reply to this post before Steve's reply and have to edit much of the content since it would be repetitive.

    As Steve already mentioned, this is a topic that needs more knowledge about the database you have and the document you mentioned. What is the primary usage of that document? And if multiple users maintain local databases and then sync them, how would you determine which copy is the original / final one at the end?

    SQL Server does provide this capability (Merge Replication or Peer-to-Peer Replication), but we need to know more to give advice. Even then, using those features straight out of the box may not be enough; you may need to build on top of them.


    Bru Medishetty

    Blog -- LearnSQLWithBru

    Join on Facebook Page Facebook.com/LearnSQLWithBru

    Twitter -- BruMedishetty

  • Why use local databases? SQL Server on a server can handle the load of concurrent users itself. A well designed schema would not necessarily block users from updating the same data in a document.

    There is no valid way to do this if the local copies are updated locally; the only option is a central ADO.NET DataSet for inserts and updates. I answered a similar question last year: the rows can add up very quickly based on the number of local copies.

    Kind regards,
    Gift Peddie

  • Why use local databases? SQL Server on a server can handle the load of concurrent users itself. A well designed schema would not necessarily block users from updating the same data in a document.

    ----- If we update and insert records from all applications and users concurrently, then it will take time to retrieve and insert/update. Can we use Integration Services (SSIS) for moving data from the local DB to the main server DB?

    Is the document one large item? Or are you breaking a "document" into separate fields of data that can be edited?

    ----- We do not want to store the document itself. We only extract properties such as author, to, cc, bcc, subject, etc. using the applications and store them in the database.

    If you use multiple databases, you have to sync things up at some point. That can be tricky in trying to handle conflicts. Especially in a batch

    -----That is right
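    To make that concrete, here is a minimal sketch of the single-central-table approach the thread keeps circling back to: the first application inserts the base record for a document, and later applications update other properties of the same row. Python with SQLite stands in for SQL Server here, and all table and column names are hypothetical.

    ```python
    import sqlite3

    # Hypothetical schema: one central table holds the extracted properties
    # (author, to, cc, bcc, subject, ...) rather than the documents themselves.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE DocumentProperties (
            doc_id   INTEGER PRIMARY KEY,
            author   TEXT,
            mail_to  TEXT,   -- "to" is a reserved word, hence mail_to
            mail_cc  TEXT,
            mail_bcc TEXT,
            subject  TEXT
        )
    """)

    # The first application inserts a record for each document...
    conn.execute(
        "INSERT INTO DocumentProperties (doc_id, author) VALUES (?, ?)",
        (1, "Jane Doe"),
    )

    # ...and later applications fill in the remaining properties of the
    # same row. With row-level locking on the server, these updates need
    # not block each other as long as they touch different rows.
    conn.execute(
        "UPDATE DocumentProperties SET subject = ?, mail_to = ? WHERE doc_id = ?",
        ("Quarterly report", "team@example.com", 1),
    )

    row = conn.execute(
        "SELECT author, subject FROM DocumentProperties WHERE doc_id = 1"
    ).fetchone()
    print(row)  # ('Jane Doe', 'Quarterly report')
    ```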

  • anilp (11/12/2009)


    Why use local databases? SQL Server on a server can handle the load of concurrent users itself. A well designed schema would not necessarily block users from updating the same data in a document.

    ----- If we update and insert records from all applications and users concurrently, then it will take time to retrieve and insert/update. Can we use Integration Services (SSIS) for moving data from the local DB to the main server DB?

    Is the document one large item? Or are you breaking a "document" into separate fields of data that can be edited?

    ----- We do not want to store the document itself. We only extract properties such as author, to, cc, bcc, subject, etc. using the applications and store them in the database.

    If you use multiple databases, you have to sync things up at some point. That can be tricky in trying to handle conflicts. Especially in a batch

    -----That is right

    SSIS may not be the right solution for this. If you are not using a central DataSet, the only other option is replication, because SQL Server Express can be a subscriber in all types of replication.

    Kind regards,
    Gift Peddie

  • #1 - silly argument. SQL Server is built for this. On any modern network the time between server and client is trivial, and local databases are accessed the same way. You don't have network traffic, but that's usually noise compared to everything else.

    #2 - good

    #3 - SSIS doesn't handle conflicts. It's easier to do this in real time and pop a note to someone that the data changed.

  • I guess the whole application here will become too complicated for the problem that you have, anilp.

    What I would suggest here is a single database with tables that keep the document properties. You would use optimistic concurrency control. From Books Online (BOL):

    In optimistic concurrency control, users do not lock data when they read it. When a user updates data, the system checks to see if another user changed the data after it was read. If another user updated the data, an error is raised. Typically, the user receiving the error rolls back the transaction and starts over. This is called optimistic because it is mainly used in environments where there is low contention for data, and where the cost of occasionally rolling back a transaction is lower than the cost of locking data when read.

    In other words, if all users are connected to the same database, you have to check whether the document they are working on has been updated by someone else and take appropriate action - usually you would cancel the operation and ask the user whether he/she wants to overwrite, merge, or discard the changes. This is pretty easy. If users work in batches, then depending on the concept of the 'batch', you would roll back either the whole batch or only the individual documents that report conflicts.

    Usually optimistic concurrency control is implemented by means of some sort of time stamp applied to each row. You can use the rowversion data type (which is a synonym for timestamp) for this.
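    A minimal sketch of that pattern, using Python with SQLite in place of SQL Server and a plain integer version column emulating rowversion (all names here are hypothetical): each writer remembers the version it read, and the UPDATE only succeeds if that version is still current.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE DocumentProperties (
            doc_id      INTEGER PRIMARY KEY,
            subject     TEXT,
            row_version INTEGER NOT NULL DEFAULT 0
        )
    """)
    conn.execute("INSERT INTO DocumentProperties (doc_id, subject) VALUES (1, 'original')")

    def update_subject(doc_id, new_subject, version_read):
        # The UPDATE matches only if no one changed the row since it was
        # read; otherwise rowcount is 0 and the caller must re-read and retry.
        cur = conn.execute(
            """UPDATE DocumentProperties
               SET subject = ?, row_version = row_version + 1
               WHERE doc_id = ? AND row_version = ?""",
            (new_subject, doc_id, version_read),
        )
        return cur.rowcount == 1

    # Two users read the same row; both see row_version = 0.
    version_seen = conn.execute(
        "SELECT row_version FROM DocumentProperties WHERE doc_id = 1"
    ).fetchone()[0]

    first = update_subject(1, "edit by user A", version_seen)   # succeeds
    second = update_subject(1, "edit by user B", version_seen)  # stale version: conflict
    print(first, second)  # True False
    ```

    With SQL Server's actual rowversion column the server maintains the value automatically, so the `row_version = row_version + 1` step in this sketch is not needed there.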

    Using additional databases and some synchronization between them and the central database would cost a lot of effort in development, deployment, and maintenance.

    Regards

    Piotr

    ...and your only reply is slàinte mhath
