scrape data from multiple DB servers into centralized DB

  • Hello,

    I need some help.

    I have 8 DB servers that collect data. Every 5 minutes I need to scrape those 8 DB servers into one centralized DB (a 9th server). This is only a one-table operation: basically, scrape data from 8 tables into 1 main table.

    1. I need to be able to select only unique rows that have not been selected before (maybe delete previously inserted rows).

    2. I need to keep those 8 DB servers available for new inserts (I can't lock the table completely).

    Any suggestions on how to perform this task effectively?

    Thank you all in advance.

  • Dunno about most... I'd simply make linked servers and have a proc read the data in. You wouldn't want to make a single view, because if one of the servers were down, the whole view would fail.
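
    A minimal sketch of what such a proc might look like, assuming linked servers named SAT1 through SAT8 and a common table layout (all object names here are hypothetical):

    CREATE PROCEDURE dbo.CollectFromSatellites
    AS
    BEGIN
        -- One INSERT per satellite via four-part names, so a single
        -- down server fails only its own step, not the whole pull.
        INSERT INTO dbo.MainTable (colA, colB, colC)
        SELECT colA, colB, colC
        FROM SAT1.SourceDb.dbo.TableA;

        INSERT INTO dbo.MainTable (colA, colB, colC)
        SELECT colA, colB, colC
        FROM SAT2.SourceDb.dbo.TableA;

        -- ...repeat for SAT3 through SAT8...
    END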

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Those 8 servers are very busy; they get hundreds of thousands of hits (INSERTs) in an hour.

    What about the performance of the main server and the other 8 linked ones? I am not a DBA but a developer, so I don't know how linking the servers will affect overall performance.

  • Ok... another way to do it would be to export the data to tab-delimited files on some common file share and use BCP or BULK INSERT to import the data to the main server. That way, each of the satellite servers would be able to mark its own rows as "exported" so it wouldn't re-export them. It's a bit more complicated, but it works very well in an "intense" environment because it won't cause locking and is actually very fast.
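
    Roughly, each satellite would run bcp from the command line, and the main server would load the resulting file with BULK INSERT. A sketch, with the share path, server name, and status column purely illustrative:

    bcp "SELECT colA, colB, colC FROM SourceDb.dbo.TableA WHERE ExportStatus = 'INPROCESS'" queryout \\fileshare\exports\sat1.tsv -c -S Satellite1 -T

    -- On the main server:
    BULK INSERT dbo.MainTable
    FROM '\\fileshare\exports\sat1.tsv'
    WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n', TABLOCK);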

    --Jeff Moden



  • That sounds like an idea. Here are the questions, though:

    Is the table locked during the export?

    What if at that particular moment I have hundreds of new inserts?

    What would be the best way to export into a file?

    How do I mark exported rows without affecting newly inserted ones?

    Thanks

  • I had to do this for a project before, and each of the remote servers was over a slow WAN link. Here's how I was able to do it best (I assume the table in each remote db is TableA, and all rows are dumped into AccumTable):

    On each remote server:

    - Create a job to run the following steps every X minutes.

    - In the 1st sproc, rename TableA to TableA_temp and build a new TableA. Renaming is quick, and the new table should be ready for the next transactions with very little interruption.

    - In the 2nd sproc, insert into AccumTable on the master db server from TableA_temp. Because of the large volumes, I put this in a While loop, batching 1,000 rows at a time using Select Top(1000), until TableA_temp was empty.

    - Drop table TableA_temp.

    I found it better to push the data from each server rather than have the "master" server poll and collect it. One key element was being able to rename the table and build a new one, so that the data "push" was not affected by (or interfering with) new data coming in.
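
    For the rename-and-rebuild step, a minimal sketch, assuming a simple four-column dump table (the column definitions are hypothetical):

    -- 1st sproc: swap the live table out, then recreate an empty one
    EXEC sp_rename 'dbo.TableA', 'TableA_temp';

    CREATE TABLE dbo.TableA
    (
        PKid INT IDENTITY(1,1) PRIMARY KEY,
        colA VARCHAR(50),
        colB VARCHAR(50),
        colC DATETIME
    );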

    Hope this helps



    Mark

  • Mark,

    How does renaming TableA affect inserts? As I have mentioned before, this table is being inserted into constantly. I am afraid that renaming the table while someone is inserting data will result in an error on the user side.

    Also, will "pushing" the data require the master db to be linked to each of the satellite servers?

  • How does renaming TableA affect inserts?

    It may affect them. However, in the system I worked on, testing showed this approach to be the least likely to interfere with the inserts, as the rename and the new table build would take less than 0.10 second. A significant part of what made it successful was that the programs loading data into the table were written properly, without a constant connection; each transaction that added data did not leave any lock or connection on the table. You will need to test to make sure this works best in your environment.

    I am afraid that renaming the table while someone is inserting data will result in an error on the user side.

    It should not. SQL Server (depending on the client connection and its code) can be fairly tolerant. Wrapping the table rename and table build together in a sproc wraps them in an implicit transaction. We did quite a bit of stress testing, and it seemed tolerant and did not interfere with individual inserts. One caveat I remember: you have to drop and rebuild any indexes, but as this is basically a dump table, you should not need any indexes on it anyway.

    Will "pushing" the data require the master db to be linked to each of the satellite servers?

    Vice versa. Each satellite server will need a linked server pointing to the master db. You could use OpenDataSource, but for the frequency of these queries it would likely be better to have a linked server.
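
    The one-time setup on each satellite might look like this (the master's network name, MASTERSRV, is illustrative):

    -- Run once on each satellite server
    EXEC sp_addlinkedserver
        @server = N'MASTERSRV',
        @srvproduct = N'SQL Server';

    -- Map logins as needed with sp_addlinkedsrvlogin; after that,
    -- four-part names like MASTERSRV.db.dbo.AccumTable resolve in T-SQL.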



    Mark

  • Mark,

    Just to make sure I understand everything:

    1. "rebuild a new TableA" - is this just a CREATE TABLE call, or is there anything more efficient that can be used to rebuild the table?

    2. Will "Select Top(1000)" always pull only those rows that have not been pulled before?

    3. I need to create a link to the master db server from each satellite server (i.e., add the master under Linked Servers on each satellite). This means the satellite servers will push data out to the master. Is that correct?

    How do you do stress testing on SQL Server? Can you suggest any software?

    Thanks

  • 1. "rebuild a new TableA" - this is just a CREATE TABLE call? or is there anything more efficient that can be used to rebuild table?

    Yes, just a Create Table statement. I tried an alternate using Select Into, but that was slower (I assume because it has to parse the Select part of the query first).

    2. "Select Top(1000)" will always pull only those rows that have not been pulled before?

    Well, there's more to it than just my "shortcut" note. You will need an Order By, then followed by a Delete statement, roughly like this (assuming PKid is Primary Key on TableA):

    -- Batch-move rows to the master server, 1,000 at a time,
    -- until TableA_temp is empty.
    WHILE EXISTS (SELECT 1 FROM TableA_temp)
    BEGIN
        INSERT INTO masterserver.db.dbo.AccumTable (colA, colB, ...)
        SELECT TOP (1000) colA, colB, ...
        FROM TableA_temp
        ORDER BY PKid;

        -- Delete the same 1,000 rows just copied (same ORDER BY)
        DELETE FROM TableA_temp
        WHERE PKid IN (SELECT TOP (1000) PKid
                       FROM TableA_temp
                       ORDER BY PKid);
    END

    Of course, you will need to add some error-handling logic around that as well.

    3. I need to create a link to the master db server from each satellite server. This means the satellite servers will push data out to the master. Is that correct?

    Yes. I found it more effective and efficient to have the satellite servers push data to the master, rather than have the master server poll and pull data from the satellites.

    How do you do stress testing on SQL Server? Can you suggest any software?

    What we did was just have a bunch of VBScripts running on a bunch of machines, including the server itself. Making them runnable as command scripts let us open dozens of command windows at once on each machine; with about 10-20 workstations, that was enough to emulate 1,000 or so simultaneous users.

    Hope this helps



    Mark

  • 1. Is the table locked during the export?

    No.

    2. What if at that particular moment I have hundreds of new inserts?

    Doesn't matter... mark the rows to export as "INPROCESS" and only transfer those rows. When done, mark them as "EXPORTED".

    3. What would be the best way to export into a file?

    BCP.

    4. How do I mark exported rows without affecting newly inserted ones?

    Doesn't matter... see item 2.
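
    A sketch of that marking pattern, assuming a status column on the dump table (ExportStatus is a hypothetical name):

    -- Claim the current batch; rows inserted afterward stay NULL
    UPDATE dbo.TableA
    SET ExportStatus = 'INPROCESS'
    WHERE ExportStatus IS NULL;

    -- (BCP out only the rows WHERE ExportStatus = 'INPROCESS')

    -- After a successful export, finalize the batch
    UPDATE dbo.TableA
    SET ExportStatus = 'EXPORTED'
    WHERE ExportStatus = 'INPROCESS';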

    If you want to export all rows and not keep them, then Mark's suggestion of a table rename would probably be better. Table renames take less than 65 milliseconds, so there's no interference with GUIs.

    HOWEVER, it may be best to do it a slightly different way... Instead of a table rename, have the GUI point to a "surrogate view" (in SQL Server 2005, you could use a "synonym" for this). You would have Table1 and Table2... you would simply alter the view so the GUI points to one of the tables and your proc points to the other. When it came time to do the download, you would switch them. Of course, the table you just pointed the GUI at would have been emptied beforehand using TRUNCATE.

    The advantage of that method is that you get to use Alter View instead of Create Table... all permissions/grants stay in play for the two tables and the view. You would also use TRUNCATE instead of dropping the table or doing some gosh-awful DELETE on a high-volume OLTP table.
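
    A minimal sketch of the view swap, with all names illustrative (a view over a single table is insertable in SQL Server, so the GUI can keep writing through it):

    -- The GUI always inserts through dbo.CurrentDump
    ALTER VIEW dbo.CurrentDump
    AS
    SELECT colA, colB, colC
    FROM dbo.Table2;  -- was Table1; new inserts now land in Table2
    GO

    -- Table1 is now quiet: export its rows, then empty it for the next swap
    TRUNCATE TABLE dbo.Table1;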

    --Jeff Moden



  • Thanks, Jeff.

    Here is the thing: I can't use a "view". At least, I don't think I can. This table is a WRITE table; the user never views the data in it.

    Any time the user hits the page, a row is inserted into this table.

    Somewhere in between inserts (as I said, potentially 100K+ in a few minutes' time span), I need to extract the data into the master table on another server.

    I don't really care to keep exported data, so I don't think I need to mark rows with Mark's suggestion.

  • Jeff: interesting option you raise with using synonyms. My project was long before SQL 2005, so that option was not available at the time. But it would be interesting to test it against these other methods. Another possible tool in the toolbox.

    I was assuming from Liz's statement ("...gets hundreds of thousands of hits (INSERTs) in an hour...") that these were not GUI applications loading these tables, but some sort of machine-driven data entry such as process controllers, temperature monitors, manufacturing monitors, or the like. And I hope these tables are basically "dump" tables, where raw data is dumped in and very few application queries run against them, hence little need for indexes and stored procedures on these tables. The "collection" Liz is trying to do will pull this raw data together and summarize it for reporting or other applications. That was the basis for my previous project, and I was assuming similar circumstances for Liz's project.

    Liz: you could use a synonym as an alias for the application to write to. Instead of renaming and dropping tables, have two tables in place, TableA1 and TableA2, with a synonym (TableA) pointing to TableA1. When you want to grab the data, drop and recreate the synonym TableA to point to TableA2, so that any new inserts go to TableA2. You now have TableA1 unhindered and can push its data to the master db. When done, truncate TableA1. On your next pass, change the synonym back to TableA1 and process the data in TableA2.

    It might be easier and faster than renaming the table, especially if you have any indexes (since indexes would have to be renamed as well), and truncating the entire table is better than deleting. You still have a slight risk that an insert may fail while you drop and recreate the synonym (during the moment the object name does not exist), but it should be not much different from renaming and creating a new table.
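
    A sketch of that swap (SQL Server 2005 and up; there is no ALTER SYNONYM, hence the drop and recreate; all names illustrative):

    -- The application always writes to dbo.TableA, which is a synonym
    DROP SYNONYM dbo.TableA;
    CREATE SYNONYM dbo.TableA FOR dbo.TableA2;  -- new inserts now land in TableA2

    -- TableA1 is now quiet: push its rows to the master db, then empty it
    TRUNCATE TABLE dbo.TableA1;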

    Again, Jeff, interesting idea.



    Mark

  • Mark,

    You're absolutely right; those are just dump tables. No querying is done on them. It's really only 4 columns, 3 of which need to be dumped into the master db.

    Here is another thing: I need to be able to insert only unique records into the master db (those dump tables may potentially have dups).

  • You can check for and remove duplicates as you build the insert statements that push the data. Since the table is free from the new inserts coming in (whether you use the renaming method or the synonym method), you can easily identify and drop duplicate rows.
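
    For example, a sketch of the push with duplicates collapsed (column names hypothetical):

    -- Duplicates within the quiet table collapse here
    INSERT INTO masterserver.db.dbo.AccumTable (colA, colB, colC)
    SELECT DISTINCT colA, colB, colC
    FROM TableA_temp;

    -- If duplicates can span batches, add a NOT EXISTS check
    -- against AccumTable before inserting.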

    And note the addition to my previous post about the possibilities of using the synonyms that Jeff brought up.



    Mark
