When should I use distributed queries?

  • The short: So I'm trying to determine whether or not to use distributed queries across say 50 sql servers. vs. gathering all the data into on sql server.

    The long: I've got about 10 million+ rows of data to contend with coming from our media servers every month. In the past we've used some in house programs to aggregate the logs up to a beefed up SQL 2000 server. This works ok. Sometimes it's pretty hard on the server as you can imagine.

    So I have the opportunity to spread the load out so that our point of delivery servers parse logs directly to a local sql instance and then on our report system I create a big distributed table which runs queries across all the other servers.

    In my preliminary testing this looks to be ok but I'm interested in hearing about the trade-offs.

    Our reporting data is strictly historical and we have resources to spare on our many "delivery servers". I plan to use SQL 2005 Standard as the master, with SQL Express running on the delivery systems. I've already tested this and it seems to work well with a couple tables unioned together, but I'm concerned with the scalability of that particular configuration.

    I have a few specific questions.

    1. At what point and in what way is having to query across 50 servers instead of 1 going to slow me down. How bad of a slow down.

    2. Is this what distributed queries are perfect for or am I barking up the wrong tree.

    3. If you had lot's of histrical data coming from dozens of servers with extra resources would you still prefer to aggregate the data onto one server and if so why?

    In all honesty I'd be glad to pay a real expert (someone with significant experience in distributed queries) to go through this with me on the phone. I've got a phone line and paypal account if you're interested.

  • Matthew,

    It's one thing to go to 2-3 servers, and another to go to 50. If you have 50, then it could easily be 30 or 100 in a year, so you want to plan for some scale.

    If I understand this, and please correct me, you have log data on 50 servers. You query this through a linked server to pull back to your central SQL Server, is that correct? Are the 50 SQL Servers?

    My approach has been to decouple things as much as possible and then roll up the decoupled data. So when I had about 200 instances to manage, I had each instance gather various data and store it locally. I could summarize as this was admin data (db size, backup size, config, etc.). Each server had a DTS package that then ran to send this data to the central server. The central server then could report on everything together.

    Disadvantages

    - Lots of packages, procs, tables, etc. to maintain. Make a change, needs to be pushed to all servers. It's doable, but you need a process here as well.

    Advantages

    - If a central job fails with server 10, I might not get data from servers 11-50 that day. With decoupled jobs, at least the data is on the other servers and that server has to fail to not send the data over.

    - Always can go to the local server for summaries if I need to, even if the central server is down.

    I'd personally have each server roll stuff up and store it, sending the summary to the central server (push), rather than having the central server pull data. To me it makes my life easier over all. Lots of data transfer, little development work (relatively). If the media servers aren't smart enough to roll up separately, I'd honestly build separate packages in SSIS to pull the data with separate jobs. It's a lot of jobs to manage, but it's easy to add more, it's also easy to have one fail and not affect others. You can even build two packages, one that loops through a list of servers, and then passes a server name to the 2nd that then goes to get the data. Makes things simpler (to understand), but more development work, however it's less maintenance. If one's down, you log an error and move on again.

  • Yea, that's pretty much what we're doing now.

    The problem is the amount of data that winds up on the central server is just rediculous.

    The other problem is we can't roll it up as each row is unique. There are some fields that can be rolled up for sure, but many others that cannot. Currently we're at about 90 million rows with 50+ columns.

    It's pretty nasty to maage even with good indexes etc. As I mentioned it's been runinng for a few years, and it's ok. But we need a better solution going forward. We also have some processing that we do when the rows come in which is also fairly intense, for example we do some geo tagging where we tag city,state, country to the column etc.

    We are currently pushing the data up to the central server and we have even created a scheduling system to prevent collisions etc.

    The bottom line is that we have 1 big table at the end of the day and probably need to keep it that way (at least for the forseable future). I was thinking that using a distributed query we'd at least be able to spread both the processing load and the query load out.

    I may come up with ways to reduce the scope of the distributed queries based on knowing which servers will have which data.

    Aside from the management issue, do you think on balance I'm looking at a performance hit by querying 50 servers rather than 1? If each of the 50 servers has 10,000 rows relevant to my query am I better off having it all in one slow table?

  • Yes I'm planning on installing SQL Express or mysql on each of the 50 servers. Using 1 SQL 2005 Standard as the master (where the distributed query will run).

  • If you would go with the Linked Servers scenario, what if one of the 50 servers is currently down or some maintenance is being performed on it?

    You've got large tables with 90 million rows and analysis is becoming a nightmare, maybe you shoud look into using Analysis Services (part of SQL Server 2005 Standard Edition) and if you will need more detailed data than the aggregated values there is always functionality called 'drillthrough'.

    Cheers, Max

    Check-out free open-source utility Sql Server Replication Explorer[/url]

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply