Short-life data structure suggestions

  • We plan on receiving some high volume data (millions of entries per day) but only need to keep the information for a period of time. I am interested in your feedback regarding experience with the following - or any suggestions that are not listed:

    #1) Is it better to use table/month or database/month. Since we will be purging after N months, we want that information separate from the rest to avoid conflict (and double activity) when deletions begin. This does cause some headache when it comes to searching, but that will be coded into the middle tier.

    #2) There are lookup tables that will not be purged. Is it faster to replicate the information into each database, or just provide a separate database with the lookup information? In either case, we plan to have a copy of the lookup information on each SQL machine to avoid performing lookups across machines.

    Obviously, I have some opinions but only have test data with < 10 million rows in my tables which, I fear, may not give me the whole picture.

    Any comments would be welcome.

    Thanks

    Guarddata-

  • Depends on the performance. I'd simulate the 10M rows and see, but my thought is one table/month. If the db performance is too low, separate DBs or even filegroups can help.

    Steve Jones

    sjones@sqlservercentral.com

    http://www.sqlservercentral.com/columnists/sjones

    http://www.dkranch.net

  • Have you considered about Updatable Partitioned View?

  • Thanks for the input. Any preferences regarding the lookup tables? I have generally worked on projects where they were in the same database but have had some cases where they were in a separate database on the same machine. I know there is a performance hit to have them on a separate machine. But what about in a separate database?

    Guarddata-

  • If they rarely change, I'd put them in each db and then use an app, script, or possible replication to update them.

    If they change often, I'd probably use a master db and repl to each db. Keep the app simple by storing it in each db.

    Steve Jones

    sjones@sqlservercentral.com

    http://www.sqlservercentral.com/columnists/sjones

    http://www.dkranch.net

  • Don't see benefits by place those lookup tables in separate database. It will create more issues in system maintenance, administration and operations. For example, if you have problem to bring the database up that lookup tables are in, your application woundn't work too.

    Edited by - Allen_Cui on 02/07/2003 09:51:00 AM

  • i would go for simplicity first unless there is some evidence it will not fit the requirements. keep it all in one db.

    1) delete the data you dont need. delete the data in multiple transactions to improve performance

    2) bcp in the new data

    a few million rows is easy for sql server to handle specially on high-end hardware. take advantage of sql server's capabilities and keep things simple

  • Thanks for the feedback.

    Guarddata-

  • Touching on subject #1 again, here's an idea I've heard once or twice--never done it myself, though I'd like to try it some day.

    To recap, the base idea is:

    -- You get data in every month

    -- You have to keep N months worth of data live on the syatem

    -- After N months have passed, you have to delete the old data.

    The idea I heard is based on creating one table per month's worth of data (with names like Data_0302, though Data_200302 saves you problems in a century or so).

    Create the tables, and then create a view linking all the tables. Perhaps like, select 'Jan 1, 2003', * from Data_200301

    union select 'Feb 1, 2003', * from Data_200302

    union select etc.

    (This is just an outline, more thought is required.)

    Time to add a month? Create and load it, then modify the view and it's available. Time to drop a month? Modify the view, the data's off line, and you can archive and/or drop it at your liesure.

    ...and this is why I'd advocate storing this data in tables, rather than in databases.

    Philip

  • Unions should always be

    UNION ALL

    unless you want the effect of Select DISTINCT.

    Just thought I'd mention it because someone had a bit of code with Union. It can effect performance.

  • Using multiple DBs versus a singular DB will affect SQL Server's memory and hence it's performance. I would recommend testing both multiple DBs and a singular DB.

    If you're going to use multiple DBs, I'd put the lookup tables in a separate DB, which could then be put into read-only mode (no locks to maintain plus no replication or duplication issues). Again, I'd recommend (performance) testing to determine the best solution.

    As a side note, if the data is stored in the tables/databases in chunks that are the same size you will be deleting, purging becomes very simple (alter the views referencing that object, then drop that table/database). That's very efficient compared to a DELETE. Of course, if you *have* to purge one week's worth of data in a table/database that houses a month's worth of data, that suggestion is worthless.


    David R Buckingham, MCDBA,MCSA,MCP

  • Article on partioning:

    http://sqlteam.com/item.asp?ItemID=684

    Then OLAP:

    If you have the Enterprise version of SQL Server, consider dividing your OLAP cubes into partitions to boost performance. A partition is just a separately managed unit of storage, and each partition can have its own storage mode and aggregation level. Partitions provide several performance benefits, including:

    You can place cube partitions on different hard disk arrays in order to better distribute the workload across multiple I/O devices.

    You can divide cubes into partitions that represent how the data in the cube is used. For example, say you have five years of data available, but that 80% of the queries are against the most recent year, and 20% are against the other four years. Put the most recent year in its own partition, and the remaining four years in a second partition. This way, you can select different aggregations for each partition, which will affect both performance and the size of the cube.

    A partition is the unit of granularity for threading during a cube load. This means, that on multiple CPU servers, adding partitions speeds up cube load processing time.

    http://www.sql-server-performance.com/olap_performance.asp

  • I do appreciate all the input. Looks like I need to spend a little time to learn more about OLAP.

    Guarddata-

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply