alternatives to SELECT DISTINCT

  • Here's the situation: I have inherited a few databases, one of which stores statistics on monitored devices (network throughput, cpu utilization, etc).

    The statistics table typically holds about 113 million rows at any given time.

    I started going through some of the existing jobs and stored procedures to see where I can speed things up a bit, and there's one remaining area where I'm hoping for some insight.

    One of the nightly jobs that runs populates a list of currently available statistics...it chews through the statistics database and runs a select distinct to find each unique interface (the most granular identifier for a given statistic)/day combination and populates a table of available daily statistics (a list of every monitored stat for every interface and day that exists in the database).

    On the quad-Xeon (2.8GHz...4GB RAM) server the database sits on, the query takes several minutes to run, and while this happens at 4AM, I'd like to turn it into something a bit more real-time.

    Most of the ideas I've had for improving the situation have to do with adding some logic on the front-end (updating a timestamp in a lookup table when a stat is inserted into the database, for example), but I was wondering if anyone out there had faced a similar situation and come up with something speedier, but with similar functionality to "SELECT DISTINCT".

    To me, this seems to be a case where select distinct is used for a reason that makes sense, but I'd really like to be able to do it faster, If I can.

    Any ideas, or should I just do the processing on the front end?

  • typically you'd want to check for useful indexes , usually distinct implies duplicates, normalisation might help.

    [font="Comic Sans MS"]The GrumpyOldDBA[/font]
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

  • I take that these statisticts are cumulative, each day more records are added for the same device/interface. If this is the case then select distinct would have to read the entire table each time, growing gradually slower each day. After all how does Sql Server know if a record out of 113MM is or is not distinct unless it's looked at it?

    As the previous commenter mentioned, creating an index the would organinize the unique information for the device/interfaces you're interested in and then using the max() function would allow SqlServer to just skim the top of these 113MM records for the information you need.  Once in found the max record on the index for one device/interface it can skip all of them many, (thousands, hundreds of thousands?), that come after it, and move to the spot in the index where the next record type occurs.

    Good Hunting

  • That was my first thought, but the two key identifiers can't be normalized...a date-time stamp (down to the millisecond) and the interface ID will absolutely be present for each entry, and there will absolutely be more than one for each day per interface.

     

    I do have indexes, but I probably need to revisit how well they're working.

  • 113 Million is the typical number of records...pruning occurs weekly for records more than x months old, so the database stays about the same size unless we add more interfaces to monitor.  How does SQL server know if a record out of 113MM is distinct or not is precisely the question I am asking myself and anyone else with an idea.

    I was going to say something about wishing I could use MAX() to create the list of statistics, but I can't...

    Then I had another thought (and thanks to you for mentioning MAX() and indexes).  If I create a basic table of statistics, period (just one entry per unique interface, not one for each interface/day combo), I can test the speed of a much more focused query after that to return just the days that a particular interface had/has stats.

    If that runs fast enough (any thoughts on the performance implications of indexing or not indexing 113 million dates...would reindexing take longer than the existing job takes to do the select distinct dance?), I could use it in the actual application as it runs, instead of as a job.

     

    I'm off to do some benchmarking...

  • If you are only working with the last day or week worth of data you could setup a trigger that copies some of the rows being inserted into the statistics table to a table specific for that day. You could then rotate out the older, lesser used tables.

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply