Odd query performance behavior on a very large 40 TB partitioned table in a distributed view, maybe it's tipping?

  • jallmond (2013-07-19)


    I will create the clustered index as you suggest for the August shard.

    In the long run, you probably want the same across the board, but of course it will take a maintenance window to make those changes.

    Given we do a lot of date range based queries, is it more optimal to create a clustered primary key as:

    ALTER TABLE [dbo].[tblDWProbe] ADD CONSTRAINT [PK_tblDWProbe] PRIMARY KEY CLUSTERED
    (
        [reportStamp] ASC,
        [dwProbeId] ASC
    )

    Or

    ALTER TABLE [dbo].[tblDWProbe] ADD CONSTRAINT [PK_tblDWProbe] PRIMARY KEY CLUSTERED
    (
        [dwProbeId] ASC,
        [reportStamp] ASC
    )

    Or does the key column order not really matter when used in scans?

    In scans it does not matter, but you don't want scans, you want seeks! If your typical query is "give me all rows between certain timestamps", then you want reportStamp first. If you look up by dwProbeId, then you should have an index where dwProbeId is first. But you don't seem to have an index like that today.
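
    For example, a seek-friendly nonclustered index for dwProbeId lookups could look like the sketch below (the index name is made up; the column names are taken from the statements above):

    CREATE NONCLUSTERED INDEX IX_tblDWProbe_dwProbeId
        ON [dbo].[tblDWProbe] ([dwProbeId] ASC);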

    The main application is constantly parsing binary files and inserting into this table. I have been told that a SELECT under READ COMMITTED blocks the inserts.

    And? With the correct index, this is a single-row read. You can take the blocking, no problem. Better to be blocked than to get incorrect results or error 601.

    This is an indexing nuance I am not familiar with. So that I understand correctly: are you saying that for composite indexes, key columns used in equality comparisons are more efficient when listed first?

    Yes. Just consider that you are looking at an output with thousands of rows, and you want to find the rows for a certain vehicle, parameter and date range. If the data is sorted by vehicle, parameter and date, the rows you want sit in a single contiguous interval. If the data is sorted by vehicle, date and parameter, you need to filter each date value for the parameter, which takes longer and is more error-prone.
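
    As an illustration, an index that puts the equality columns first and the range column last might look like this (just a sketch; vehicleId and parameterId are assumed column names):

    CREATE NONCLUSTERED INDEX IX_tblDWProbe_vehicle_parameter_reportStamp
        ON [dbo].[tblDWProbe] ([vehicleId], [parameterId], [reportStamp]);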

    For our business use case we sometimes query for vehicleId within a date range (usually the last 7 days), never for a specific date, and without also knowing the parameterId.

    Then you also need an index on just vehicleId and reportStamp. Now, I don't know how many different parameter values there are for a specific vehicle on a certain date, so the exact impact is difficult to measure. But there can certainly be a problem, particularly if some parameter values come in frequently while others arrive only twice a day and you are looking for one of the latter.
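
    A sketch of that index and a matching "last 7 days for one vehicle" query, assuming vehicleId is a column on the table (the variable value is only an example):

    CREATE NONCLUSTERED INDEX IX_tblDWProbe_vehicle_reportStamp
        ON [dbo].[tblDWProbe] ([vehicleId], [reportStamp]);

    DECLARE @vehicleId int;
    SET @vehicleId = 42;  -- hypothetical id and type

    SELECT *
    FROM   [dbo].[tblDWProbe]
    WHERE  [vehicleId] = @vehicleId
      AND  [reportStamp] >= DATEADD(DAY, -7, GETDATE());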

    Also, do I need to explicitly declare the partition scheme when creating this index if the table is already partitioned? Finding conflicting advice online.

    So partitioned views, distributed or not, are funny creatures. Nowhere do you explicitly say "this is a partitioned view", nor does SQL Server give you an error message if you have messed up. You just create the tables and the views, and if you did not do everything right, SQL Server will simply handle the view as a regular view with no magic.

    Partitioned tables are another matter. Here you need to explicitly define a partition function and a partition scheme.
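
    A rough sketch of those pieces (the names, boundary values and the datetime type for reportStamp are made up for illustration). Once the table lives on the partition scheme, an index created without an ON clause is aligned with the table's scheme by default; stating the scheme explicitly just makes the intent visible:

    CREATE PARTITION FUNCTION pfReportStamp (datetime)
        AS RANGE RIGHT FOR VALUES ('2013-07-01', '2013-08-01', '2013-09-01');

    CREATE PARTITION SCHEME psReportStamp
        AS PARTITION pfReportStamp ALL TO ([PRIMARY]);

    -- Aligned with the table's partition scheme; the final ON clause is optional
    CREATE NONCLUSTERED INDEX IX_tblDWProbe_reportStamp
        ON [dbo].[tblDWProbe] ([reportStamp])
        ON psReportStamp ([reportStamp]);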

    Erland Sommarskog, SQL Server MVP, www.sommarskog.se

  • I think your problems might be associated with statistics.

    It seems that you have automatic statistics updating enabled. This means that when roughly 20% of the rows in a table have been modified, the statistics for the table are automatically updated.

    This update can take a lot of time for very large tables. This might be one reason why your query is sometimes slow.

    I suggest that you turn off automatic statistics updating for this table, and instead add a job that manually updates statistics every night (for the active shard).
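
    A sketch of that approach, assuming the shard table is dbo.tblDWProbe as in the earlier statements:

    -- Turn off automatic statistics updates for this table only
    EXEC sp_autostats 'dbo.tblDWProbe', 'OFF';

    -- Nightly job step: refresh the statistics on the active shard manually
    UPDATE STATISTICS [dbo].[tblDWProbe];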

