Problem with performance of Geospatial Query

  • I'm having an issue with the performance of what must be one of the most textbook uses of the geospatial support in SQL 2008 (I'm running SQL 2008 SP2 at the moment).

    All I'm trying to do is, with a list of stores (about 120) and a list of customers (about 2.3 million in real life), assign each customer their closest store. This is for a Data Warehouse query, so it has to process the whole batch each night: stores can be added at any time, which potentially affects any number of rows, so I can't just maintain all of this at time of entry.

    I'm unsure whether geospatial indexes can help with this as I effectively have to calculate the distance for everything all the time, meaning scans all round.

    The main problem is the CPU used by the STDistance function - I'm only selecting a top 1 in the end, but it looks like it has to produce a Cartesian product of all store/customer combinations before selecting that one out, even with geospatial indexes in place.

    I've set up a test harness to show the issue (I'm just using 100,000 customer rows here). I'm using CROSS APPLY as it was marginally the fastest, but I did try a cross join/ROW_NUMBER/CTE solution with the production version of the code (I'll add that as well later):

    USE tempdb
    GO

    --create Stores table
    CREATE TABLE #Stores (Store_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
                          Store_Location geography NOT NULL)

    --Create a spatial index (not sure how much it'll help for the small table)
    CREATE SPATIAL INDEX IDX_Stores_Spatial ON #Stores(Store_Location)
    WITH (GRIDS = (HIGH, HIGH, HIGH, HIGH))

    --create customers table
    CREATE TABLE #Customers (Customer_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
                             Customer_Location geography NOT NULL)

    --Create a spatial index for customers
    CREATE SPATIAL INDEX IDX_Customers_Spatial ON #Customers(Customer_Location)
    WITH (GRIDS = (HIGH, HIGH, HIGH, HIGH))

    --generate some random store locations roughly around the UK - some may be in the sea, but they can float
    INSERT INTO #Stores (Store_Location)
    SELECT TOP 120 geography::Point(
               50+((ABS(CHECKSUM(NEWID()))) %10000)/1000.0,   --random latitude roughly around UK
               -6.2+((ABS(CHECKSUM(NEWID()))) %10000)/1400.0, --random longitude roughly around UK
               4326)                                          --default SRID (metres)
    FROM master..syscolumns a, master..syscolumns b

    --generate some random customer locations roughly around the UK - some may be in the sea, but they can swim
    INSERT INTO #Customers (Customer_Location)
    SELECT TOP 100000 geography::Point(
               50+((ABS(CHECKSUM(NEWID()))) %10000)/1000.0,   --random latitude roughly around UK
               -6.2+((ABS(CHECKSUM(NEWID()))) %10000)/1400.0, --random longitude roughly around UK
               4326)                                          --default SRID (metres)
    FROM master..syscolumns a, master..syscolumns b

    SET STATISTICS TIME ON
    SET STATISTICS IO ON

    SELECT Customer_ID, Store_ID, Distance_In_Metres
    FROM #Customers c
    CROSS APPLY (SELECT TOP 1 Store_ID,
                        s.Store_Location.STDistance(c.Customer_Location) Distance_in_Metres
                 FROM #Stores s
                 ORDER BY s.Store_Location.STDistance(c.Customer_Location)
                ) nearest_store

    /*
    SQL Server Execution Times:
       CPU time = 0 ms, elapsed time = 0 ms.

    (100000 row(s) affected)

    Table '#Stores_____________________________________________________________________________________________________________00000000B291'. Scan count 100000, logical reads 300000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#Customers__________________________________________________________________________________________________________00000000B292'. Scan count 9, logical reads 485, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
       CPU time = 1425553 ms, elapsed time = 247163 ms.
    */

    As you can see, even with a parallel plan it's ending up with a massive CPU overhead and taking over 4 minutes just to do the 100,000 rows. At production volumes, this is easily 4 hours.

    Has anyone done anything similar, or does anyone have any optimisation suggestions? I could sacrifice a little accuracy for more performance. I haven't investigated using different SRIDs yet, but that's probably my next port of call.

    Edit: Attached execution plan.

  • Do you have an index in place? If so, start messing with the density on the index. I've found that changing it has an interesting impact. The key to the density is: are you filtering information in, or filtering information out, through the WHERE clause? Depending on which you're doing, you're going to get different behavior.

    If you read through my blog, I posted a bunch about getting the spatial queries to perform better. It's still a black art as far as I can tell.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Hi Grant,

    Thanks, I'll have a read through your blog and see if I can find anything that'll help.

    Yes, as with the test harness, I have a spatial index on each table with all the grid levels set to HIGH. I've also tried the MEDIUM grid density, but I simply can't get it to use the index at all. If I put an explicit hint for the index, I get an error that it can't produce a plan.
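    For reference, this is roughly the shape of the hint I'm attempting (using the test-harness names from above rather than the real tables), and it's this that errors out saying it can't produce a plan:

    --same query as the test harness, just forcing the spatial index
    SELECT Customer_ID, Store_ID, Distance_In_Metres
    FROM #Customers c
    CROSS APPLY (SELECT TOP 1 Store_ID,
                        s.Store_Location.STDistance(c.Customer_Location) Distance_in_Metres
                 FROM #Stores s WITH (INDEX(IDX_Stores_Spatial))
                 ORDER BY s.Store_Location.STDistance(c.Customer_Location)
                ) nearest_store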

    Thanks,

    Howard

  • Hmmm.... I'd try low too. The density stuff seems counter-intuitive sometimes.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Hi Grant,

    Ok, I've just tried that and no change to the plan or speed. I think changing the densities isn't going to help if it can't produce a plan using that index even if I force it.

    I found this, which looks like what I'm trying to do, although it's just a working article for Denali:

    http://msdn.microsoft.com/en-us/library/ff929109%28v=sql.110%29.aspx

    I'll have a play around and see if I can get the APPLY to meet all of those criteria...

  • OK, so it looks like, in 2008 at least, the spatial index only supports STDistance in a WHERE clause - if I put an arbitrary maximum radius around each store in the APPLY, it'll use the index (if I force it with a hint) to reduce the result set. Unfortunately, I have no maximum radius to apply.
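    To illustrate, something along these lines does get a plan with the index hint - the 50km cap is just an arbitrary figure for the sake of the example (not something I can actually use), and any customer without a store inside that radius simply drops out of the results:

    DECLARE @max_metres FLOAT = 50000; --arbitrary 50km cap, purely illustrative

    SELECT Customer_ID, Store_ID, Distance_In_Metres
    FROM #Customers c
    CROSS APPLY (SELECT TOP 1 Store_ID,
                        s.Store_Location.STDistance(c.Customer_Location) Distance_in_Metres
                 FROM #Stores s WITH (INDEX(IDX_Stores_Spatial))
                 --the bounded distance predicate is what lets the spatial index be used
                 WHERE s.Store_Location.STDistance(c.Customer_Location) < @max_metres
                 ORDER BY s.Store_Location.STDistance(c.Customer_Location)
                ) nearest_store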

    I'm guessing from that article that they've added a specific optimisation in Denali to stop it having to calculate STDistance for the Cartesian product before filtering down to the nearest, but it may not exist in 2008.

  • HowardW (12/9/2011)


    All I'm trying to do is, with a list of stores (about 120) and a list of customers (about 2.3 million in real life), assign each customer their closest store. This is for a Data Warehouse query, so it has to process the whole batch each night: stores can be added at any time, which potentially affects any number of rows, so I can't just maintain all of this at time of entry.

    I'm unsure whether geospatial indexes can help with this as I effectively have to calculate the distance for everything all the time, meaning scans all round.

    Maybe you can cut this down. When a new store is added, there are only a handful of existing stores whose customers could possibly be closer to the new store than to their current store, so you only need to re-evaluate the customers of that small handful of stores rather than all customers. It should be fairly simple to identify that small handful of stores.
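    Something like this as a very rough sketch (I can't test the spatial syntax here, and the Customer_Nearest_Store table and new store ID are made up - I'm assuming you persist the assignments and distances somewhere). It skips working out which existing stores are affected and just compares each customer's stored distance against the new store, which is still only one distance calculation per customer instead of 120:

    --Hypothetical persisted assignment table: Customer_Nearest_Store(Customer_ID, Store_ID, Distance_In_Metres)
    DECLARE @New_Store_ID INT = 121; --made-up ID for the store that's just been added

    --only customers who are closer to the new store than to their current store get touched
    UPDATE cns
    SET    Store_ID = @New_Store_ID,
           Distance_In_Metres = s.Store_Location.STDistance(c.Customer_Location)
    FROM   Customer_Nearest_Store cns
    JOIN   #Customers c ON c.Customer_ID = cns.Customer_ID
    JOIN   #Stores s ON s.Store_ID = @New_Store_ID
    WHERE  s.Store_Location.STDistance(c.Customer_Location) < cns.Distance_In_Metres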

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Nuts. I wasn't paying attention... again...

    Yeah, 2008 is problematic. They fixed stuff in R2. If you're doing spatial, you really need to move. That's the only good solution. Sorry.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • drew.allen (12/9/2011)


    Maybe you can cut this down. When a new store is added, there are only a handful of existing stores whose customers could possibly be closer to the new store than to their current store, so you only need to re-evaluate the customers of that small handful of stores rather than all customers. It should be fairly simple to identify that small handful of stores.

    Drew

    I did consider that. It's easy enough for stores being deleted, but I can't currently wrap my head around the logic to work out the sphere of influence for new stores - any pointers would be gratefully received 🙂

    Grant, I don't suppose you have an R2 instance kicking about that you could quickly run the test data through to see if you get a better plan? I could consider an upgrade, but I'd like to work out to what extent it might help. I thought the R2 enhancements for geospatial data were all in SSRS; I didn't realise they'd changed the optimisation in the database engine as well.

  • HowardW (12/9/2011)


    I did consider that. It's easy enough for stores being deleted, but I can't currently wrap my head around the logic to work out the sphere of influence for new stores - any pointers would be gratefully received 🙂

    Unfortunately, we're still on SQL 2005, so I haven't been able to work with geospatial data at all.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Looking at the query, we're really going about it the hard way. A little searching brought up this, and Isaac Kunen's solution. Try that instead.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks Grant, I'll give it a try and post back.

  • Howard, a couple of quick questions:

    (1) Is there a maximum distance from a customer to a store? If they live in rural Alaska, do you ever just say "There is no store close to you."

    (2) How many stores are we talking about? Not customers, just stores.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Hi Dixie,

    No, there's no limit as such, although each store has a marketing radius assigned to it, which will be used in combination with the distance to say whether they have a nearest store for Marketing purposes.

    I've just checked the real data and the furthest distance from a store out of all 2.3 million customers is 159 miles, but they're fairly anomalous (the average is just 11.19 miles).

    In the real data, there are 2.3 million customers and 120 stores (and growing).

    I think the link Grant posted is going to be the solution - it's quite ingenious. It does a series of index seeks, using a tally table to increase the seek radius until it finds a store. With the correct start value and power, I'm thinking you could resolve 99% of the records with just 1-2 spatial index seeks rather than calculating STDistance for all 120 stores.

  • Initial testing with the sample data improves the overall time/CPU by a factor of 2, but since all the data points are randomised this isn't that representative (in real life, both stores and customers will tend to be clustered around population centres, making this method much more effective). By FAR the quickest with this test data is setting the spatial index to LOW density. For those interested, here's the query with Isaac's method:

    --Create Numbers table for Isaac's solution
    SELECT TOP 100000 IDENTITY(int,1,1) AS n
    INTO #numbers
    FROM MASTER..spt_values a, MASTER..spt_values b

    CREATE UNIQUE CLUSTERED INDEX idx_1 ON #numbers(n)
    GO

    DECLARE @start FLOAT = 10000;

    SELECT Customer_ID, Store_ID, Distance_In_Metres
    FROM #Customers c
    CROSS APPLY (SELECT TOP(1) *
                 FROM (SELECT TOP(1) WITH TIES *,
                              s.store_location.STDistance(c.customer_location) AS distance_in_metres
                       FROM #Numbers
                       JOIN #Stores s WITH(INDEX(IDX_Stores_Spatial))
                         ON s.store_location.STDistance(c.customer_location) < @start*POWER(2,#Numbers.n)
                       ORDER BY n) NearestPoints
                 ORDER BY n, distance_in_metres
                ) nearest_store

    /*
    SQL Server parse and compile time:
       CPU time = 13 ms, elapsed time = 13 ms.

    SQL Server Execution Times:
       CPU time = 0 ms, elapsed time = 0 ms.

    (100000 row(s) affected)

    Table 'Worktable'. Scan count 8, logical reads 3741591, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#Stores_____________________________________________________________________________________________________________00000000D44E'. Scan count 8, logical reads 24, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'extended_index_387857540_384000'. Scan count 2885287, logical reads 5770574, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#numbers____________________________________________________________________________________________________________00000000D44A'. Scan count 100000, logical reads 200000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#Customers__________________________________________________________________________________________________________00000000D44F'. Scan count 9, logical reads 485, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
       CPU time = 502591 ms, elapsed time = 133353 ms.
    */

    The start value can be played around with, but the optimum will depend on how the data is distributed. I'm going to try converting this to run against the real data now, where hopefully it'll perform a lot better. There are some very high scan counts and logical reads in this; I'm still trying to work out why, though it may just be normal for spatial indexes.
