November 3, 2013 at 1:36 am
Hi all, I was wondering if anyone could chime in with random tips or thoughts on what I'm doing.
So I've implemented a GEO/IP table that includes geographic data mapped to IP ranges and those range bounds converted into an integer. In addition, I've created a geometry column that converts each IP range into a LINESTRING whose bounds are determined by the converted IP start and end points.
There is a clustered PK on an auto inc ID column and a spatial index on the geometry column.
My query to match fact table data to geo data uses a LEFT JOIN on the geo table using:
<geo_column>.STcontains( <fact IP converted to a geometric point> ) = 1
Initially I tested a TOP 100,000 rows of the fact table to benchmark speed and I got back about 6.2 minutes for the query to come back.
Integrated into the actual report it was intended for however the query simply never comes back, even when the # of fact rows being used as the base is only ~29,000 rows.
Anyone have any thoughts or tips on optimizing the performance of STContains/Spatial Indexes?
I had high hopes for this method of IP range matching but it seems like I am stuck in a rut now.
November 3, 2013 at 2:48 am
Actual execution plan would have helped a lot, but here is my shot in the dark.
Spatial indexes work automatically only in enterprise and development editions, so if you have standard you need to use hint.
Second guess is that bounding box of spatial index is incorrect.
If those two are fine. Please post execution plan and some sample data.
November 3, 2013 at 4:43 am
I am not an expert on Spatial data, but I do know a few things:
* The reply by Ville-Pekka is incorrect. I think he is confusing two features. Indexed views is the only feature that works automatically in enterprise but needs hints in lower editions; all other features either work automatically on all editions, or don't work at all on lower features.
* Are you running on SQL Server 2008? If so, check SELECT @@VERSION to see what, if any, service pack you have installed. If you are on RTM, that might be the cause - there were some huge imprevements to the costing of spatial queries that solved a lot of issues with spatial indexes not being used in SP1. See http://www.sqlskills.com/blogs/bobb/how-to-ensure-your-spatial-index-is-being-used/[/url].
* Other things that might help are simplfying the query (break it down in subqueries if needed), and executing it in such a way that the optimizer "knows" the parameter value. You may also need to add a recompile hint, to prevent parameter sniffing issues.
I do agree with Ville-Pekka that posting the execution plan would help. And the full query or stored procedure.
November 3, 2013 at 6:37 am
Hugo Kornelis (11/3/2013)
* The reply by Ville-Pekka is incorrect. I think he is confusing two features. Indexed views is the only feature that works automatically in enterprise but needs hints in lower editions; all other features either work automatically on all editions, or don't work at all on lower features.
No I did not confuse. I just did not know this and made a false assumption. My empirical studies has just shown that standard and express editions rather choose some other index than spatial if there is an option. My test can be bad because I know for sure that those queries are not near optimal where this has happened.
Thank you Hugo for correction.
November 3, 2013 at 11:11 am
OK so the query looks as follows:
SELECT
<fact table attributes>
,Tr.Country
,tr.city
,tr.Region
,Tr.dmacode
FROM #fact_table ip
LEFT JOIN GEO2IP tr ON geo_range.STWithin(geometry::STGeomFromText('POINT('+CONVERT(varchar(20),ip.IP_Converted)+' 0)',0)) = 1
Here's the relatively simple DEP:
I admit DEP analysis is not a strong point of mine yet-- I know Table Scans are usually bad, but if the cost of it is 0% doesn't that mean it wouldn't be causing much of a slow down? Looking at this DEP, it seems to me that STContains really just isn't a very good function performance wise.
November 3, 2013 at 11:16 am
A picture of the plan is fairly useless. Can you save the plan and attach the resultant file to your post?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
November 3, 2013 at 11:18 am
I'm hesitant to since this is my company's data. Any advice to offer without me having to upload specific details of the underlying structure? Things to look out for etc..
November 3, 2013 at 11:46 am
razzgor (11/3/2013)
I'm hesitant to since this is my company's data. Any advice to offer without me having to upload specific details of the underlying structure? Things to look out for etc..
You can anonymous company data with
http://www.sqlsentry.com/plan-explorer/sql-server-query-view.asp#features.
In picture optimizer is not using spatial index even though it could.
November 3, 2013 at 11:56 am
Hugo Kornelis (11/3/2013)
* Are you running on SQL Server 2008? If so, check SELECT @@VERSION to see what, if any, service pack you have installed. If you are on RTM, that might be the cause - there were some huge imprevements to the costing of spatial queries that solved a lot of issues with spatial indexes not being used in SP1. See http://www.sqlskills.com/blogs/bobb/how-to-ensure-your-spatial-index-is-being-used/[/url].
Definitely make sure you are not running 2008 RTM, otherwise hinting is really the only sure way to go and that can also cause issues.
Ville-Pekka Vahteala
Second guess is that bounding box of spatial index is incorrect.
Agreed, when creating the index make your bounding box as tight as possible.
Are you using the same SRIDs between you Geometries? In your query you build the point using SRID 0. Is the SRID of the geometries in GEO2IP also 0?
Assuming you having polygons in GEO2IP, you will want to use STContains or STIntersects.
With the LEFT OUTER JOIN on the query, the spatial comparison is rather redundant. Try an INNER JOIN.
If you could post the create script for your spatial index and the DDL for the GEO2IP table that would be helpful. Approximately how many records does that have?
Are the coordinates Lat/Lon's or projected?
November 3, 2013 at 12:18 pm
mickyT (11/3/2013)
Hugo Kornelis (11/3/2013)
* Are you running on SQL Server 2008? If so, check SELECT @@VERSION to see what, if any, service pack you have installed. If you are on RTM, that might be the cause - there were some huge imprevements to the costing of spatial queries that solved a lot of issues with spatial indexes not being used in SP1. See http://www.sqlskills.com/blogs/bobb/how-to-ensure-your-spatial-index-is-being-used/[/url].
Definitely make sure you are not running 2008 RTM, otherwise hinting is really the only sure way to go and that can also cause issues.
Ville-Pekka Vahteala
Second guess is that bounding box of spatial index is incorrect.
Agreed, when creating the index make your bounding box as tight as possible.
Are you using the same SRIDs between you Geometries? In your query you build the point using SRID 0. Is the SRID of the geometries in GEO2IP also 0?
Assuming you having polygons in GEO2IP, you will want to use STContains or STIntersects.
With the LEFT OUTER JOIN on the query, the spatial comparison is rather redundant. Try an INNER JOIN.
If you could post the create script for your spatial index and the DDL for the GEO2IP table that would be helpful. Approximately how many records does that have?
Are the coordinates Lat/Lon's or projected?
1) SQL Server 08 R2 SP2
2) The bounding box is set as the MIN() and MAX() of the integer conversion of the start and end IP addresses. I don't know how you could improve this.
3)The SRIDs match because in my unit testing expected results were indeed brought back correctly.
4)The LEFT JOIN is necessary because GEO data mapped to a single IP address is not housed in the same table as "true" ranges.
5)The GEO 2 IP table features somewhere in the neighborhood of 3 million rows
6)I'm not sure I understand the last question about coordinates, could you please clarify?
November 3, 2013 at 12:20 pm
Hi Razzgor,
Thanks for posting the plan. Like I said, I'm not a spatial expert - but I do know a fair bit about plans. So let's start with some generic stuff.
1. Don't trust the percentages in the plan. These are based on the estimated cost of the operator, not on the actual cost (which is not measured per operator). If slow processing is caused by bad estimates, these percentages are based on those bad estimates. They will not show where the real pain is.
2. The table scan is not a problem. The query has no filter on the #fact_table, so it will need to read all rows. Scanning is the fastest and most efficient way to fetch all rows. The fact that it's a table scan, not a (clustered) index scan, proves that you have no clustered index on the table - for a permanent table almost always a bad thing, for a temp table not so much, though it can be. But in this case not the problem.
3. Outer joins give the optimizer much less freedom than inner joins. Do you really need this to be an outer join? Even if that is the case, I would still like to know if you get much better performance from an inner join.
4. The way you built the query makes it hard for the optimizer to optimize. I don't know the datatype and contents of the IP_Converted column, but I do know that you convert it to varchar(20), then convert that (using STGeomFromText) to a point, which is then fed into the STWithin function. This means that whatever statistics the optimizer may have on IP_Converted is useless for this query. Do you really need this complex double converston?
Can you try adapting the procedure to add an extra column to the temp table that holds the result of the conversion (so the point, as geometry data type)? It can be a normal column, or a persisted computed column. Then build a spatial index on that column too. Does this help?
5. You wrote in the original post that your benchmark worked fine, but it doesn't work well from within the report. Is the code exactly the same, or are there differences? And how about the execution plans? (I don't know how your report works; you may have to use profiler or extended events to get the exact query).
6. The plan you posted will fetch each row from the temporary fact table (the top-right table scan), then do some arithmetic in the Compute Scalar (that's where the Geomerty value is calculated from the IP_Converted column). Then, for each row, the GEO2IP table is scanned (the clustered index scan bottom-right - this IS a big part of the problem); those rows are passed through a filter (I'm willing to bet that this includes the georange.STwithin calculation), and then the join is performed. For a query like this, I would much rather have seen a merge or hash join, but I think those are not possible for geometry/geography functions. A nested loops join like this, but using the spatial index in the bottom part (instead of the clustered index scan + filter), would probably be the best option. Which, of course, brings us back to your original question. 😉
Can you try to create a repro? I understand that the data, and maybe even the schema, is confidential. That doesn't hinder a repro. You can create new tables, with fake column names, put in some fake data (but similar enough to your real problem to demonstrate the behaviour). Then post that schema (as CREATE TABLE statements) and data (as INSERT statements) here. Don't go overboard with the sample data - if you can repro the issue with 10 rows, don't post 1,000.
Some other questions:
* How many rows are in each of the two tables involved?
* Are the values in IP_Converted all different? If not, how many distinct values are there? If it's low, you could try to rewrite the query to nudge the optimizer to a plan that will first order in IP_Convnerted and perform the join only once per distinct value.
* In an actual execution plan, what are the rowcounts (both estimated and actual) for all of the arrows? And what are the execution counts (both estimated and actual) for the operators on the bottom part of the nested loops join?
* If I understand the query, you have a bunch of points in the temp table (in IP_Converted), a bunch of areas in the geo_range, and now want to report eachh point with all areas that enclose the point. An idea I have might or might not help at all - depending on the actual data distribution. What if you first create a new geometry object that includes all the points in the temp table (no idea how to do that - like I said, I have no experience with Spatial), then use STOverlaps between that area and the data in GEO2IP to quickly weed out all rows that can never overlap with anything. (This would be useful if, for instance, GEO2IP covers the whole world and the temp table is always restricted to a small area, like a single state; it would be useful if the temp table can cover all of the area).
EDIT: Sorry for the overlaps with some previous questions; I somehow overlooked those message before posting mine.
November 3, 2013 at 1:41 pm
Hi
Having another look at your query, the points you are creating for the fact table are in a straight line on a zero y coordinate, unless you are creating a point with zero height.
FROM #fact_table ip
LEFT JOIN GEO2IP tr ON geo_range.STWithin(geometry::STGeomFromText(
'POINT('+CONVERT(varchar(20),ip.IP_Converted)+' 0)',0)) = 1
So when you build a point it will look something like POINT(2155.21 0), this may be part of your problem.
Can you give us an example of a point that you are querying with and a geo_range from your geo2ip table
SELECT TOP 1 geometry::STGeomFromText('POINT('+CONVERT(varchar(20),ip.IP_Converted)+' 0)',0).ToString() FROM #fact_table;
SELECT TOP 1 geo_range.ToString() FROM geo2ip;
November 3, 2013 at 2:40 pm
mickyT (11/3/2013)
HiHaving another look at your query, the points you are creating for the fact table are in a straight line on a zero y coordinate, unless you are creating a point with zero height.
FROM #fact_table ip
LEFT JOIN GEO2IP tr ON geo_range.STWithin(geometry::STGeomFromText(
'POINT('+CONVERT(varchar(20),ip.IP_Converted)+' 0)',0)) = 1
So when you build a point it will look something like POINT(2155.21 0), this may be part of your problem.
Can you give us an example of a point that you are querying with and a geo_range from your geo2ip table
SELECT TOP 1 geometry::STGeomFromText('POINT('+CONVERT(varchar(20),ip.IP_Converted)+' 0)',0).ToString() FROM #fact_table;
SELECT TOP 1 geo_range.ToString() FROM geo2ip;
Yes, that's intentional. The geometry built from the IP ranges are LINESTRINGs so there is no height
Here's kind of a quick rundown on how the tables are structured:
@hugo, thanks for the huge post full of information. I am reading through and will post a response once I organize my thoughts.
November 3, 2013 at 5:01 pm
Hi
Sorry for the delay getting back ... I see what you are doing now. While you can use geometry to do this, it probably isn't the quickest way to resolve the query. You would probably be better with left outer join geo2ip on ip_convert between ip2startinteger and ip2endinteger
Also STIntersects is probably the best method to use for this (point on line), if you decide to carry on with geometries.
I did a bit of testing on this using some made up data. The one thing that did kill the query was when I used the following spatial index definition across my range.
create spatial index g2i_sdx on #geo2ip (geo_range)
using geometry_grid
with (bounding_box = (0, -2, 2001000, 0), grids=(HIGH,MEDIUM,LOW,LOW));
So avoid using 0 in either the miny or maxy bounding box parameters.
This is the test script that I ran. I was suprised to see that the geometry query perform as well if not better than the int between query. So I've had to eat my words above.
-- fact table with 10000 points some overlapping some not
create table #fact_table (
id int identity primary key,
ip_converted int
);
with tally as (
select row_number() OVER (order by (select null)) N
from (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) E1 (N),
(values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) E2 (N)
),
createFact as (
select CAST(RAND(CAST(NEWID() AS VARBINARY)) * 2000000.0 AS INT) ip_converted
from tally x, tally y
)
insert into #fact_table ( ip_converted)
select ip_converted
from createFact;
-- Create a grid table with 1,000,000 cells
create table #geo2ip (
id int identity primary key
,geo_range geometry
,ipstartint int
,ipendint int
);
with tallyPlus as (
select row_number() OVER (order by (select null)) N,
CAST(RAND(CAST(NEWID() AS VARBINARY)) * 1000000.0 AS INT) R
from (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) E1 (N),
(values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) E2 (N),
(values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) E3 (N)
),
createLine as (
select Geometry::STGeomFromText('LINESTRING ('
+ CAST(cast(x.R + y.R as int) AS VARCHAR(10)) + ' 0, ' +
+ CAST(cast(x.R + y.R + 2 as int) AS VARCHAR(10)) + ' 0)',0) geo_range,
cast(x.R + y.R as int) ipstartint,
cast(x.R + y.R + 2 as int) ipendint
from tallyPlus x, tallyPlus y
)
insert into #geo2ip (geo_range, ipstartint, ipendint)
select geo_range, ipstartint, ipendint
from createLine;
create spatial index g2i_sdx on #geo2ip (geo_range)
using geometry_grid
with (bounding_box = (0, -2, 2001000, 1), grids=(HIGH,MEDIUM,LOW,LOW));
create index g2i_ip_idx on #geo2ip (ipstartint, ipendint);
select f.id,
g.id,
ip_converted
from #fact_table f
left outer join #geo2ip g on geo_range.STIntersects(
geometry::STGeomFromText('POINT('+CONVERT(varchar(20),f.IP_Converted)+' 0)',0)
) = 1;
select f.id,
g.id,
ip_converted
from #fact_table f
left outer join #geo2ip g on ip_converted between ipstartint and ipendint;
drop table #fact_table;
drop table #geo2ip;
November 3, 2013 at 5:10 pm
Thanks, I will ad STIntersects to my list of things to test out. And yeah, BETWEEN on integer comparison absolutely kills performance, it was the reason I started doing this experimental stuff with spatial data in the first place. I'm not sure, but I am pretty sure that I set the miny and maxy bounding box to -1 and 1 respectively.
Viewing 15 posts - 1 through 15 (of 15 total)
You must be logged in to reply to this topic. Login to reply