November 24, 2014 at 8:07 am
Hi,
I have a table with 257 mil records with latitude and longitude data.
My goal is to find the closest intersecting values from a locations table (88 rows) and update any of the 257 mil records that are applicable with the location_Name and Location_Group_Name.
The query I have works but doesn't perform well on such a big data set.
CREATE TABLE #Positions -- Base table 257 mil rows. Actual table has 20 columns
(
    ID BigInt PRIMARY KEY,
    Latitude Dec(10,6),
    Longitude Dec(10,6),
    Location_Name Varchar(50),
    Location_Group_Name Varchar(50)
)
CREATE TABLE #Locations -- 88 rows
(
    ID Int,
    Latitude Dec(10,6),
    Longitude Dec(10,6),
    GeoShape GeoGraphy,
    Loc_Name Varchar(50),
    Loc_Grp_Name Varchar(50)
)
Attached you will find the tables, test data, a function to measure distance and some queries that work but are too slow for this much data.
Here are the best resources i found.
Question asked here by drjdewhurst.
I have only been working with this stuff for a week now, could really use some help.
November 24, 2014 at 9:03 am
You will get a lot more responses if you post your tables and data in a consumable format. Not many people are willing to download rar files from people they don't know.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
November 24, 2014 at 9:29 am
Thanks for the tip Sean. No .sql available, had to make a choice.
Here it is again as .doc and .zip.
Can also do .txt or post all 1130 lines of code if anyone prefers.
November 24, 2014 at 9:42 am
DennisPost (11/24/2014)
Thanks for the tip Sean. No .sql available, had to make a choice. Here it is again as .doc and .zip.
Can also do .txt or post all 1130 lines of code if anyone prefers.
Replacing one potentially dangerous file type with another is not going to do much for you. What do you mean by no .sql available? Don't you have the content you could save as a file like that? It wouldn't be too hard to change the extension of a text file to .sql.
Why do you need 1130 lines to demonstrate the issue here? Or are you saying your function is 1130 lines?
November 25, 2014 at 12:28 am
Sean Lange (11/24/2014)
DennisPost (11/24/2014)
Thanks for the tip Sean. No .sql available, had to make a choice. Here it is again as .doc and .zip.
Can also do .txt or post all 1130 lines of code if anyone prefers.
Replacing one potentially dangerous file type with another is not going to do much for you. What do you mean by no .sql available? Don't you have the content you could save as a file like that? It wouldn't be too hard to change the extension of a text file to .sql.
Why do you need 1130 lines to demonstrate the issue here? Or are you saying your function is 1130 lines?
When I try to upload my file as .sql it says (see the .png):
A file you attempted to upload is not a permitted type.
I need 1130 lines of code to demonstrate how I am tackling this. Test data for the #Positions table is 1000 rows, plus 20 for the #Locations table, and some more for the function I am using to measure distance. If you know of a way to generate random data for geography datatypes and how to generate latitudes and longitudes that intersect with them, I'm keen to learn. :-)
Which format would you choose?
November 25, 2014 at 7:48 am
Post your data as all text, as insert statements into your demonstration tables. Obviously this works best if you don't post zettabyte databases this way, LOL; just post a sample set that demonstrates the issue.
edit:
http://msdn.microsoft.com/en-us/library/bb895266.aspx might help build some sample data from textual geographical coordinates.
-- basically you want a stack of these sorts of things:
INSERT INTO SpatialTable (GeogCol1)
VALUES (geography::STGeomFromText('LINESTRING(-122.360 47.656, -122.343 47.656)', 4326));
INSERT INTO SpatialTable (GeogCol1)
VALUES (geography::STGeomFromText('POLYGON((-122.358 47.653, -122.348 47.649, -122.348 47.658, -122.358 47.658, -122.358 47.653))', 4326));
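If you need more than a handful of test rows, one rough way (just a sketch, nothing I've tested at scale, and it assumes the #Locations table from your first post) is to generate random points inside each shape's bounding box and keep only the ones that actually land inside the shape:
-- Rough sketch only: random candidate points inside each location's bounding box,
-- filtered down to the ones that really intersect the shape. The 1000-row tally
-- and all names are taken from the first post or invented here.
WITH Tally AS
(
    SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
    FROM sys.all_columns
)
SELECT L.ID AS Loc_ID,
       Pt.TestPoint.Lat  AS Latitude,
       Pt.TestPoint.Long AS Longitude
FROM #Locations L
CROSS JOIN Tally T
CROSS APPLY
(
    -- Bounding box of the shape, via a geometry round trip so STEnvelope is available.
    SELECT geometry::STGeomFromWKB(L.GeoShape.STAsBinary(), 4326).STEnvelope() AS Env
) G
CROSS APPLY
(
    -- Random point inside that box: X is longitude, Y is latitude.
    SELECT geography::Point
    (
        G.Env.STPointN(1).STY + (G.Env.STPointN(3).STY - G.Env.STPointN(1).STY) * RAND(CHECKSUM(NEWID())),
        G.Env.STPointN(1).STX + (G.Env.STPointN(3).STX - G.Env.STPointN(1).STX) * RAND(CHECKSUM(NEWID())),
        4326
    ) AS TestPoint
) Pt
WHERE L.GeoShape.STIntersects(Pt.TestPoint) = 1;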
November 25, 2014 at 11:44 am
Also check out this link: http://msdn.microsoft.com/en-us/magazine/jj133823.aspx
This talks about putting locations into "sectors" to chip away at the efficiency problems, maybe you can get some insight into some of the issues you'll run into with these sort of big geographical datasets.
Interesting stuff!
edit: as a bonus, that page has c# routines to create dummy datasets.
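If it helps to see the shape of the sector idea in T-SQL, here's a very rough sketch against the tables from the first post. The 0.01-degree bucket (roughly a kilometer) is an arbitrary assumption, and a real version would also have to consider neighbouring buckets and shapes that span several buckets, so treat it as an illustration only:
-- Sketch: bucket both sides into a coarse grid and only run the expensive
-- STIntersects check on pairs that share a bucket. Bucket size is a guess.
SELECT P.ID,
       L.Loc_Name
FROM #Positions P
JOIN #Locations L
  ON FLOOR(P.Latitude * 100) = FLOOR(L.Latitude * 100)
 AND FLOOR(P.Longitude * 100) = FLOOR(L.Longitude * 100)
WHERE L.GeoShape.STIntersects(geography::Point(P.Latitude, P.Longitude, 4326)) = 1;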
November 26, 2014 at 6:45 am
Thanks for the link Patrick, was a good read.
We use the sectors technique for other purposes, but our sectors are 6m x 6m. Unfortunately even sectors this small are not accurate enough.
Most of the #Position and #Location data will be in one city.
Here's the simplified sample code.
USE TempDB
GO
IF OBJECT_ID('TempDB..#Positions') IS NOT NULL
DROP TABLE #Positions
CREATE TABLE #Positions -- Actual table has 257 mil rows, 20 columns
(
    Position_ID BigInt CONSTRAINT PK_PosID PRIMARY KEY,
    Latitude Dec(10,6),
    Longitude Dec(10,6),
    Location_Name Varchar(50)
)
INSERT #Positions
SELECT 581763287, 51.546783, 4.803814, NULL UNION ALL
SELECT 581763288, 52.070960, 5.123537, NULL UNION ALL
SELECT 581773709, 51.118755, 3.561964, NULL UNION ALL
SELECT 581773710, 51.118759, 3.562069, NULL UNION ALL
SELECT 581773711, 51.118816, 3.562144, NULL
IF OBJECT_ID('TempDB..#Locations') IS NOT NULL
DROP TABLE #Locations
CREATE TABLE #Locations -- 88 rows
(
    Loc_ID Int CONSTRAINT PK_LocID PRIMARY KEY,
    Latitude Dec(10,6),
    Longitude Dec(10,6),
    GeoShape GeoGraphy,
    Loc_Name Varchar(50)
)
INSERT #Locations
SELECT 638767, 52.070732, 5.123783, 0xE6100000010405000000FA8777B2D7084A4077EDB30B107C144039C124B041094A4077EDB30B107C144039C124B041094A40E42C30C86F811440FA8777B2D7084A40E42C30C86F811440FA8777B2D7084A4077EDB30B107C144001000000020000000001000000FFFFFFFF0000000003, 'Location 1' UNION ALL
SELECT 797174, 51.546888, 4.803603, 0xE6100000010405000000DF03BDD1CBC54940170659B43A341340F58B15D235C64940170659B43A341340F58B15D235C649401C97157D8A391340DF03BDD1CBC549401C97157D8A391340DF03BDD1CBC54940170659B43A34134001000000020000000001000000FFFFFFFF0000000003, 'Location 2'
IF EXISTS (SELECT * FROM sys.objects WHERE Name = 'GetDistance' AND [Type] = 'FN')
DROP FUNCTION GetDistance
GO
CREATE FUNCTION [dbo].[GetDistance]
(
@Latitude1 Decimal(10,6),
@Longitude1 Decimal(10,6),
@Latitude2 Decimal(10,6),
@Longitude2 Decimal(10,6)
)
RETURNS INT WITH SCHEMABINDING
AS
BEGIN
DECLARE @Temp Float
SET @Temp = SIN(@Latitude1/57.2957795130823) *
SIN(@Latitude2/57.2957795130823) +
COS(@Latitude1/57.2957795130823) *
COS(@Latitude2/57.2957795130823) *
COS(
@Longitude2/57.2957795130823 -
@Longitude1 /
57.2957795130823
)
IF @Temp > 1
SET @Temp = 1
ELSE IF @Temp < -1
SET @Temp = -1
RETURN CAST((3958.75586574 * ACOS(@Temp) * 1609.344) AS INT)
END
GO
UPDATE #Positions
SET Location_Name = L.Loc_Name
FROM #Positions P WITH (NOLOCK)
OUTER APPLY
(
    SELECT TOP(1) L.Loc_Name
    FROM #Locations L WITH (NOLOCK)
    WHERE L.GeoShape.STIntersects(geography::Point(P.Latitude, P.Longitude, 4326)) = 1
    ORDER BY dbo.GetDistance(P.Latitude, P.Longitude, L.Latitude, L.Longitude) ASC
) L
SELECT *
FROM #Positions
IF OBJECT_ID('TempDB..#Positions') IS NOT NULL
DROP TABLE #Positions
IF OBJECT_ID('TempDB..#Locations') IS NOT NULL
DROP TABLE #Locations
IF EXISTS (SELECT * FROM sys.objects WHERE Name = 'GetDistance' AND [Type] = 'FN')
DROP FUNCTION GetDistance
November 26, 2014 at 8:07 am
Right off hand, you might want to try to convert that function into an inline calculation. Other experts might give better advice, but it might be the case that multiline functions exact a performance penalty. Yes, moving the calculations to the update statement will probably be cumbersome, but I think sacrificing some readability for getting rid of some of what I guess is an interpretation based penalty might pay off.
Something like the following, although I doubt I'm doing exactly the right thing, as I hacked off a few things that you might need included (casting as int, limiting the figures to -1 or 1, etc...) just to get your sample to run 🙂
UPDATE #Positions
SET Location_Name = L.Loc_Name
FROM #Positions P WITH (NOLOCK)
OUTER APPLY
(
SELECT TOP(1) L.Loc_Name
FROM #Locations L WITH (NOLOCK)
WHERE L.GeoShape.STIntersects(geography::Point(P.Latitude, P.Longitude, 4326)) = 1
ORDER BY
--dbo.GetDistance(P.Latitude, P.Longitude, L.latitude, L.longitude) ASC
(3958.75586574 * ACOS
(
SIN(P.Latitude/57.2957795130823) *
SIN(L.Latitude/57.2957795130823) +
COS(P.Latitude/57.2957795130823) *
COS(L.Latitude/57.2957795130823) *
COS(
L.Longitude/57.2957795130823 -
P.Longitude /
57.2957795130823
)
)
)
* 1609.344
--
) L
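Another thing that might be worth a try, offered purely as a sketch I haven't benchmarked on anything near 257 million rows: skip the haversine math entirely and let the geography type do the distance, since STDistance already returns meters for SRID 4326:
-- Sketch: order by the built-in geography distance instead of the scalar UDF.
UPDATE P
SET Location_Name = Locs.Loc_Name
FROM #Positions P
OUTER APPLY
(
    SELECT TOP(1) L.Loc_Name
    FROM #Locations L
    WHERE L.GeoShape.STIntersects(geography::Point(P.Latitude, P.Longitude, 4326)) = 1
    ORDER BY geography::Point(L.Latitude, L.Longitude, 4326).STDistance(geography::Point(P.Latitude, P.Longitude, 4326)) ASC
) Locs;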
November 26, 2014 at 10:53 am
Even completely removing the ORDER BY doesn't help. :(
November 26, 2014 at 12:54 pm
DennisPost (11/26/2014)
Even completely removing the ORDER BY doesn't help. :(
Yeah, that does sort of indicate that 257 million rows is a lot of data to push through looking for nearby neighbors. It seems doable since you're only checking 88 locations, but most of my SQL is counting stuff and drawing reports, so hopefully someone else can lend some insight.
Here's some stuff on spatial indexing: http://technet.microsoft.com/en-us/library/bb964712(v=sql.105).aspx , which I found pretty interesting. Still, the fact that you removed the ORDER BY without even getting a performance improvement (not even taking correctness into consideration) does tell a bit. Maybe try 88 passes through the dataset, looking for intersections for each location, but that's just moving the right side of the APPLY to the left, if that even makes sense to do.
Sorry to have you spend your time uploading samples without me actually helping; it IS really helpful for the forum when samples are posted, but that only applies when good help is actually forthcoming.
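Also, since a spatial index needs an actual geography column to sit on, one hedged idea (untested by me at this size, and the column and index names below are made up) would be to persist the point on your real Positions table and index both sides, something like:
-- Sketch: store the point once as a real geography column and index it.
-- A spatial index also requires a clustered primary key on the table, and the
-- one-off UPDATE over 257 million rows would need to be batched in practice.
ALTER TABLE dbo.Positions ADD GeoPoint geography NULL;

UPDATE dbo.Positions
SET GeoPoint = geography::Point(Latitude, Longitude, 4326)
WHERE GeoPoint IS NULL;

CREATE SPATIAL INDEX IX_Positions_GeoPoint ON dbo.Positions (GeoPoint);
CREATE SPATIAL INDEX IX_Locations_GeoShape ON dbo.Locations (GeoShape);

-- The lookup can then reference the stored column so the indexes are usable:
-- WHERE L.GeoShape.STIntersects(P.GeoPoint) = 1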
December 10, 2014 at 6:40 am
I was able to speed up the query by correcting the geography ring orientation, which reduced the intersect results.
In the end a VIEW was sufficient as only a small portion of the position rows are queried.
SELECT P.*
, Locs.Location_Name
, Locs.Location_Group_Name
FROM dbo.Positions P WITH (NOLOCK)
OUTER APPLY
(
SELECT TOP(1) Location_Name, Location_Group_Name
FROM Locations L WITH (NOLOCK)
CROSS APPLY
(
VALUES
(
-- Repair the ring orientation on geography datatype.
Geography::STGeomFromWKB
(Geometry::STGeomFromText(L.Geography_Shape.STAsText(), 4326).STUnion
(Geometry::STGeomFromText(L.Geography_Shape.STAsText(), 4326).STStartPoint()
).STAsBinary(), 4326))
) Val1(GeogShape)
WHERE Val1.GeogShape.STIntersects(geography::Point(P.Latitude, P.Longitude, 4326)) = 1
ORDER BY dbo.GetDistance(P.Latitude, P.Longitude, L.Latitude, L.Longitude) ASC
) Locs
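If the shapes change less often than the view is read, the same repair could also be persisted once with an UPDATE so the round trip isn't repeated per query. A rough sketch of that idea (on SQL Server 2012 and later, ReorientObject() would be a simpler alternative; the EnvelopeAngle() test is just a common way to spot inverted rings):
-- Sketch: repair the ring orientation once and store it back on the table,
-- using the same geometry round trip as the view above.
UPDATE L
SET Geography_Shape = Geography::STGeomFromWKB
    (
        Geometry::STGeomFromText(L.Geography_Shape.STAsText(), 4326).STUnion
        (
            Geometry::STGeomFromText(L.Geography_Shape.STAsText(), 4326).STStartPoint()
        ).STAsBinary(), 4326
    )
FROM Locations L
WHERE L.Geography_Shape.EnvelopeAngle() > 90; -- only shapes that look inverted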
December 10, 2014 at 6:55 am
Thanks for the help patrickmcginnis59 10839.
You may not have given a complete answer, but I learned a lot from the links and insights you provided.
My colleague and I also devised a sectoring-style solution with all location IDs and their intersecting sectors. Unfortunately the resulting multi-billion row table made this unfeasible (+100K locations with +100 million used sectors). Any change to a location's shape, which happens frequently, would have to be recalculated. Other attempts to generate the relevant table data on the fly resulted in terrible performance.
December 18, 2014 at 9:37 am
Thanks for posting your conclusion as a followup, interesting stuff!