March 29, 2011 at 10:34 pm
I am trying to break free from my RBAR ways, but this one is stumping me. Can you provide any suggestions on how to accomplish the following in a more set-based approach?
I have a table that stores information about road segments. Each segment must have a begin-point and an end-point and is ranked based on some criteria (the criteria are not important to this question). A segment is 0.10 miles long, and a new segment begins every 0.01 miles. So there are lots of overlaping segments. My challenge is to 'group' segments together that rank above some threshold value (in this example 90%) and that have begin-points that are 'consecutive' along the roadway. That is, there is an unbroken sequence of begin-points along a physical roadway. If there is a break in begin-points, a new 'group' should be started.
Here is the relevant table structure:
CREATE TABLE [dbo].[RoadSegment](
[SegmentId] [nvarchar](25) NOT NULL,
[BeginPoint] [float] NOT NULL,
[EndPoint] [float] NOT NULL,
[SegmentRank] [float] NOT NULL,
[GroupId] [nvarchar](25) NULL
)
A small set of sample data:
INSERT INTO RoadSegment (SegmentId, BeginPoint, [EndPoint], SegmentRank) VALUES
('123', 0.00, 0.10, 90),
('124', 0.01, 0.11, 92),
('125', 0.02, 0.12, 95),
('126', 0.03, 0.13, 84),
('127', 0.04, 0.14, 95),
('128', 0.05, 0.15, 95)
In this example, segments with the IDs '123', '124', and '125' would be grouped together, i.e. get the same GroupId. A new group woud begin with segment '127' and include '128'. Segment '126' would not be included in any group because its rank was below the threshold value.
I can start off the data processing easily enough....SELECT * FROM RoadSegment WHERE SegmentRank >= 90 ORDER BY BeginPoint. But after that my old RBAR ways want to take over and iterate over each record:
psuedo-code:
IF CurrentBeginPoint - PreviousBeginPoint = 0.01
THEN CurrentGroupId = PreviousGroupId
ELSE CurrentGroupId = PreviousGroupId + 1
SET GroupId = CurrentGroupId
PreviousBeginPoint = CurrentEndPoint
PreviousGroupId = CurrentGroupId
FETCH NEXT record
Is there a better way of doing this?
Thank you.
March 30, 2011 at 12:37 am
Essentially what you need is a running total. You can read about that in an article from Jeff Moden here: http://www.sqlservercentral.com/articles/T-SQL/68467/
So something along this should for for you:
--Determine where each group should start
WITH CTE AS (SELECT SegmentId, BeginPoint, EndPoint, ROW_NUMBER() OVER (ORDER BY BeginPoint) AS RN FROM RoadSegment WHERE SegmentRank >= 90)
SELECT
C.SegmentId,
C.RN,
CASE WHEN C.BeginPoint = (P.BeginPoint + 0.01) THEN 0 ELSE 1 END AS NewGroup
INTO #Pass1
FROM CTE C
LEFT JOIN CTE P
ON (C.RN - 1) = P.RN;
--Setup for the "Quirky Update"
ALTER TABLE #Pass1 ALTER COLUMN RN INT NOT NULL;
ALTER TABLE #Pass1 ADD GroupNum INT NULL;
ALTER TABLE #Pass1 ADD CONSTRAINT Pass1_PK PRIMARY KEY CLUSTERED (RN ASC);
DECLARE
@Anchor INT,
@SUM INT;
SET @SUM = 0;
--Perform the "Quirky Update" (i.e. Running Total)
UPDATE #Pass1
SET
@Anchor = RN,
@Sum = GroupNum = @SUM + NewGroup
FROM #Pass1 WITH (TABLOCKX)
OPTION (MAXDOP 1);
--Store the group number in the source table
UPDATE RS
SET
GroupID = CAST(P1.GroupNum AS NVARCHAR(25))
FROM RoadSegment RS
INNER JOIN #Pass1 P1
ON RS.SegmentId = P1.SegmentId;
--Cleanup
DROP TABLE #Pass1;
--Display the results
SELECT
*
FROM RoadSegment
ORDER BY
BeginPoint;
I did it in a temporary table since the "Quirky Update" method has very specific requirements to make it work correctly, and then updated your table with the results. (Read and understand the article.)
I noticed that your GroupID column is a NVARCHAR, is there a reason you aren't using an INT?
BTW: Using the FLOAT type for your Begin/End Point probably isn't the best idea since it in an imprecise data type, in that what you put in is not always what you will get out. (Precision can be lost.) I would recommend using the DECIMAL/NUMERIC type instead.
March 30, 2011 at 12:49 am
This uses what is called the quirky update by some. It is one way of doing this, but the code below is is not completely correct, as I am sure Jeff will point out when he sees it.
I have left out several key components because of the size of the data set, plus it is getting late. I just wanted to demonstrate one way of doing this.
Read this article Solving the Running Total and Ordinal Rank Problems (Rewritten)[/url] and the discussions to understand this method better, especially if you intend to use this method in a production environment.
drop table [dbo].[RoadSegment];
go
CREATE TABLE [dbo].[RoadSegment](
[SegmentId] [nvarchar](25) NOT NULL,
[BeginPoint] [float] NOT NULL,
[EndPoint] [float] NOT NULL,
[SegmentRank] [float] NOT NULL,
[GroupId] [int] NULL -- change to int from nvarchar
);
GO
INSERT INTO RoadSegment (SegmentId, BeginPoint, [EndPoint], SegmentRank) VALUES
('123', 0.00, 0.10, 90),
('124', 0.01, 0.11, 92),
('125', 0.02, 0.12, 95),
('126', 0.03, 0.13, 84),
('127', 0.04, 0.14, 95),
('128', 0.05, 0.15, 95);
GO
declare @SeqRank float = 90,
@GrpId int = 0;
update dbo.RoadSegment set
GroupId = null; -- ensure GroupId is null before processing
with RoadSegments as (
select
SegmentId,
BeginPoint,
BeginPoint + 0.01 as NextPoint,
EndPoint,
SegmentRank,
GroupId
from
dbo.RoadSegment
where
SegmentRank >= @SeqRank
),
GroupedSegments as (
select
rs1.SegmentId,
rs1.BeginPoint,
rs1.NextPoint,
rs1.EndPoint,
rs1.SegmentRank,
rs1.GroupId,
rs2.BeginPoint as BeginPoint2
from
RoadSegments rs1
left outer join RoadSegments rs2
on rs1.BeginPoint = rs2.NextPoint
)
update dbo.RoadSegment set
@GrpId = GroupId = @GrpId + case when gs.BeginPoint2 is null then 1 else 0 end
from
dbo.RoadSegment rs
inner join GroupedSegments gs
on (rs.SegmentId = gs.SegmentId)
;
select
*
from
dbo.RoadSegment
where
GroupId is not null
order by
GroupId,
SegmentId
;
March 30, 2011 at 11:05 am
Thank you for the code and the article references. I am excited to read and understand both.
(I was a bit too casual in my table definition. I am using an int for the GroupId, and per the advice here, i will change the data type of the begin/end points).
Thanks again.
March 31, 2011 at 4:18 pm
CELKO (3/31/2011)
Cast the segment_id to integer and do the subtraction:('123', 0.00, 0.10, 90, ('123- 1))
('124', 0.01, 0.11, 92, ('124- 2))
('125', 0.02, 0.12, 95, ('125- 3))
('127', 0.04, 0.14, 95, ('127- 4))
('128', 0.05, 0.15, 95, ('128- 5))'
Celko,
I had thought about suggesting that method, but I made the assumption that I couldn't guarantee that the segment_id would always be an integer and always be in begin point order so I went with the safer, but less performant method.
March 31, 2011 at 4:25 pm
It was just coincidence / convenience regarding the order of segment IDs in the example I created. I think the method proposed by Celko wouldn't work out for that reason.
March 31, 2011 at 6:09 pm
First, thank you for your time. I do appreciate it.
The segment id is a varchar(24), and its value is assigned randomly-ish. There are some pertinent bits of info in the first 8 or so characters of the Id to help identify the road (county number, region, district, etc), but with the remaining characters there is not a particular sequence that we could count on to 'order' the segments.
I suspect adding another ROW_NUMBER() as an 'ordering surrogate' to view would be the way to go - but certainly not sure what that would look like.
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply