I need help breaking free from RBAR

  • I am trying to break free from my RBAR ways, but this one is stumping me. Can you provide any suggestions on how to accomplish the following in a more set-based approach?

    I have a table that stores information about road segments. Each segment must have a begin-point and an end-point and is ranked based on some criteria (the criteria are not important to this question). A segment is 0.10 miles long, and a new segment begins every 0.01 miles. So there are lots of overlaping segments. My challenge is to 'group' segments together that rank above some threshold value (in this example 90%) and that have begin-points that are 'consecutive' along the roadway. That is, there is an unbroken sequence of begin-points along a physical roadway. If there is a break in begin-points, a new 'group' should be started.

    Here is the relevant table structure:

    CREATE TABLE [dbo].[RoadSegment](

    [SegmentId] [nvarchar](25) NOT NULL,

    [BeginPoint] [float] NOT NULL,

    [EndPoint] [float] NOT NULL,

    [SegmentRank] [float] NOT NULL,

    [GroupId] [nvarchar](25) NULL

    )

    A small set of sample data:

    INSERT INTO RoadSegment (SegmentId, BeginPoint, [EndPoint], SegmentRank) VALUES

    ('123', 0.00, 0.10, 90),

    ('124', 0.01, 0.11, 92),

    ('125', 0.02, 0.12, 95),

    ('126', 0.03, 0.13, 84),

    ('127', 0.04, 0.14, 95),

    ('128', 0.05, 0.15, 95)

    In this example, segments with the IDs '123', '124', and '125' would be grouped together, i.e. get the same GroupId. A new group woud begin with segment '127' and include '128'. Segment '126' would not be included in any group because its rank was below the threshold value.

    I can start off the data processing easily enough....SELECT * FROM RoadSegment WHERE SegmentRank >= 90 ORDER BY BeginPoint. But after that my old RBAR ways want to take over and iterate over each record:

    psuedo-code:

    IF CurrentBeginPoint - PreviousBeginPoint = 0.01

    THEN CurrentGroupId = PreviousGroupId

    ELSE CurrentGroupId = PreviousGroupId + 1

    SET GroupId = CurrentGroupId

    PreviousBeginPoint = CurrentEndPoint

    PreviousGroupId = CurrentGroupId

    FETCH NEXT record

    Is there a better way of doing this?

    Thank you.

  • Essentially what you need is a running total. You can read about that in an article from Jeff Moden here: http://www.sqlservercentral.com/articles/T-SQL/68467/

    So something along this should for for you:

    --Determine where each group should start

    WITH CTE AS (SELECT SegmentId, BeginPoint, EndPoint, ROW_NUMBER() OVER (ORDER BY BeginPoint) AS RN FROM RoadSegment WHERE SegmentRank >= 90)

    SELECT

    C.SegmentId,

    C.RN,

    CASE WHEN C.BeginPoint = (P.BeginPoint + 0.01) THEN 0 ELSE 1 END AS NewGroup

    INTO #Pass1

    FROM CTE C

    LEFT JOIN CTE P

    ON (C.RN - 1) = P.RN;

    --Setup for the "Quirky Update"

    ALTER TABLE #Pass1 ALTER COLUMN RN INT NOT NULL;

    ALTER TABLE #Pass1 ADD GroupNum INT NULL;

    ALTER TABLE #Pass1 ADD CONSTRAINT Pass1_PK PRIMARY KEY CLUSTERED (RN ASC);

    DECLARE

    @Anchor INT,

    @SUM INT;

    SET @SUM = 0;

    --Perform the "Quirky Update" (i.e. Running Total)

    UPDATE #Pass1

    SET

    @Anchor = RN,

    @Sum = GroupNum = @SUM + NewGroup

    FROM #Pass1 WITH (TABLOCKX)

    OPTION (MAXDOP 1);

    --Store the group number in the source table

    UPDATE RS

    SET

    GroupID = CAST(P1.GroupNum AS NVARCHAR(25))

    FROM RoadSegment RS

    INNER JOIN #Pass1 P1

    ON RS.SegmentId = P1.SegmentId;

    --Cleanup

    DROP TABLE #Pass1;

    --Display the results

    SELECT

    *

    FROM RoadSegment

    ORDER BY

    BeginPoint;

    I did it in a temporary table since the "Quirky Update" method has very specific requirements to make it work correctly, and then updated your table with the results. (Read and understand the article.)

    I noticed that your GroupID column is a NVARCHAR, is there a reason you aren't using an INT?

    BTW: Using the FLOAT type for your Begin/End Point probably isn't the best idea since it in an imprecise data type, in that what you put in is not always what you will get out. (Precision can be lost.) I would recommend using the DECIMAL/NUMERIC type instead.

  • This uses what is called the quirky update by some. It is one way of doing this, but the code below is is not completely correct, as I am sure Jeff will point out when he sees it.

    I have left out several key components because of the size of the data set, plus it is getting late. I just wanted to demonstrate one way of doing this.

    Read this article Solving the Running Total and Ordinal Rank Problems (Rewritten)[/url] and the discussions to understand this method better, especially if you intend to use this method in a production environment.

    drop table [dbo].[RoadSegment];

    go

    CREATE TABLE [dbo].[RoadSegment](

    [SegmentId] [nvarchar](25) NOT NULL,

    [BeginPoint] [float] NOT NULL,

    [EndPoint] [float] NOT NULL,

    [SegmentRank] [float] NOT NULL,

    [GroupId] [int] NULL -- change to int from nvarchar

    );

    GO

    INSERT INTO RoadSegment (SegmentId, BeginPoint, [EndPoint], SegmentRank) VALUES

    ('123', 0.00, 0.10, 90),

    ('124', 0.01, 0.11, 92),

    ('125', 0.02, 0.12, 95),

    ('126', 0.03, 0.13, 84),

    ('127', 0.04, 0.14, 95),

    ('128', 0.05, 0.15, 95);

    GO

    declare @SeqRank float = 90,

    @GrpId int = 0;

    update dbo.RoadSegment set

    GroupId = null; -- ensure GroupId is null before processing

    with RoadSegments as (

    select

    SegmentId,

    BeginPoint,

    BeginPoint + 0.01 as NextPoint,

    EndPoint,

    SegmentRank,

    GroupId

    from

    dbo.RoadSegment

    where

    SegmentRank >= @SeqRank

    ),

    GroupedSegments as (

    select

    rs1.SegmentId,

    rs1.BeginPoint,

    rs1.NextPoint,

    rs1.EndPoint,

    rs1.SegmentRank,

    rs1.GroupId,

    rs2.BeginPoint as BeginPoint2

    from

    RoadSegments rs1

    left outer join RoadSegments rs2

    on rs1.BeginPoint = rs2.NextPoint

    )

    update dbo.RoadSegment set

    @GrpId = GroupId = @GrpId + case when gs.BeginPoint2 is null then 1 else 0 end

    from

    dbo.RoadSegment rs

    inner join GroupedSegments gs

    on (rs.SegmentId = gs.SegmentId)

    ;

    select

    *

    from

    dbo.RoadSegment

    where

    GroupId is not null

    order by

    GroupId,

    SegmentId

    ;

  • Thank you for the code and the article references. I am excited to read and understand both.

    (I was a bit too casual in my table definition. I am using an int for the GroupId, and per the advice here, i will change the data type of the begin/end points).

    Thanks again.

  • CELKO (3/31/2011)


    Cast the segment_id to integer and do the subtraction:

    ('123', 0.00, 0.10, 90, ('123- 1))

    ('124', 0.01, 0.11, 92, ('124- 2))

    ('125', 0.02, 0.12, 95, ('125- 3))

    ('127', 0.04, 0.14, 95, ('127- 4))

    ('128', 0.05, 0.15, 95, ('128- 5))'

    Celko,

    I had thought about suggesting that method, but I made the assumption that I couldn't guarantee that the segment_id would always be an integer and always be in begin point order so I went with the safer, but less performant method.

  • It was just coincidence / convenience regarding the order of segment IDs in the example I created. I think the method proposed by Celko wouldn't work out for that reason.

  • First, thank you for your time. I do appreciate it.

    The segment id is a varchar(24), and its value is assigned randomly-ish. There are some pertinent bits of info in the first 8 or so characters of the Id to help identify the road (county number, region, district, etc), but with the remaining characters there is not a particular sequence that we could count on to 'order' the segments.

    I suspect adding another ROW_NUMBER() as an 'ordering surrogate' to view would be the way to go - but certainly not sure what that would look like.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply