September 17, 2012 at 7:18 am
Hi,
Many Thanks for your help in advance, this one has my stumped. I would like to concatenate the text fields of x numbers of rows if the rows depths overlap one anothers.
Example table:
CREATE TABLE [dbo].[TEMP1](
[well_id] [int] NOT NULL,
[Top1] [real] NOT NULL,
[BASE1] [real] not NULL,
[comment] [nvarchar] (50) not null
) ON [PRIMARY]
GO
INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (5443, 2665, 2710, 'A')
INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (5732, 10, 20, 'A')
INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (5732, 15, 25, 'b')
INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (5732, 0, 0, 'A')
INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (6679, 34.66, 44.45, 'A')
INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (6679, 44.25, 54.25, 'Ac')
INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (6679, 54.25, 67.31, 'A')
INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (6679, 93.44, 106.5, 'A')
INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (7201, 1833, 2100, 'A')
INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (7201, 2100, 2305, 'A')
INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (7201, 2305, 2855, 'A')
I would like to find all the wells with the same id, that have overlapping top and base depths, e.g. row 2; 10 -20 overlaps with row 3; 15-25 in this case I would like to merge the comments from those two rows into a new record with the maximum range of those two rows populated (e.g. 5732,10,25, Ab). Any rows from the same well that do not overlap can simply be put into the new table.
Can anyone give me some clues as to how I should go about doing this.
Many Thanks,
Oliver
September 17, 2012 at 7:47 am
Not very efficient, but should give the correct results
SELECT s1.well_id,
s1.Top1,
MIN(t1.BASE1) AS BASE1,
(SELECT DISTINCT c.comment AS "text()"
FROM TEMP1 c
WHERE c.well_id = s1.well_id
AND c.Top1 BETWEEN s1.Top1 AND MIN(t1.BASE1)
AND c.BASE1 BETWEEN s1.Top1 AND MIN(t1.BASE1)
ORDER BY c.comment
FOR XML PATH(''),TYPE).value('.','VARCHAR(100)') AS comments
FROM TEMP1 s1
INNER JOIN TEMP1 t1 ON t1.well_id=s1.well_id
AND s1.Top1 <= t1.BASE1
AND NOT EXISTS(SELECT * FROM TEMP1 t2
WHERE t2.well_id=t1.well_id
AND t1.BASE1 >= t2.Top1 AND t1.BASE1 < t2.BASE1)
WHERE NOT EXISTS(SELECT * FROM TEMP1 s2
WHERE s2.well_id=s1.well_id
AND s1.Top1 > s2.Top1 AND s1.Top1 <= s2.BASE1)
GROUP BY s1.well_id,s1.Top1
ORDER BY s1.well_id,s1.Top1;
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537September 17, 2012 at 7:58 am
wow, is amazing. thank you so much.
I will now try and understand what you have done here. I dont really understand how you are comparing more than one row at a time, bulk will break it down and see how I get on.
Many Thanks for your help,
Oliver
September 17, 2012 at 8:44 am
This yields slightly different results to Mark's solution - I'm assuming that you want the comments from the middlemen in a group:
-- identify rows with/without overlaps
;WITH SplitData AS (
SELECT a.*,
Parent = CASE WHEN EXISTS (
SELECT 1
FROM TEMP1 b
WHERE b.well_id = a.well_id
AND (b.Top1 BETWEEN a.Top1 AND a.Base1 OR b.Base1 BETWEEN a.Top1 AND a.Base1)
AND NOT (b.Top1 = a.Top1 AND b.Base1 = a.Base1)
) THEN 1 ELSE 0 END
FROM TEMP1 a
),
-- sequence the rows for each well_id, partitioned by whether or not there's an overlap
SequencedData AS (
SELECT *,
seq = ROW_NUMBER() OVER (PARTITION BY a.well_id, a.parent ORDER BY a.Top1, a.Base1)
FROM SplitData a
),
rCTE AS (
SELECT -- anchor: no subsequent overlaps, or first in a sequence
level = 1, seq, Parent,
well_id, Top1, BASE1, comment,
Newcomment = CAST(ISNULL(comment,'') AS VARCHAR(100)),
NewTop1 = Top1,
NewBase1 = BASE1
FROM SequencedData
WHERE seq = 1
UNION ALL
SELECT
level = lr.level + 1, tr.seq, tr.Parent,
tr.well_id, tr.Top1, tr.BASE1, tr.comment,
Newcomment = CAST(lr.Newcomment + ISNULL(tr.comment,'') AS VARCHAR(100)),
NewTop1 = CASE WHEN lr.NewTop1 < tr.Top1 THEN lr.NewTop1 ELSE tr.Top1 END,
NewBase1 = CASE WHEN lr.NewBase1 > tr.BASE1 THEN lr.NewBase1 ELSE tr.BASE1 END
FROM SequencedData tr
INNER JOIN rCTE lr
ON lr.well_id = tr.well_id AND tr.seq > lr.seq
AND (tr.Top1 BETWEEN lr.Top1 AND lr.Base1 OR tr.Base1 BETWEEN lr.Top1 AND lr.Base1)
)
SELECT well_id, NewTop1, NewBase1, Newcomment
FROM (
SELECT *, Maxlevel = MAX(level) OVER(PARTITION BY well_id)
FROM rCTE
) d
WHERE Parent = 0
OR level = Maxlevel
ORDER BY well_id, level
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
September 17, 2012 at 8:53 am
Thank you very much for this. I am slowly working through this.
Currently though I dont class two rows one row ending and the other row starting on the same number as an overlap. Therefore do I need to change your statement from between to > and <
Many Thanks again.
Oliver
September 17, 2012 at 9:00 am
I changed the code to this to sort that change. Is there any problem with this?
-- identify rows with/without overlaps
;WITH SplitData AS (
SELECT a.*,
Parent = CASE WHEN EXISTS (
SELECT 1
FROM TEMP1 b
WHERE b.well_id = a.well_id
AND ((b.Top1 > a.Top1 AND b.Top1 < a.Base1) OR (b.Base1 > a.Top1 AND b.Base1 < a.Base1))
AND NOT (b.Top1 = a.Top1 or b.Base1 = a.Base1)
) THEN 1 ELSE 0 END
FROM TEMP1 a
),
-- sequence the rows for each well_id, partitioned by whether or not there's an overlap
SequencedData AS (
SELECT *,
seq = ROW_NUMBER() OVER (PARTITION BY a.well_id, a.parent ORDER BY a.Top1, a.Base1)
FROM SplitData a
),
rCTE AS (
SELECT -- anchor: no subsequent overlaps, or first in a sequence
level = 1, seq, Parent,
well_id, Top1, BASE1, comment,
Newcomment = CAST(ISNULL(comment,'') AS VARCHAR(100)),
NewTop1 = Top1,
NewBase1 = BASE1
FROM SequencedData
WHERE seq = 1
UNION ALL
SELECT
level = lr.level + 1, tr.seq, tr.Parent,
tr.well_id, tr.Top1, tr.BASE1, tr.comment,
Newcomment = CAST(lr.Newcomment + ISNULL(tr.comment,'') AS VARCHAR(100)),
NewTop1 = CASE WHEN lr.NewTop1 < tr.Top1 THEN lr.NewTop1 ELSE tr.Top1 END,
NewBase1 = CASE WHEN lr.NewBase1 > tr.BASE1 THEN lr.NewBase1 ELSE tr.BASE1 END
FROM SequencedData tr
INNER JOIN rCTE lr
ON lr.well_id = tr.well_id AND tr.seq > lr.seq
AND (tr.Top1 BETWEEN lr.Top1 AND lr.Base1 OR tr.Base1 BETWEEN lr.Top1 AND lr.Base1)
)
SELECT well_id, NewTop1, NewBase1, Newcomment
FROM (
SELECT *, Maxlevel = MAX(level) OVER(PARTITION BY well_id)
FROM rCTE
) d
WHERE Parent = 0
OR level = Maxlevel
ORDER BY well_id, level
Many Thanks
September 17, 2012 at 9:31 am
oliver.morris (9/17/2012)
I changed the code to this to sort that change. Is there any problem with this?
-- identify rows with/without overlaps
;WITH SplitData AS (
SELECT a.*,
Parent = CASE WHEN EXISTS (
SELECT 1
FROM TEMP1 b
WHERE b.well_id = a.well_id
AND ((b.Top1 > a.Top1 AND b.Top1 < a.Base1) OR (b.Base1 > a.Top1 AND b.Base1 < a.Base1))
AND NOT (b.Top1 = a.Top1 or b.Base1 = a.Base1)
) THEN 1 ELSE 0 END
FROM TEMP1 a
),
-- sequence the rows for each well_id, partitioned by whether or not there's an overlap
SequencedData AS (
SELECT *,
seq = ROW_NUMBER() OVER (PARTITION BY a.well_id, a.parent ORDER BY a.Top1, a.Base1)
FROM SplitData a
),
rCTE AS (
SELECT -- anchor: no subsequent overlaps, or first in a sequence
level = 1, seq, Parent,
well_id, Top1, BASE1, comment,
Newcomment = CAST(ISNULL(comment,'') AS VARCHAR(100)),
NewTop1 = Top1,
NewBase1 = BASE1
FROM SequencedData
WHERE seq = 1
UNION ALL
SELECT
level = lr.level + 1, tr.seq, tr.Parent,
tr.well_id, tr.Top1, tr.BASE1, tr.comment,
Newcomment = CAST(lr.Newcomment + ISNULL(tr.comment,'') AS VARCHAR(100)),
NewTop1 = CASE WHEN lr.NewTop1 < tr.Top1 THEN lr.NewTop1 ELSE tr.Top1 END,
NewBase1 = CASE WHEN lr.NewBase1 > tr.BASE1 THEN lr.NewBase1 ELSE tr.BASE1 END
FROM SequencedData tr
INNER JOIN rCTE lr
ON lr.well_id = tr.well_id AND tr.seq > lr.seq
AND (tr.Top1 BETWEEN lr.Top1 AND lr.Base1 OR tr.Base1 BETWEEN lr.Top1 AND lr.Base1)
)
SELECT well_id, NewTop1, NewBase1, Newcomment
FROM (
SELECT *, Maxlevel = MAX(level) OVER(PARTITION BY well_id)
FROM rCTE
) d
WHERE Parent = 0
OR level = Maxlevel
ORDER BY well_id, level
Many Thanks
Hi Oliver
It looks ok to me and works in the way you anticipate on the small sample data set you've provided. Test it thoroughly against a more substantial data set - which only you have - before committing it to production.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
September 17, 2012 at 10:00 am
Hi Mark,
I have spent the last couple of hours looking at your code and I cant work it out, if you could place a few text pointers around that would be great.
A few things, why the group by (min base) and why s1.top <= t1.base
I am sorry I am stupid but there are two many steps removed for me to learn how you came up with this.
Sorry to be a pain.
Many Thanks,
Oliver
September 17, 2012 at 10:47 am
This is a just couple of queries rolled into one.
This first query finds the start points of the intervals by finding all start points that don't lie inside an earlier interval.
SELECT s1.well_id,
s1.Top1
FROM TEMP1 s1
WHERE NOT EXISTS(SELECT * FROM TEMP1 s2
WHERE s2.well_id=s1.well_id
AND s1.Top1 > s2.Top1 AND s1.Top1 <= s2.BASE1)
This is the same idea but for end points instead.
SELECT t1.well_id,
t1.BASE1
FROM TEMP1 t1
WHERE NOT EXISTS(SELECT * FROM TEMP1 t2
WHERE t2.well_id=t1.well_id
AND t1.BASE1 >= t2.Top1 AND t1.BASE1 < t2.BASE1)
Finally, the start and end points are paired up by joining on well_id and finding the earliest end point for a start point, hence the GROUP BY and MIN. You could equally do this pairing up using ROW_NUMBER.
The "FOR XML" bit is a technique to concatenate rows, you'll find plenty of other examples of it in posts on SSC.
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537September 17, 2012 at 3:56 pm
Hi,
Many thanks for the break down. Sorry to bother you I am just keen to get my head round this.
Taken another look at it this evening. I made a change to is to ensure that the overlapping numbers must be within the range as opposed to <= or >=
SELECT s1.well_id,
s1.Top1,
MIN(t1.BASE1) AS BASE1,
(SELECT distinct c.comment AS "text()"
FROM TEMP1 c
WHERE c.well_id = s1.well_id
AND c.Top1 BETWEEN s1.Top1 AND MIN(t1.BASE1)
AND c.BASE1 BETWEEN s1.Top1 AND MIN(t1.BASE1)
ORDER BY c.comment
FOR XML PATH(''),TYPE).value('.','VARCHAR(400)') AS comments
FROM TEMP1 s1
INNER JOIN TEMP1 t1 ON t1.well_id=s1.well_id
AND s1.Top1 < t1.BASE1
AND NOT EXISTS(SELECT * FROM TEMP1 t2
WHERE t2.well_id=t1.well_id
AND t1.BASE1 > t2.Top1 AND t1.BASE1 < t2.BASE1)
WHERE NOT EXISTS(SELECT * FROM TEMP1 s2
WHERE s2.well_id=s1.well_id
AND s1.Top1 > s2.Top1 AND s1.Top1 < s2.BASE1)
GROUP BY s1.well_id,s1.Top1;
I always find the best way I can understand it is to explain it, so here goes:
SELECT s1.well_id,
s1.Top1
FROM TEMP1 s1
WHERE NOT EXISTS(SELECT * FROM TEMP1 s2
WHERE s2.well_id=s1.well_id
AND s1.Top1 > s2.Top1 AND s1.Top1 < s2.BASE1)
SELECT t1.well_id,
t1.BASE1
FROM TEMP1 t1
WHERE NOT EXISTS(SELECT * FROM TEMP1 t2
WHERE t2.well_id=t1.well_id
AND t1.BASE1 > t2.Top1 AND t1.BASE1 < t2.BASE1)
These two sub queries build up list of ranges where top>base (for Top values) and base < top (For Base Values)
Then these are joined by well_ID and min base to each top in the list - to be honest I am still not clear about how this part works any why you only need to choose min base and not max top. If you would be kind enough to show how it looks with row number that would be really helpful.
Then regarding the XML essentially this runs the same queries again but the XML is concatenating the comments field using the text() node in XML Path.
Really appreciate your help, sorry I am a slow learner.
Oliver
September 17, 2012 at 8:06 pm
My first thought, like ChrisM@Work was to use a rCTE for this, showing as always that great minds think alike! 😛
So, to be different, we'll propose a QU approach, which I have found works interestingly well in many cases for gaps and islands (well, islands anyway).
First the setup data in a temp table, with two additional helper columns added.
CREATE TABLE #TEMP1(
[well_id] [int] NOT NULL,
[Top1] [real] NOT NULL,
[BASE1] [real] not NULL,
[comment] [nvarchar] (50) not null,
[cmin] [real] NULL,
[cmax] [real] NULL,
PRIMARY KEY CLUSTERED ([well_id], [Top1], [BASE1])
) ON [PRIMARY]
GO
INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (5443, 2665, 2710, 'A')
INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (5732, 10, 20, 'A')
INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (5732, 15, 25, 'b')
INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (5732, 0, 0, 'A')
INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (6679, 34.66, 44.45, 'A')
INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (6679, 44.25, 54.25, 'Ac')
INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (6679, 54.25, 67.31, 'A')
INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (6679, 93.44, 106.5, 'A')
INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (7201, 1833, 2100, 'A')
INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (7201, 2100, 2305, 'A')
INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (7201, 2305, 2855, 'A')
Now my proposed solution:
UPDATE #TEMP1
SET @cmin = CASE
WHEN @cwell <> [well_id] THEN [Top1]
WHEN [Top1] > @cmax AND [Top1] > @cmin THEN [Top1]
WHEN [Top1] > @cmin THEN @cmin
ELSE NULL END
,@cmax = CASE
WHEN @cwell <> [well_id] THEN [BASE1]
WHEN [BASE1] > @cmax AND [Top1] < @cmin THEN @cmax
ELSE [BASE1] END
,@cwell = [well_id]
,cmin = @cmin
,cmax = @cmax
SELECT well_id, [Top1]=cmin, [BASE1]=MAX(cmax)
FROM #TEMP1
GROUP BY well_id, cmin
ORDER BY well_id, cmin
SELECT * FROM #TEMP1 ORDER BY well_id, [Top1]
DROP TABLE #TEMP1
Note that it may require a bit of tweaking that would only be uncovered by additional test data. I'd bet it's pretty fast too.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
Viewing 11 posts - 1 through 10 (of 10 total)
You must be logged in to reply to this topic. Login to reply