March 27, 2008 at 12:13 pm
I have a bit of an obscure problem.
I have a table that looks like this:
Filename range_start range_End
file1 A B
file1 B C
file1 C D
file1 E F
file1 F G
file2 A B
file2 B C
file2 D F
file2 E F
file2 F G
...
I want to use a query to aggregate these records so they look like this:
file1 A D
file1 E G
file2 A C
file2 D G
Can anyone help me? I've been hitting my head against a brick wall for days now.
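For anyone who wants to reproduce this, here is a minimal setup script for the sample data above (the MyTable name and column types are assumptions, chosen to match the reply below):
--Hypothetical setup for the sample data in this post
CREATE TABLE MyTable (Filename VARCHAR(10), range_start VARCHAR(10), range_End VARCHAR(10))

INSERT MyTable (Filename, range_start, range_End)
SELECT 'file1','A','B' UNION ALL
SELECT 'file1','B','C' UNION ALL
SELECT 'file1','C','D' UNION ALL
SELECT 'file1','E','F' UNION ALL
SELECT 'file1','F','G' UNION ALL
SELECT 'file2','A','B' UNION ALL
SELECT 'file2','B','C' UNION ALL
SELECT 'file2','D','F' UNION ALL
SELECT 'file2','E','F' UNION ALL
SELECT 'file2','F','G'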
March 27, 2008 at 12:31 pm
Try this
SELECT s1.Filename,
       s1.range_start,
       MIN(t1.range_End) AS range_End
FROM MyTable s1
INNER JOIN MyTable t1 ON s1.Filename=t1.Filename
       AND s1.range_start <= t1.range_End
       --t1.range_End must be a true end point: no other interval starts at or before it and extends past it
       AND NOT EXISTS(SELECT * FROM MyTable t2
                      WHERE t1.Filename=t2.Filename
                      AND t1.range_End >= t2.range_start
                      AND t1.range_End < t2.range_End)
--s1.range_start must be a true start point: not covered by any interval that starts earlier
WHERE NOT EXISTS(SELECT * FROM MyTable s2
                 WHERE s1.Filename=s2.Filename
                 AND s1.range_start > s2.range_start
                 AND s1.range_start <= s2.range_End)
GROUP BY s1.Filename,s1.range_start
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537
March 27, 2008 at 12:57 pm
Is it wrong to use code you don't understand? 😉
It'll take me a little while to figure out exactly what this does, but it does indeed work.
Thanks so much!
April 9, 2008 at 10:17 am
Hi. I'm presently using this solution to my problem and it does indeed work. The only drawback is that it takes a long time.
This code bit is the last step in a chain of 16 queries. The preceding 15 take a total of < 1 min. This single script segment takes more than 3 minutes.
I'm working with an input dataset of almost 900,000 rows!
Is there a way to achieve the same result more quickly?
April 9, 2008 at 11:02 am
If you don't mind using a temp table - you can cut that to 15 seconds.
[font="Courier New"]--Set up Test scenario
IF OBJECT_ID('files') IS NOT NULL DROP TABLE files --won't error on the first run
GO
CREATE TABLE files([file_name] VARCHAR(10), range_start VARCHAR(10), Range_end VARCHAR(10))
INSERT files([file_name],range_start)
SELECT TOP 1000000
'file'+CAST(CAST(RAND(checksum(NEWID()))*60000 AS INT) AS VARCHAR(10)),
CHAR(65+CAST(RAND(checksum(NEWID()))*25 AS INT))
FROM sys.all_columns sc1, sys.all_columns sc2
UPDATE files
SET range_end =CHAR(ASCII(range_start)+1)
GO
--start the actual work
--first some variables
DECLARE @g DATETIME --just to test how fast
SET @g=GETDATE();
DECLARE @groupnum INT
SET @groupnum=0;
DECLARE @currfile VARCHAR(10)
SET @currfile='';
DECLARE @currrange VARCHAR(10)
SET @currrange='';
--create the temp table
SELECT DISTINCT *,0 AS groupnum
INTO #tmpfiles
FROM files
CREATE UNIQUE CLUSTERED INDEX pk_files ON #tmpfiles(FILE_NAME,range_start)
--walk the rows in clustered-index order, carrying the previous row's values in the variables
UPDATE #tmpfiles
SET @groupnum=groupnum=CASE WHEN @currfile=FILE_NAME AND @currrange=range_start THEN @groupnum --contiguous with previous row: same group
                            WHEN NOT(@currfile=FILE_NAME) THEN 1 --new file: restart numbering
                            ELSE @groupnum+1 END, --gap within the same file: new group
    @currfile=FILE_NAME,
    @currrange=range_end
FROM #tmpfiles WITH (tablock, INDEX(pk_files))
--just to see how fast that was
SELECT DATEDIFF(ms,@g,GETDATE()) --9076ms
--show the results
SELECT FILE_NAME,groupnum, MIN(range_start) AS rstart, MAX(range_end) AS rend
FROM #tmpfiles
GROUP BY FILE_NAME,groupnum
ORDER BY FILE_NAME,groupnum
--just to see how fast that was up until now --14983ms
SELECT DATEDIFF(ms,@g,GETDATE())
SELECT COUNT(*) FROM #tmpfiles --729675 distinct values
DROP TABLE #tmpfiles
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 9, 2008 at 3:15 pm
Thank you for your thoughts and effort on this.
Your answer seems oh so close and it certainly is fast! However, it doesn't quite work the way I need it to.
Here is a subset of the data that I'm trying to aggregate:

RecNo  UWI    Seq  PType        OpenerTop  OpenerBase  GroupNum
324    File1  1    Perforation  836.7      838.2       0
325    File1  2    Perforation  839.7      840.6       0
329    File1  6    Perforation  841.8      842.7       0
326    File1  3    Perforation  841.9      842.8       0

What I want is:

File1  836.7  838.2  1
File1  839.7  840.6  2
File1  841.8  842.7  3
File1  841.9  842.8  3

Which would aggregate to:

File1  836.7  838.2
File1  839.7  840.6
File1  841.8  842.8

What I get is:

File1  836.7  838.2  1
File1  839.7  840.6  2
File1  841.8  842.7  3
File1  841.9  842.8  4

The last two intervals overlap (841.9 falls inside 841.8-842.7), so they should share a group number and be merged. Is there any way to modify this query so I get what I want?
Thanks for your help on this.
April 9, 2008 at 7:54 pm
You went and changed the specs on me! Sorry to say - by having to do range checks - it gets a little uglier. You unfortunately lose a little performance in the process as well... we're up to 30 seconds.
[font="Courier New"]--Set up Test scenario
--DROP TABLE files
--GO
--CREATE TABLE files([file_name] VARCHAR(10), range_start decimal(10,1), Range_end decimal(10,1))
--
--INSERT files([file_name],range_start)
--SELECT TOP 1000000
-- 'file'+CAST(CAST(RAND(checksum(NEWID()))*60000 AS INT) AS VARCHAR(10)),
-- CAST(RAND(checksum(NEWID()))*200+500.0 AS decimal(10,1))
--FROM sys.all_columns sc1, sys.all_columns sc2
--UPDATE files
--SET range_end =range_start+RAND(checksum(NEWID()))*3
GO
--start the actual work
--first some variables
DECLARE @g DATETIME --just to test how fast
SET @g=GETDATE();
DECLARE @groupnum INT
SET @groupnum=0;
DECLARE @currfile VARCHAR(10)
SET @currfile='';
DECLARE @currrange_start DECIMAL(10,1) --DECIMAL to match the column type, avoiding implicit conversions
SET @currrange_start=0;
DECLARE @currrange_end DECIMAL(10,1)
SET @currrange_end=0;
--create the temp table
SELECT DISTINCT *,0 AS groupnum
INTO #tmpfiles
FROM files
CREATE UNIQUE CLUSTERED INDEX pk_files ON #tmpfiles(FILE_NAME,range_start,range_end)
SELECT DATEDIFF(ms,@g,GETDATE())
UPDATE #tmpfiles
SET @groupnum=groupnum= CASE WHEN @currfile=FILE_NAME AND range_start BETWEEN @currrange_start AND @currrange_end THEN @groupnum
WHEN NOT(@currfile=FILE_NAME) THEN 1
ELSE @groupnum+1 END,
@currfile=FILE_NAME,
@currrange_start= CASE WHEN NOT(@currfile=FILE_NAME) THEN range_start ELSE @currrange_start END,
@currrange_end= CASE WHEN NOT(@currfile=FILE_NAME) OR @currrange_end<range_end THEN range_end
ELSE @currrange_end END
FROM #tmpfiles WITH (tablock, INDEX(pk_files))
SELECT DATEDIFF(ms,@g,GETDATE())
CREATE NONCLUSTERED INDEX ixtmpfiles ON #tmpfiles (FILE_NAME,groupnum) include (range_start,range_end)
SELECT DATEDIFF(ms,@g,GETDATE())
--show the results
SELECT FILE_NAME,groupnum, MIN(range_start) AS rstart, MAX(range_end) AS rend
FROM #tmpfiles
GROUP BY FILE_NAME,groupnum
ORDER BY FILE_NAME,groupnum
SELECT DATEDIFF(ms,@g,GETDATE())
SELECT COUNT(*) FROM #tmpfiles
DROP TABLE #tmpfiles
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 10, 2008 at 8:45 am
Apparently the answer is in the details. Thanks again for spending time on this. Unfortunately, it's still not quite there.
The code seems to work correctly if I use a WHERE clause and specify a single file name. However, when I run it on the whole table, most of the rows come back with a groupnum of 1 even when they should have a different group number, which means they then get aggregated when they shouldn't.
Am I correct in thinking that the script is expecting to update the records in the order specified by the index - sort of like walking a cursor where you can retrieve records in a defined order? If that thinking is correct, it doesn't appear to be behaving that way.
The records I'm processing represent perforated intervals of an oil or gas well. Briefly (and leaving out many details) - a well gets drilled and then "completed" (tubing is pushed down the well to seal off the sides of the well so that many bad things don't happen). Once the company determines what depth intervals of the well have the gas or oil, they "perforate" those intervals (punch holes in the tubing) in order to produce the oil or gas.
Over time a well may have many of these perforation events. To complicate matters, a company can also seal off previously perforated intervals, re-perforate the same or overlapping intervals etc. It can get very complex with respect to opening and closing intervals over time.
The table that I'm starting with contains these event records sorted by date for each file (the "file" is actually the identifier for the well. I've translated that to a file name to protect confidential data).
My goal is to determine for each well (file) what intervals are actually presently open. We have certain wells which can have 60 or 70 events and actually don't have any presently open!
I've processed the opening and closing events to the point where I'm only left with the presently open intervals. The only thing left to do - which you've been helping me with - is aggregating the overlapping & contiguous intervals for each well.
Interesting problem, no?
If it would help and you're interested, I could post a zip of a CSV of the table you create as #tmpfiles. It's about 6Mb zipped.
April 10, 2008 at 9:20 am
A few specifics:
- The group numbers reset when you go from one filename to another, so every filename should have a groupnum 1. That's just to make sure we can tell the query to group by filename AND groupnum. My test data was definitely not aggregating multiple filenames together, because my understanding was that they're not supposed to be; grouping by BOTH is the key there. Now - if you'd prefer the groupnums to just keep incrementing forever - all you would have to do is make the following change:
WHEN NOT(@currfile=FILE_NAME) THEN 1 --<-- replace the 1 with @groupnum+1
- You're right - the clustered index is key. Keep in mind that the clustered index is required, since it's the only way to consistently force the order into this. Even switching to a non-clustered index would make the technique unreliable.
Once the order is there, then yes - it's a fancy, high-speed way to walk the dataset in order, in a running totals/aggregates fashion. It's kind of like what a cursor would do, except that it's all one big operation instead of 900K individual update operations, which is why it takes a fraction of the time a cursor would. We use the variables to compare the current row's values to the previous row's, and make decisions based on that.
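For anyone new to it, here is a minimal sketch of the pattern with made-up table and column names (none of these are from this thread):
--Illustrative only: per-group running totals via an ordered UPDATE
CREATE TABLE #work (grp INT, ord INT, val INT, running_total INT)
CREATE UNIQUE CLUSTERED INDEX ix_work ON #work (grp, ord) --this index defines the walk order

INSERT #work (grp, ord, val, running_total)
SELECT 1,1,10,0 UNION ALL
SELECT 1,2,5,0 UNION ALL
SELECT 2,1,7,0

DECLARE @prevgrp INT, @running INT
SELECT @prevgrp=NULL, @running=0

UPDATE t
SET @running = running_total = CASE WHEN grp=@prevgrp THEN @running+val ELSE val END, --extend or restart the running total
    @prevgrp = grp --carry this row's group into the next row's comparison
FROM #work t WITH (TABLOCK, INDEX(ix_work)) --force the update to walk the clustered index order

SELECT * FROM #work ORDER BY grp, ord --running_total comes back as 10, 15, 7
DROP TABLE #work
The same three pieces appear in the real code above: a clustered index to define the order, variables carrying the previous row's values, and the hint to force the scan down that index.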
The technique I'm using is detailed here - with a rather lively follow-on discussion:
http://www.sqlservercentral.com/articles/Advanced+Querying/61716/
Take a look through the data and see if you can pinpoint things that don't work for you in what I'm doing. We'd have to formalize the actual business rules for what belongs together vs. what doesn't, since there still seems to be some disconnect.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 10, 2008 at 10:27 am
I've got it!
The problem was in the order of the SET assignments.
Because the CASE expressions for @currrange_start and @currrange_end rely on @currfile still holding the previous row's value, assigning @currfile=FILE_NAME before those expressions run was what was causing the problem. Moving that assignment to the end fixes it.
Here's what it should look like:
SET @groupnum=groupnum= CASE WHEN @currfile=FILE_NAME AND range_start BETWEEN @currrange_start AND @currrange_end THEN @groupnum
                             WHEN NOT(@currfile=FILE_NAME) THEN 1
                             ELSE @groupnum+1 END,
    @currrange_start= CASE WHEN NOT(@currfile=FILE_NAME) THEN range_start ELSE @currrange_start END,
    @currrange_end= CASE WHEN NOT(@currfile=FILE_NAME) OR @currrange_end<range_end THEN range_end
                         ELSE @currrange_end END,
    @currfile=FILE_NAME --now assigned last, after the CASEs that test it
This gives me exactly the same answer as the other code and completes in about 45 seconds!
This is awesome!
Thank you SO much for your help!
One more question if you don't mind... What is the purpose of the WITH (tablock, INDEX(pk_tempfiles)) statement? I know that it's a table hint but I don't understand what the performance impact/benefit is.
April 10, 2008 at 10:36 am
FYI...
The total time for my process now has gone from ~ 4:30 to ~ 1:00!
Now that's performance!
April 10, 2008 at 10:53 am
Michael (4/10/2008)
One more question if you don't mind... What is the purpose of the WITH (tablock, INDEX(pk_tempfiles)) statement? I know that it's a table hint but I don't understand what the performance impact/benefit is.
Sure - this is the "forcing the order" I mentioned earlier, and that you were picking up on. The table lock is icing on the cake (since that will happen anyway during a full-table update), but the INDEX(pk_tempfiles) hint is telling/commanding the optimizer to use the order set up by the pk_tempfiles clustered index. It's how we make sure it doesn't get creative and use whatever order it likes.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 10, 2008 at 10:54 am
Michael (4/10/2008)
FYI...The total time for my process now has gone from ~ 4:30 to ~ 1:00!
Now that's performance!
Glad it helped!
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?