March 27, 2008 at 12:13 pm
I have a bit of an obscure problem.
I have a table that looks like this:
Filename range_start range_End
file1 A B
file1 B C
file1 C D
file1 E F
file1 F G
file2 A B
file2 B C
file2 D F
file2 E F
file2 F G
...
I want to use a query to aggregate these records so they look like this:
file1 A D
file1 E G
file2 A C
file2 D G
Can anyone help me? I've been hitting my head against a brick wall for days now.
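For anyone who wants to reproduce this, here is a minimal setup script for the sample data above (the MyTable name and column types are assumptions, chosen to match the reply below):
--Hypothetical setup for the sample data in this post
CREATE TABLE MyTable (Filename VARCHAR(10), range_start VARCHAR(10), range_End VARCHAR(10))

INSERT MyTable (Filename, range_start, range_End)
SELECT 'file1','A','B' UNION ALL
SELECT 'file1','B','C' UNION ALL
SELECT 'file1','C','D' UNION ALL
SELECT 'file1','E','F' UNION ALL
SELECT 'file1','F','G' UNION ALL
SELECT 'file2','A','B' UNION ALL
SELECT 'file2','B','C' UNION ALL
SELECT 'file2','D','F' UNION ALL
SELECT 'file2','E','F' UNION ALL
SELECT 'file2','F','G'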
March 27, 2008 at 12:31 pm
Try this
SELECT s1.Filename,
       s1.range_start,
       MIN(t1.range_End) AS range_End
FROM MyTable s1
INNER JOIN MyTable t1 ON s1.Filename=t1.Filename
       AND s1.range_start <= t1.range_End
       --t1.range_End must be a true end point: no other interval starts at or before it and extends past it
       AND NOT EXISTS(SELECT * FROM MyTable t2
                      WHERE t1.Filename=t2.Filename
                      AND t1.range_End >= t2.range_start
                      AND t1.range_End < t2.range_End)
--s1.range_start must be a true start point: not covered by any interval that starts earlier
WHERE NOT EXISTS(SELECT * FROM MyTable s2
                 WHERE s1.Filename=s2.Filename
                 AND s1.range_start > s2.range_start
                 AND s1.range_start <= s2.range_End)
GROUP BY s1.Filename,s1.range_start
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537
March 27, 2008 at 12:57 pm
Is it wrong to use code you don't understand? 😉
It'll take me a little while to figure out exactly what this does, but it does indeed work.
Thanks so much!
April 9, 2008 at 10:17 am
Hi. I'm presently using this solution to my problem and it does indeed work. The only drawback is that it takes a long time.
This code bit is the last step in a chain of 16 queries. The preceding 15 take a total of < 1 min. This single script segment takes more than 3 minutes.
I'm working with an input dataset of almost 900,000 rows!
Is there a way to achieve the same result more quickly?
April 9, 2008 at 11:02 am
If you don't mind using a temp table - you can cut that to 15 seconds.
[font="Courier New"]--Set up Test scenario
IF OBJECT_ID('files') IS NOT NULL DROP TABLE files --won't error on the first run
GO
CREATE TABLE files([file_name] VARCHAR(10), range_start VARCHAR(10), Range_end VARCHAR(10))
INSERT files([file_name],range_start)
SELECT TOP 1000000
'file'+CAST(CAST(RAND(checksum(NEWID()))*60000 AS INT) AS VARCHAR(10)),
CHAR(65+CAST(RAND(checksum(NEWID()))*25 AS INT))
FROM sys.all_columns sc1, sys.all_columns sc2
UPDATE files
SET range_end =CHAR(ASCII(range_start)+1)
GO
--start the actual work
--first some variables
DECLARE @g DATETIME --just to test how fast
SET @g=GETDATE();
DECLARE @groupnum INT
SET @groupnum=0;
DECLARE @currfile VARCHAR(10)
SET @currfile='';
DECLARE @currrange VARCHAR(10)
SET @currrange='';
--create the temp table
SELECT DISTINCT *,0 AS groupnum
INTO #tmpfiles
FROM files
CREATE UNIQUE CLUSTERED INDEX pk_files ON #tmpfiles(FILE_NAME,range_start)
--walk the rows in clustered-index order, carrying the previous row's values in the variables
UPDATE #tmpfiles
SET @groupnum=groupnum=CASE WHEN @currfile=FILE_NAME AND @currrange=range_start THEN @groupnum --contiguous with previous row: same group
                            WHEN NOT(@currfile=FILE_NAME) THEN 1 --new file: restart numbering
                            ELSE @groupnum+1 END, --gap within the same file: new group
    @currfile=FILE_NAME,
    @currrange=range_end
FROM #tmpfiles WITH (tablock, INDEX(pk_files))
--just to see how fast that was
SELECT DATEDIFF(ms,@g,GETDATE()) --9076ms
--show the results
SELECT FILE_NAME,groupnum, MIN(range_start) AS rstart, MAX(range_end) AS rend
FROM #tmpfiles
GROUP BY FILE_NAME,groupnum
ORDER BY FILE_NAME,groupnum
--just to see how fast that was up until now --14983ms
SELECT DATEDIFF(ms,@g,GETDATE())
SELECT COUNT(*) FROM #tmpfiles --729675 distinct values
DROP TABLE #tmpfiles
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 9, 2008 at 3:15 pm
Thank you for your thoughts and effort on this.
Your answer seems oh so close and it certainly is fast! However, it doesn't quite work the way I need it to.
Here is a subset of the data that I'm trying to aggregate:

RecNo  UWI    Seq  PType        OpenerTop  OpenerBase  GroupNum
324    File1  1    Perforation  836.7      838.2       0
325    File1  2    Perforation  839.7      840.6       0
329    File1  6    Perforation  841.8      842.7       0
326    File1  3    Perforation  841.9      842.8       0

What I want is:

File1  836.7  838.2  1
File1  839.7  840.6  2
File1  841.8  842.7  3
File1  841.9  842.8  3

Which would aggregate to:

File1  836.7  838.2
File1  839.7  840.6
File1  841.8  842.8

What I get is:

File1  836.7  838.2  1
File1  839.7  840.6  2
File1  841.8  842.7  3
File1  841.9  842.8  4

The last two intervals overlap (841.9 falls inside 841.8-842.7), so they should share a group number and be merged. Is there any way to modify this query so I get what I want?
Thanks for your help on this.
April 9, 2008 at 7:54 pm
You went and changed the specs on me! Sorry to say - by having to do range checks - it gets a little uglier. You unfortunately lose a little performance in the process as well... we're up to 30 seconds.
[font="Courier New"]--Set up Test scenario
--DROP TABLE files
--GO
--CREATE TABLE files([file_name] VARCHAR(10), range_start decimal(10,1), Range_end decimal(10,1))
--
--INSERT files([file_name],range_start)
--SELECT TOP 1000000
-- 'file'+CAST(CAST(RAND(checksum(NEWID()))*60000 AS INT) AS VARCHAR(10)),
-- CAST(RAND(checksum(NEWID()))*200+500.0 AS decimal(10,1))
--FROM sys.all_columns sc1, sys.all_columns sc2
--UPDATE files
--SET range_end =range_start+RAND(checksum(NEWID()))*3
GO
--start the actual work
--first some variables
DECLARE @g DATETIME --just to test how fast
SET @g=GETDATE();
DECLARE @groupnum INT
SET @groupnum=0;
DECLARE @currfile VARCHAR(10)
SET @currfile='';
DECLARE @currrange_start DECIMAL(10,1) --DECIMAL to match the column type, avoiding implicit conversions
SET @currrange_start=0;
DECLARE @currrange_end DECIMAL(10,1)
SET @currrange_end=0;
--create the temp table
SELECT DISTINCT *,0 AS groupnum
INTO #tmpfiles
FROM files
CREATE UNIQUE CLUSTERED INDEX pk_files ON #tmpfiles(FILE_NAME,range_start,range_end)
SELECT DATEDIFF(ms,@g,GETDATE())
UPDATE #tmpfiles
SET @groupnum=groupnum= CASE WHEN @currfile=FILE_NAME AND range_start BETWEEN @currrange_start AND @currrange_end THEN @groupnum
WHEN NOT(@currfile=FILE_NAME) THEN 1
ELSE @groupnum+1 END,
@currfile=FILE_NAME,
@currrange_start= CASE WHEN NOT(@currfile=FILE_NAME) THEN range_start ELSE @currrange_start END,
@currrange_end= CASE WHEN NOT(@currfile=FILE_NAME) OR @currrange_end<range_end THEN range_end
ELSE @currrange_end END
FROM #tmpfiles WITH (tablock, INDEX(pk_files))
SELECT DATEDIFF(ms,@g,GETDATE())
CREATE NONCLUSTERED INDEX ixtmpfiles ON #tmpfiles (FILE_NAME,groupnum) include (range_start,range_end)
SELECT DATEDIFF(ms,@g,GETDATE())
--show the results
SELECT FILE_NAME,groupnum, MIN(range_start) AS rstart, MAX(range_end) AS rend
FROM #tmpfiles
GROUP BY FILE_NAME,groupnum
ORDER BY FILE_NAME,groupnum
SELECT DATEDIFF(ms,@g,GETDATE())
SELECT COUNT(*) FROM #tmpfiles
DROP TABLE #tmpfiles
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 10, 2008 at 8:45 am
Apparently the answer is in the details. Thanks again for spending time on this. Unfortunately, it's still not quite there.
The code seems to work correctly if I use a WHERE clause and specify a single file name. However, when I run it on the whole table, most of the rows come back with a groupnum of 1 even when they should have a different group number, which means they then get aggregated when they shouldn't.
Am I correct in thinking that the script is expecting to update the records in the order specified by the index - sort of like walking a cursor where you can retrieve records in a defined order? If that thinking is correct, it doesn't appear to be behaving that way.
The records I'm processing represent perforated intervals of an oil or gas well. Briefly (and leaving out many details) - a well gets drilled and then "completed" (tubing is pushed down the well to seal off the sides of the well so that many bad things don't happen). Once the company determines what depth intervals of the well have the gas or oil, they "perforate" those intervals (punch holes in the tubing) in order to produce the oil or gas.
Over time a well may have many of these perforation events. To complicate matters, a company can also seal off previously perforated intervals, re-perforate the same or overlapping intervals etc. It can get very complex with respect to opening and closing intervals over time.
The table that I'm starting with contains these event records sorted by date for each file (the "file" is actually the identifier for the well. I've translated that to a file name to protect confidential data).
My goal is to determine for each well (file) what intervals are actually presently open. We have certain wells which can have 60 or 70 events and actually don't have any presently open!
I've processed the opening and closing events to the point where I'm only left with the presently open intervals. The only thing left to do - which you've been helping me with - is aggregating the overlapping & contiguous intervals for each well.
Interesting problem, no?
If it would help and you're interested, I could post a zip of a CSV of the table you create as #tmpfiles. It's about 6Mb zipped.
April 10, 2008 at 9:20 am
A few specifics:
- The group numbers reset when you go from one filename to another, so every filename should have a groupnum 1. That's just to make sure we can tell the query to group by filename AND groupnum. My test data was definitely not aggregating multiple filenames together, because my understanding was that they're not supposed to be; grouping by BOTH is the key there. Now - if you'd prefer the groupnums to just keep incrementing forever - all you would have to do is make the following change:
WHEN NOT(@currfile=FILE_NAME) THEN 1 --<-- replace the 1 with @groupnum+1
- You're right - the clustered index is key. Keep in mind that the clustered index is required, since it's the only way to consistently force the order into this. Even switching to a non-clustered index would make the technique unreliable.
Once the order is there, then yes - it's a fancy, high-speed way to walk the dataset in order, in a running totals/aggregates fashion. It's kind of like what a cursor would do, except that it's all one big operation instead of 900K individual update operations, which is why it takes a fraction of the time a cursor would. We use the variables to compare the current row's values to the previous row's, and make decisions based on that.
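For anyone new to it, here is a minimal sketch of the pattern with made-up table and column names (none of these are from this thread):
--Illustrative only: per-group running totals via an ordered UPDATE
CREATE TABLE #work (grp INT, ord INT, val INT, running_total INT)
CREATE UNIQUE CLUSTERED INDEX ix_work ON #work (grp, ord) --this index defines the walk order

INSERT #work (grp, ord, val, running_total)
SELECT 1,1,10,0 UNION ALL
SELECT 1,2,5,0 UNION ALL
SELECT 2,1,7,0

DECLARE @prevgrp INT, @running INT
SELECT @prevgrp=NULL, @running=0

UPDATE t
SET @running = running_total = CASE WHEN grp=@prevgrp THEN @running+val ELSE val END, --extend or restart the running total
    @prevgrp = grp --carry this row's group into the next row's comparison
FROM #work t WITH (TABLOCK, INDEX(ix_work)) --force the update to walk the clustered index order

SELECT * FROM #work ORDER BY grp, ord --running_total comes back as 10, 15, 7
DROP TABLE #work
The same three pieces appear in the real code above: a clustered index to define the order, variables carrying the previous row's values, and the hint to force the scan down that index.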
The technique I'm using is detailed here - with a rather lively follow-on discussion:
http://www.sqlservercentral.com/articles/Advanced+Querying/61716/
Take a look through the data and see if you can pinpoint things that don't work for you in what I'm doing. We'd have to formalize the actual business rules for what belongs together vs. what doesn't, since there still seems to be some disconnect.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 10, 2008 at 10:27 am
I've got it!
The problem was in the order of the SET assignments.
Because the CASE expressions for @currrange_start and @currrange_end rely on @currfile still holding the previous row's value, assigning @currfile=FILE_NAME before those expressions run was what was causing the problem. Moving that assignment to the end fixes it.
Here's what it should look like:
SET @groupnum=groupnum= CASE WHEN @currfile=FILE_NAME AND range_start BETWEEN @currrange_start AND @currrange_end THEN @groupnum
                             WHEN NOT(@currfile=FILE_NAME) THEN 1
                             ELSE @groupnum+1 END,
    @currrange_start= CASE WHEN NOT(@currfile=FILE_NAME) THEN range_start ELSE @currrange_start END,
    @currrange_end= CASE WHEN NOT(@currfile=FILE_NAME) OR @currrange_end<range_end THEN range_end
                         ELSE @currrange_end END,
    @currfile=FILE_NAME --now assigned last, after the CASEs that test it
This gives me exactly the same answer as the other code and completes in about 45 seconds!
This is awesome!
Thank you SO much for your help!
One more question if you don't mind... What is the purpose of the WITH (tablock, INDEX(pk_tempfiles)) statement? I know that it's a table hint but I don't understand what the performance impact/benefit is.
April 10, 2008 at 10:36 am
FYI...
The total time for my process now has gone from ~ 4:30 to ~ 1:00!
Now that's performance!
April 10, 2008 at 10:53 am
Michael (4/10/2008)
One more question if you don't mind... What is the purpose of the WITH (tablock, INDEX(pk_tempfiles)) statement? I know that it's a table hint but I don't understand what the performance impact/benefit is.
Sure - this is the "forcing the order" I mentioned earlier, and that you were picking up on. The table lock is icing on the cake (since that will happen anyway during a full-table update), but the INDEX(pk_tempfiles) hint is telling/commanding the optimizer to use the order set up by the pk_tempfiles clustered index. It's how we make sure it doesn't get creative and use whatever order it likes.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 10, 2008 at 10:54 am
Michael (4/10/2008)
FYI...The total time for my process now has gone from ~ 4:30 to ~ 1:00!
Now that's performance!
Glad it helped!
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?