August 3, 2012 at 12:15 pm
I have read through many of the posts on the grouping, but cannot find anything that fits what I need. I have a solution that involves staging tables, % calculations and an aweful lot of work. I wondered if any of you SQL wizards can help me reduce the amount of work required as I need to re-do some reports with a slight change. Here goes.
DECLARE @Temp table (price real, vol int, country varchar(3))
insert into @Temp select 2.00,1,'GB'
insert into @Temp select 2.00,200,'GB'
insert into @Temp select 2.00,270,'GB'
insert into @Temp select 2.00,280,'GB'
insert into @Temp select 2.00,300,'GB'
insert into @Temp select 2.00,800,'GB'
insert into @Temp select 2.00,900,'GB'
insert into @Temp select 2.00,3000,'GB'
insert into @Temp select 4.00,1,'GB'
insert into @Temp select 4.00,1,'GB'
insert into @Temp select 4.00,270,'GB'
insert into @Temp select 4.00,280,'GB'
insert into @Temp select 4.00,300,'GB'
insert into @Temp select 4.00,800,'GB'
insert into @Temp select 4.00,8800,'GB'
insert into @Temp select 4.00,6000,'GB'
insert into @Temp select 9.00,1,'GB'
insert into @Temp select 9.00,1,'GB'
insert into @Temp select 9.00,290,'GB'
insert into @Temp select 9.00,310,'GB'
insert into @Temp select 9.00,350,'GB'
insert into @Temp select 9.00,800,'GB'
insert into @Temp select 9.00,8800,'GB'
insert into @Temp select 9.00,6000,'GB'
This doesn't represent the real data which is approx 1,000,000 rows.
I want to sum the vol (volume) grouped by price and country. BUT I want to exclude the top nn% and bottom nn% from the sum (this is to remove outliers) of each group.
Is this possible without several staging tables? All help greatly appreciated.
TIA
August 3, 2012 at 1:20 pm
dji (8/3/2012)
I want to sum the vol (volume) grouped by price and country. BUT I want to exclude the top nn% and bottom nn% from the sum (this is to remove outliers) of each group.
It could be, if you could elaborate more on the above statment!
1. Can you take one specific example form the sample data and show us your expected output?
2. Where does nn% come from ? who provides it?
3. It would be great if you could give the expected results based off of the sample data you provided.
August 3, 2012 at 1:35 pm
ColdCoffee (8/3/2012)
dji (8/3/2012)
I want to sum the vol (volume) grouped by price and country. BUT I want to exclude the top nn% and bottom nn% from the sum (this is to remove outliers) of each group.It could be, if you could elaborate more on the above statment!
1. Can you take one specific example form the sample data and show us your expected output?
2. Where does nn% come from ? who provides it?
3. It would be great if you could give the expected results based off of the sample data you provided.
I didn't specify originally that the top and bottom rows I am trying to exclude are the highest xx and lowest xx vol column.
Using the data above, assuming the top 1 and bottom 1 row represent the nn% of rows (this is 5% in my true life example). I would expect;
2.00 GB 2750
4.00 GB 7651
9.00 GB 7751
So basically I have the sum(vol) per group by price excluding the top 5% and bottom 5% of rows ordered by vol.
August 3, 2012 at 3:23 pm
Here is a general approach, which I'm sure you can improve.
-- Declare variables to set exclusion ranges
Declare @LowerOutlier decimal(4,2)
Declare @UpperOutlier decimal(4,2)
Set @LowerOutlier=0.05 -- adjust as needed
Set @UpperOutlier=0.95 -- adjust as needed
-- use common table expressions to pre-sort the data
-- "GroupedData" assigns row numbers to each group
-- "RowCounts" finds the numbers of rows for each group
;WITH GroupedData as (
Select Country, Price, Vol, Row_Number() Over(Partition by Country,Price order by Vol) as rn From @Temp
),
RowCounts as (
Select Country, Price, Max(rn) as 'MaxRowCount' From GroupedData
Group by Country, Price)
-- Now perform selections and calculations
-- Adjust data-types as needed to improve accuracy
Select t.Country, t.Price, SUM(t.Vol)
From @Temp t JOIN GroupedData g ON t.Country=g.Country and t.Price=g.Price and t.Vol=g.Vol
JOIN RowCounts rc ON g.Country=rc.Country and g.Price=rc.Price
WHERE
CAST(g.rn as decimal(4,2))/CAST(rc.MaxRowCount as decimal(4,2))< @UpperOutlier AND
CAST(g.rn as decimal(4,2))/CAST(rc.MaxRowCount as decimal(4,2))>@LowerOutlier
Group by t.Country, t.Price
HTH
Elliott
August 3, 2012 at 4:14 pm
dji (8/3/2012)
ColdCoffee (8/3/2012)
dji (8/3/2012)
I want to sum the vol (volume) grouped by price and country. BUT I want to exclude the top nn% and bottom nn% from the sum (this is to remove outliers) of each group.It could be, if you could elaborate more on the above statment!
1. Can you take one specific example form the sample data and show us your expected output?
2. Where does nn% come from ? who provides it?
3. It would be great if you could give the expected results based off of the sample data you provided.
I didn't specify originally that the top and bottom rows I am trying to exclude are the highest xx and lowest xx vol column.
Using the data above, assuming the top 1 and bottom 1 row represent the nn% of rows (this is 5% in my true life example). I would expect;
2.00 GB 2750
4.00 GB 7651
9.00 GB 7751
So basically I have the sum(vol) per group by price excluding the top 5% and bottom 5% of rows ordered by vol.
Okay, I must be missing something very basic, but I'm not sure what you mean by excluding the 5% and bottom 5% or rows ordered by volume. Maybe its because it is Friday.
August 4, 2012 at 2:32 am
Thanks Elliot, trying this out now and will report back 🙂
Lynn, consider the following list of rows (obviously this is far removed from the real data which is 1,000,000+ rows;
price, vol, country
2.0 1 GB
2.0 200 GB
2.0 300 GB
2.0 400 GB
2.0 50000 GB
6.0 2 GB
6.0 200 GB
6.0 300 GB
6.0 400 GB
6.0 69000 GB
I am wanting to sum the vol amount grouped by country and price. BUT I want to do the sum without the top 5% and bottom 5% of rows in each group of price, sorted in vol order. What I am effectively trying to do is use the inner 90% of data excluding the lowest and highest vol rows as they contain freak data that affects further calculations I want to do on the resultant data set.
August 4, 2012 at 6:02 am
try this....it should work where you have a variable number of rows for each group
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[temp]') AND type in (N'U'))
DROP TABLE [dbo].[temp]
GO
CREATE TABLE [dbo].[temp](
[price] [int] NULL,
[vol] [int] NULL,
[country] [varchar](3) NULL
) ON [PRIMARY]
GO
INSERT INTO [dbo].[temp]([price], [vol], [country])
SELECT 1, 1000, N'GB' UNION ALL
SELECT 1, 25, N'GB' UNION ALL
SELECT 1, 26, N'GB' UNION ALL
SELECT 1, 24, N'GB' UNION ALL
SELECT 1, 22, N'GB' UNION ALL
SELECT 1, 15, N'GB' UNION ALL
SELECT 1, 16, N'GB' UNION ALL
SELECT 1, 22, N'GB' UNION ALL
SELECT 1, 24, N'GB' UNION ALL
SELECT 1, 22, N'GB' UNION ALL
SELECT 1, 15, N'GB' UNION ALL
SELECT 1, 16, N'GB' UNION ALL
SELECT 1, 24, N'GB' UNION ALL
SELECT 1, 22, N'GB' UNION ALL
SELECT 1, 15, N'GB' UNION ALL
SELECT 1, 16, N'GB' UNION ALL
SELECT 1, 22, N'GB' UNION ALL
SELECT 1, 16, N'GB' UNION ALL
SELECT 1, 19, N'GB' UNION ALL
SELECT 1, 2, N'GB' UNION ALL
SELECT 2, 144, N'GB' UNION ALL
SELECT 2, 58, N'GB' UNION ALL
SELECT 2, 56, N'GB' UNION ALL
SELECT 2, 55, N'GB' UNION ALL
SELECT 2, 55, N'GB' UNION ALL
SELECT 2, 68, N'GB' UNION ALL
SELECT 2, 91, N'GB' UNION ALL
SELECT 2, 47, N'GB' UNION ALL
SELECT 2, 66, N'GB' UNION ALL
SELECT 2, 66, N'GB' UNION ALL
SELECT 2, 47, N'GB' UNION ALL
SELECT 2, 62, N'GB' UNION ALL
SELECT 2, 55, N'GB' UNION ALL
SELECT 2, 68, N'GB' UNION ALL
SELECT 2, 77, N'GB' UNION ALL
SELECT 2, 66, N'GB' UNION ALL
SELECT 2, 55, N'GB' UNION ALL
SELECT 2, 58, N'GB' UNION ALL
SELECT 2, 59, N'GB' UNION ALL
SELECT 2, 33, N'GB' UNION ALL
SELECT 3, 1000, N'GB' UNION ALL
SELECT 3, 144, N'GB' UNION ALL
SELECT 3, 91, N'GB' UNION ALL
SELECT 3, 77, N'GB' UNION ALL
SELECT 3, 68, N'GB' UNION ALL
SELECT 3, 68, N'GB' UNION ALL
SELECT 3, 66, N'GB' UNION ALL
SELECT 3, 66, N'GB' UNION ALL
SELECT 3, 66, N'GB' UNION ALL
SELECT 3, 62, N'GB' UNION ALL
SELECT 3, 59, N'GB' UNION ALL
SELECT 3, 58, N'GB' UNION ALL
SELECT 3, 58, N'GB' UNION ALL
SELECT 3, 56, N'GB' UNION ALL
SELECT 3, 55, N'GB' UNION ALL
SELECT 3, 55, N'GB' UNION ALL
SELECT 3, 55, N'GB' UNION ALL
SELECT 3, 55, N'GB' UNION ALL
SELECT 3, 47, N'GB' UNION ALL
SELECT 3, 47, N'GB' UNION ALL
SELECT 3, 33, N'GB' UNION ALL
SELECT 3, 26, N'GB' UNION ALL
SELECT 3, 25, N'GB' UNION ALL
SELECT 3, 24, N'GB' UNION ALL
SELECT 3, 24, N'GB' UNION ALL
SELECT 3, 24, N'GB' UNION ALL
SELECT 3, 22, N'GB' UNION ALL
SELECT 3, 22, N'GB' UNION ALL
SELECT 3, 22, N'GB' UNION ALL
SELECT 3, 22, N'GB' UNION ALL
SELECT 3, 22, N'GB' UNION ALL
SELECT 3, 19, N'GB' UNION ALL
SELECT 3, 16, N'GB' UNION ALL
SELECT 3, 16, N'GB' UNION ALL
SELECT 3, 16, N'GB' UNION ALL
SELECT 3, 16, N'GB' UNION ALL
SELECT 3, 15, N'GB' UNION ALL
SELECT 3, 15, N'GB' UNION ALL
SELECT 3, 5, N'GB' UNION ALL
SELECT 3, 2, N'GB'
--=== declare percentage of rows to exclude
declare @rows decimal(5,2)
set @rows = 0.05
;with cte_Rows as
(
SELECT @rows * COUNT(*) AS bottomrows, COUNT(*) - (@rows * COUNT(*) ) AS toprows, price, country
FROM dbo.temp
GROUP BY price, country
)
,
cte_RN as
(
SELECT price, country, vol, ROW_NUMBER() OVER (partition BY price, country
ORDER BY price, country, vol DESC) AS nt
FROM temp
)
SELECT cte_RN.price ,
cte_RN.country ,
SUM( cte_RN.vol )AS VolSum
FROM
cte_RN INNER JOIN cte_rows ON cte_RN.price = cte_rows.price
AND cte_RN.country = cte_rows.country
AND cte_RN.nt > cte_rows.bottomrows
AND cte_RN.nt <= cte_rows.toprows
GROUP BY cte_RN.price ,
cte_RN.country;
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
August 4, 2012 at 10:18 am
have been seeing how the code would run an a bigger data set.
pls feedback on whether this gives what I think you are looking for
regards
USE [tempdb]
GO
--== CREATE SOME DATA TO PLAY WITH
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[temp]') AND type in (N'U'))
DROP TABLE [dbo].[temp]
GO
SELECT TOP 1000000 --=== 1M rows
Price = ABS(CHECKSUM(NEWID())) % 9 + 1,
Vol = ABS(CHECKSUM(NEWID())) % 90000 + 1,
Country = CHAR(Abs(Checksum(Newid())) % 10 + 65)
into temp
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
CROSS JOIN sys.all_columns ac3
CREATE CLUSTERED INDEX [IX1] ON [dbo].[temp]
(
[Country] ASC,
[Price] ASC,
[Vol] DESC
) ON [PRIMARY]
GO
--=== cludge to skew the data for varying rowcounst
DELETE TOP (20) PERCENT
FROM temp
WHERE (Price IN ('2', '7', '8'))
DELETE TOP (30) PERCENT
FROM temp
WHERE (Country IN ('B', 'C', 'F'))
--=== RUN CODE TO COMPARE RESULTS...NB only using Country A in code below
--=== ALLROWS
SET STATISTICS TIME,IO ON
SELECT Country ,
Price ,
MAX( Vol )AS MaxVol ,
MIN( Vol )AS MinVol ,
SUM( Vol )AS VolSum ,
COUNT( * )AS CntRows
FROM temp
GROUP BY Country ,
Price
HAVING Country = 'A'
ORDER BY Country , Price;
SET STATISTICS TIME,IO ON
--=== SUM IGNORE TOP?BOTTOM PERCENT as set by @rows variable
--=== declare percentage of rows to exclude
declare @rows decimal(5,2)
set @rows = 0.05 --- ALTER THIS AS REQUIRED
SET STATISTICS TIME,IO ON
;with cte_Rows as
(
SELECT TOP 100 PERCENT @rows * COUNT( * )AS bottomrows ,
COUNT( * ) - @rows * COUNT( * )AS toprows ,
price ,
country
FROM dbo.temp
GROUP BY country ,
price
ORDER BY country , price
)
,
cte_RN as
(
SELECT country ,
price ,
vol ,
ROW_NUMBER( )OVER( PARTITION BY price ,
country
ORDER BY country , price , vol DESC )AS nt
FROM temp
)
SELECT cte_RN.Country ,
cte_RN.Price ,
MAX(cte_RN.Vol )AS MaxVol ,
MIN(cte_RN.Vol )AS MinVol ,
SUM(cte_RN.Vol )AS VolSum ,
COUNT( * )AS CntRows
FROM
cte_RN INNER JOIN cte_rows ON cte_RN.price = cte_rows.price
AND cte_RN.country = cte_rows.country
AND cte_RN.nt > cte_rows.bottomrows
AND cte_RN.nt <= cte_rows.toprows
WHERE cte_RN.country IN ( 'A' ) ---, 'B', 'C','D','E')
GROUP BY cte_RN.country, cte_RN.price
SET STATISTICS TIME,IO OFF
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply