Finding 5 most recent records for each customer with abnormal order amounts

  • The database consists of the following tables:

    create table dbo.customer (

    customer_id int identity primary key clustered,

    customer_name nvarchar(256) not null

    )

    create table dbo.purchase_order (

    purchase_order_id int identity primary key clustered

    customer_id int not null,

    amount money not null,

    order_date date not null

    )

    Implement a query for the report that will provide the following information: for each customer output at most 5 different dates which contain abnormally high or low amounts (bigger or less than 3 times SDTDEV from AVG), for each of these dates output minimum and maximum amounts as well.

    Possible result: https://www.sqlservercentral.com/Forums/Uploads/image-unavailable.png

  • I don't have sample data to test, but this might help you.

    WITH AvailableDates AS(

    SELECT customer_id,

    order_date,

    STDEV(amount) stdev_amount,

    AVG(amount) avg_amount,

    MIN(amount) min_amount,

    MAX(amount) max_amount

    FROM purchase_order

    GROUP BY customer_id,

    order_date

    HAVING ABS( STDEV(amount)) * 3 > AVG(amount)

    ),

    RowNums AS(

    SELECT customer_id,

    order_date,

    stdev_amount,

    avg_amount,

    min_amount,

    max_amount,

    ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY stdev_amount DESC) rn

    FROM AvailableDates

    )

    SELECT customer_name,

    order_date,

    min_amount,

    max_amount

    FROM RowNums r

    JOIN customer c ON r.customer_id = c.customer_id

    WHERE rn <= 5;

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • HAVING ABS( STDEV(amount)) * 3 > AVG(amount)

    This clause aint correct. We are looking for amount Not Between stdev_amt - 3 * avg_amt And stdev_amt + 3 * avg_amt

  • Do you need more help with this?

    If so, could you post sample data and expected results?

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • FWIW, sure potentially sounds to me like homework or the equivalent.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • jbionic2010 (11/5/2014)


    HAVING ABS( STDEV(amount)) * 3 > AVG(amount)

    This clause aint correct. We are looking for amount Not Between stdev_amt - 3 * avg_amt And stdev_amt + 3 * avg_amt

    Nor is the above post correct. It should be:

    HAVING amount NOT BETWEEN AVG(amount) - (STDEV(amount) * 3) AND AVG(amount) + (STDEV(amount) * 3)

    Keep in mind that my having clause is somewhat "pseudocode", because you couldn't quite get the result you seek from a single query. You'd have to pre-determine the average value and the STDEV value, and then measure each detail record against those values.

    Also, one finds "abnormal" conditions when an amount is greater than 3 standard deviations from the mean. It's a fairly common statistical measure of "normalcy", as far as that goes. In a large group of data with a large standard deviation, even that might not be "normal". One must be VERY careful with statistics, as the meaning of a word like "normal" can so easily be made to "fit the situation" instead of letting the data tell no tales because there aren't actually any tales to tell. One measure of confidence in one's data can come from another statistical measure called statistical significance. I'd have to go look up how that's computed, but the idea behind it is that you don't come to conclusions based on extremely small sample sizes, or where a very small number of samples are taken from a large dataset. Thus, what average and/or STDEV is one going to use? One derived on a per customer basis, or one derived across the entire customer base? I'd be more likely to seek the latter, and again, due to the potential for some customers to just not have enough orders to make their data statistically significant. I might also seek to apply different standards to different classes of customers, assuming that there is a sales history demonstrating that customers come in sufficiently different sizes, in terms of overall number of orders.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • The following query calculates the SD and MEAN by customer (it doesn't include order_date).

    select PO.customer_id

    ,PO.order_date

    ,row_number() over(partition by order_date order by order_date desc) as DATE_LINE

    ,PO.amount

    ,row_number() over(partition by order_date order by amount desc) as BIG_AMT_LINE

    ,row_number() over(partition by order_date order by amount) as SMALL_AMT_LINE

    from dbo.purchase_order PO

    inner join (select customer_id

    ,stdev(amount) as SDX

    ,avg(amount) as MEANX

    ,(MEANX - (3 * SDX)) as MIN_AMT

    ,(MEANX + (3 * SDX)) as MAX_AMT

    from dbo.purchase_order

    group by customer_id

    ) RANGE on PO.customer_id = RANGE.customer_id and (PO.amount < RANGE.min_amt or PO.amount > RANGE.max_amt)

    qualify (big_amt_line = 1 or small_amt_line = 1)

    and date_line <= 5

    If you need to calculate SD and MEAN by customer_id and order_date, use the following query

    select PO.customer_id

    ,PO.order_date

    ,row_number() over(partition by order_date order by order_date desc) as DATE_LINE

    ,PO.amount

    ,row_number() over(partition by order_date order by amount desc) as BIG_AMT_LINE

    ,row_number() over(partition by order_date order by amount) as SMALL_AMT_LINE

    from dbo.purchase_order PO

    inner join (select customer_id

    ,order_date

    ,stdev(amount) as SD

    ,avg(amount) as MEAN

    ,(MEAN - (3 * SD)) as MIN_AMT

    ,(MEAN + (3 * SD)) as MAX_AMT

    from dbo.purchase_order

    group by customer_id

    ,order_date

    ) RANGE on PO.customer_id = RANGE.customer_id and PO.order_date = RANGE.order_date and (PO.amount < RANGE.min_amt or PO.amount > RANGE.max_amt)

    qualify (big_amt_line = 1 or small_amt_line = 1)

    and date_line <= 5

    Hope this helps.

  • VSP (11/11/2014)


    The following query calculates the SD and MEAN by customer (it doesn't include order_date).

    select PO.customer_id

    ,PO.order_date

    ,row_number() over(partition by order_date order by order_date desc) as DATE_LINE

    ,PO.amount

    ,row_number() over(partition by order_date order by amount desc) as BIG_AMT_LINE

    ,row_number() over(partition by order_date order by amount) as SMALL_AMT_LINE

    from dbo.purchase_order PO

    inner join (select customer_id

    ,stdev(amount) as SDX

    ,avg(amount) as MEANX

    ,(MEANX - (3 * SDX)) as MIN_AMT

    ,(MEANX + (3 * SDX)) as MAX_AMT

    from dbo.purchase_order

    group by customer_id

    ) RANGE on PO.customer_id = RANGE.customer_id and (PO.amount < RANGE.min_amt or PO.amount > RANGE.max_amt)

    qualify (big_amt_line = 1 or small_amt_line = 1)

    and date_line <= 5

    If you need to calculate SD and MEAN by customer_id and order_date, use the following query

    select PO.customer_id

    ,PO.order_date

    ,row_number() over(partition by order_date order by order_date desc) as DATE_LINE

    ,PO.amount

    ,row_number() over(partition by order_date order by amount desc) as BIG_AMT_LINE

    ,row_number() over(partition by order_date order by amount) as SMALL_AMT_LINE

    from dbo.purchase_order PO

    inner join (select customer_id

    ,order_date

    ,stdev(amount) as SD

    ,avg(amount) as MEAN

    ,(MEAN - (3 * SD)) as MIN_AMT

    ,(MEAN + (3 * SD)) as MAX_AMT

    from dbo.purchase_order

    group by customer_id

    ,order_date

    ) RANGE on PO.customer_id = RANGE.customer_id and PO.order_date = RANGE.order_date and (PO.amount < RANGE.min_amt or PO.amount > RANGE.max_amt)

    qualify (big_amt_line = 1 or small_amt_line = 1)

    and date_line <= 5

    Hope this helps.

    VSP,

    The latter query would probably not make much sense to use, because on any one order date, the most likely scenario is that the mean value is the only value, and probably 99.9% of the time. Also, the MIN and MAX amounts may actually be meant to be the MIN and MAX values for the order amounts, as opposed to the computed values for 3 STDEV's above and below the mean. We have yet to hear back from the original poster on any of this.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply