Distribution of data, percentile curve.

  • Dear Reader,

    Fan: Of Percentile curves in Excel.

    This presents in Excel a value (or a count of values) and the percentile which values falls below that value.

    So a curve which starts at 0 and finishes at 100 percent and always goes up.

    On of the uses can be how the distribution of the number of different values is for a varchar field.

    This is represented below.

    But could also be about amounts.

    Used to do the row based calculations in SQL-server. And the running total in Excel.

    But then you have to add a few columns in Excel.

    Now I have build a 'template' to do all calculations in SQL-server.

    So this is less work, but I am disappointed in the performance of this query.

    Can this be improved ?

    Am I doing something wrong ?

    Typical number of rows: 1 000 000

    Typical number of different values: 1 000

    (In excel this results in hundreds or less rows).

    Or when a numeric field is used, the number of values can be greater or smaller.

    Ben

    --

    -- Calculate a percentiele column for a distribution of a value in a column CCCCCC1 (or in two or more columns).

    --

    -- For presentation the column Tel is the X axis.

    -- column percentile is the Y axis. (Scale from 0 to 100).

    declare @work1 varchar(8000)

    set @work1 = '

    declare @Row_Count bigint

    select @Row_Count = count(*) from TTTTTT

    ;

    With

    A as (select CCCCC1, CCCCC2, COUNT(*) as tel FROM patients where WWWWWW group by CCCCC1, CCCCC2 ),

    B as (select tel,COUNT(*) as freq, tel* COUNT(*) as area from A group by tel),

    C as (select ROW_NUMBER() OVER(PARTITION BY 1 ORDER BY tel) nr, * from B),

    D as (

    select *, tel as curve from C where c.nr = 1

    Union all

    select C.*, D.curve+c.area as curve from C join D on C.nr = D.nr+1

    )

    select *, 100.0*curve/@row_count as percentile from D order by tel OPTION (MAXRECURSION 0)

    '

    set @work1 = REPLACE(@work1,'TTTTTT', '<Tabel_name>')

    set @work1 = REPLACE(@work1,'CCCCC1', '<Column_name>')

    set @work1 = REPLACE(@work1,'WWWWWW', '1 = 1')

    set @work1 = REPLACE(@work1,', CCCCC2', '')

    PRINT @WORK1

    exec (@work1)

    A good example of a percentile curve is the Lorenz Curve,

    which shows in a single glance the distribution of wealth.

  • What do you mean by

    I am disappointed in the performance

    ?

    Would you please post some sample data (table def and sample data) together with your actual execution plan so we have something to work with?



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • well...I didn't know this was called a Lorenz Curve...and neither have I done this in SQL before.

    but saw this "simple" explanation http://courses.washington.edu/geog326/Lorenz.pdf in how to do it in Excel.

    so...thought I would share the following code to see ifs anywhere need what you are looking for

    /*some thoughts on how to get data for Lorenz Curve

    intial ideas are based on http://courses.washington.edu/geog326/Lorenz.pdf*/

    /*set up some test data*/

    USE tempdb

    GO

    IF OBJECT_ID('tempdb..TransData', 'U') IS NOT NULL

    DROP TABLE tempdb..TransData ;

    IF OBJECT_ID('tempdb..tempcalc', 'U') IS NOT NULL

    DROP TABLE tempdb..tempcalc ;

    SELECT TOP 1000000

    RegionID = 1 + CAST(Abs(Checksum(Newid()) % 1000 ) AS INT),

    PopCount = 1 + CAST(Abs(Checksum(Newid()) % 90) AS INT),

    Amount = 1 + CAST(Rand(Checksum(Newid())) * 1000 AS INT)

    INTO TransData

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    CROSS JOIN sys.all_columns ac3

    /*skew the data a litle to emphasise the results*/

    UPDATE TransData

    SET PopCount = PopCount * 1.2, Amount = Amount * 0.3

    WHERE (RegionID % 3 = 0)

    UPDATE TransData

    SET PopCount = PopCount * 0.3, Amount = Amount * 2.5

    WHERE (RegionID % 23 = 0)

    /*solution follows.....there are other methods for running totals but here we use the quirky update method....

    http://www.sqlservercentral.com/articles/T-SQL/68467/ by Jeff Moden

    be sure you read and fully understand ALL the requirements to ensure data integrity*/

    /*first off we will pre calculate the different requirements into "tempcalc"

    this could be a cte/temp table...but the new table is created here for further analysis and testing*/

    DECLARE @pop AS INT

    DECLARE @amt AS INT

    SET @pop = ( SELECT SUM(PopCount) FROM TransData )

    SET @amt = ( SELECT SUM(Amount) FROM TransData )

    SELECT RegionID,

    SUM(PopCount) popcnt,

    SUM(PopCount) * 1.00 / @pop as poppct,

    SUM(Amount) AS amt ,

    SUM(Amount) * 1.00 / @amt as amtpct,

    (SUM(Amount) * 1.00 / @amt) / (SUM(PopCount) * 1.00 / @pop) as calc,

    row_number() over (order by (SUM(Amount) * 1.00 / @amt) / (SUM(PopCount) * 1.00 / @pop)) rn,

    cast(0 as numeric (25,13)) as popRT,

    cast(0 as numeric (25,13)) as amtRT

    INTO tempcalc

    FROM TransData

    GROUP BY RegionID

    /*the CIX below is an absolute requirement when using quirky update*/

    CREATE UNIQUE CLUSTERED INDEX [CIX_rn] ON [dbo].[tempcalc]

    ([rn] ASC )

    GO

    DECLARE @popRT NUMERIC(25, 13)

    DECLARE @amtRT NUMERIC(25, 13)

    DECLARE @SafetyCounter INT

    SET @popRT = 0

    SET @amtRT = 0

    SET @SafetyCounter = 1

    ;WITH

    cteWork AS

    (

    SELECT SafetyCounter = ROW_NUMBER() OVER (order by rn),

    [poppct],[amtpct],[popRT],[amtRT]

    FROM dbo.tempcalc

    )

    UPDATE cteWork

    SET @popRT = popRT = CASE

    WHEN SafetyCounter = @SafetyCounter

    THEN @popRT + poppct

    ELSE CAST('Safety Counter went out of sync at ' + CAST(@SafetyCounter AS VARCHAR(10)) AS INT)

    END,

    @amtRT = amtRT = CASE

    WHEN SafetyCounter = @SafetyCounter

    THEN @amtRT+amtpct

    ELSE CAST('Safety Counter went out of sync at ' + CAST(@SafetyCounter AS VARCHAR(10)) AS INT)

    END,

    @SafetyCounter = @SafetyCounter + 1

    FROM cteWork WITH (TABLOCKX)

    OPTION (MAXDOP 1)

    GO

    /*results here*/

    select null as RegionID, 0 as popRT , 0 as amtRT, 0 as rn

    union all

    SELECT RegionID, popRT, amtRT, rn

    FROM tempcalc

    order by rn

    drop table tempcalc

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Hello all,

    Thanks for your contributions and sorry for not responding.

    This weekend I had a fall and have a head injuri. In a few days time everything should be back to normal.

    I'll pick up this thread then. Sorry that I am not able to participate at this time.

    Ben

  • ben.brugman (4/28/2014)


    Hello all,

    Thanks for your contributions and sorry for not responding.

    This weekend I had a fall and have a head injuri. In a few days time everything should be back to normal.

    I'll pick up this thread then. Sorry that I am not able to participate at this time.

    Ben

    Sorry to hear that Ben...trust you recover quickly,that is far more important than this thread !

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply