Power BI – Hungry Median


Introduction

The median is a useful statistical function that first appeared natively in SSAS 2016, and in Power BI around the same time; from before the native DAX function was introduced, there are several articles on how to implement median yourself in DAX. With one client we recently faced an issue when using the implicit median function in Power BI. The dataset had roughly 30 million records, which I would say is nothing challenging for Power BI or DAX itself. However, the behavior of the median function was not convincing at all. Let's look at the setup:

I created a test dataset based on free data from weather sensors in one city (a download link is at the end of the blog), which has data characteristics similar to our report with the original issue.

We have the following attributes: date, hour, and location (just a numeric ID of the location, which is fine for our test), and we are monitoring the temperature.

We have 35 million records, with 944 unique temperature values, 422 unique locations, and of course 24 hours.

Now we make a simple report: we would like to see the median temperature per hour, regardless of date or location.

Measure:

MEASURE Senzors[Orig_Med] =
    MEDIAN ( Senzors[temperature] )
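For reference, a minimal DAX query along these lines (a sketch you could run in DAX Studio; I am assuming the hour column is called Senzors[hour], based on the attributes listed above) reproduces the per-hour visual:

EVALUATE
SUMMARIZECOLUMNS (
    Senzors[hour], -- assumed column name
    "Median temperature", [Orig_Med]
)
ORDER BY Senzors[hour]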

The following result took 71 seconds to complete on the dataset in Power BI Desktop, and consumed almost 8 GB of memory:

Memory profile during the DAX query:

If you try to publish this report to Power BI service, you will get the following message:

I was simply stunned! But what can you tune on such a simple query with such a simple measure?

Tuning 1 – Rewrite Median?

I was a bit disappointed by the median function. When we filtered by date, query performance was fine, but on the larger dataset it did not perform at all.

I know nothing about the inner implementation of the median function in DAX, but judging by the memory consumption, it looks as if the column is materialized in the background and sorted while searching for the median.

Here's a bit of theory about the median and a few facts about columnar storage, so we can see how to take advantage of the data combination/granularity we have in the model.

Below are two median examples for a set of numbers: one where the count of numbers is even and one where it is odd. For instance, the median of {1, 3, 5} is 3, while the median of {1, 3, 5, 8} is the average of the two middle values, (3 + 5) / 2 = 4. You can find more median theory on Wikipedia.

The rules for calculating the median are the same even when the numbers in the set repeat (are non-unique). Here are the steps of a potential algorithm:

  1. Sort the existing values.
  2. Find the median position(s).
  3. Take the value at that position, or the average of the two values, to get the median.

Let's look at this from the perspective of a column store, where we have just a handful of distinct values, each repeated hundreds of times.

As we know, counting is very fast in a column store, and that could be our advantage here, since we have a small number of unique values repeated many times.

The following example data visualizes how we can take advantage of the fact described above.

Temperature | Count | Cumulative Count | Cumulative Count Start
12          | 500   | 500              | 1
13          | 500   | 1000             | 501
18          | 500   | 1500             | 1001
20          | 501   | 2001             | 1501

Total count: 2001
Position of median (odd): 1001
Position of median (even): 1001

In this case, we only need to scan four values and find which interval the median position falls into.

In the worst-case scenario, the median positions fall between two values, as in the following table (we changed the last count from 501 to 500):

Temperature | Count | Cumulative Count | Cumulative Count Start
12          | 500   | 500              | 1
13          | 500   | 1000             | 501
18          | 500   | 1500             | 1001
20          | 500   | 2000             | 1501

Total count: 2000
Position of median (odd): 1000
Position of median (even): 1001
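To make the worst case concrete: with 2000 rows, the two middle positions are 1000 and 1001. Position 1000 falls into the interval of temperature 13 (rows 501 to 1000) and position 1001 into the interval of temperature 18 (rows 1001 to 1500), so the median is (13 + 18) / 2 = 15.5.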

How to implement this in DAX:

The first helper measures are a plain count and a cumulative count for temperature:

MEASURE Senzors[RowCount] =
    COUNTROWS ( Senzors )

MEASURE Senzors[TemperatureRowCountCumul] =
    -- count all rows with a temperature at or below the current one
    VAR _currentTemp = MAX ( Senzors[temperature] )
    RETURN
        CALCULATE (
            COUNTROWS ( Senzors ),
            Senzors[temperature] <= _currentTemp
        )
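As a sanity check against the sample table above: evaluated for temperature 18, [TemperatureRowCountCumul] counts all rows with temperature 12, 13, or 18 and returns the cumulative count of 1500.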

The second and third measures give us the position(s) of the median for the given context:

MEASURE Senzors[MedianPositionEven] =
    ROUNDUP ( COUNTROWS ( Senzors ) / 2, 0 )

MEASURE Senzors[MedianPositionOdd] =
    VAR _cnt = COUNTROWS ( Senzors )
    RETURN
        ROUNDUP ( _cnt / 2, 0 )
            -- this is a trick where a boolean is auto-cast to an int (0 or 1)
            + ISEVEN ( _cnt )
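For example, with the 2001 rows of the first table, both measures return position 1001. With the 2000 rows of the worst-case table, ISEVEN adds 1, so the measures return positions 1000 and 1001; together they always cover the one or two positions we need to average.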

The fourth measure, the calculated median, does what we described in the tables above: it iterates through the temperature values, finds the row(s) that contain the median position(s), and averages them.

MEASURE Senzors[Calc_Med] =
    -- get the two possible positions of the median
    VAR _mpe = [MedianPositionEven]
    VAR _mpeOdd = [MedianPositionOdd]
    -- build a temperature table in the current context,
    -- with the positions where each value starts and finishes
    VAR _TempMedianTable =
        ADDCOLUMNS (
            VALUES ( Senzors[temperature] ),
            "MMIN", [TemperatureRowCountCumul] - [RowCount] + 1,
            "MMAX", [TemperatureRowCountCumul]
        )
    -- filter the table to keep only the value(s) whose interval
    -- contains a median position
    VAR _T_MedianVals =
        FILTER (
            _TempMedianTable,
            ( _mpe >= [MMIN] && _mpe <= [MMAX] )
                || ( _mpeOdd >= [MMIN] && _mpeOdd <= [MMAX] )
        )
    -- return the average of the filtered rows (one or two rows)
    RETURN
        AVERAGEX ( _T_MedianVals, [temperature] )

At most two rows reach the final average.
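To compare both implementations side by side, a query like this sketch (hour column name assumed, as before) can be run in DAX Studio:

EVALUATE
SUMMARIZECOLUMNS (
    Senzors[hour], -- assumed column name
    "Native median", [Orig_Med],
    "Custom median", [Calc_Med]
)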

Let us see the performance of such a measure:

Performance for hour (24 values) | Duration (s) | Memory consumed (GB)
Native median function           | 71           | 8
Custom implementation            | 6.3          | 0.2

Sounds reasonable and promising!

But not so fast: as the number of values we group by grows, the duration grows as well.

Here are some statistics after removing hour (24 values) and bringing location (422 values) into the table.

Performance for location (422 values) | Duration (s) | Memory consumed (GB)
Native median function                | 81           | 8
Custom implementation                 | 107          | 2.5

Look at the memory consumption profile of the calculated median for location below:

That is not so good anymore!

Our custom implementation is a bit slower for location, and although it consumes far less memory, it still will not work in the Power BI service.

This means we have solved only part of the puzzle: our implementation works well only when the number of values we group by is small.

So, what remains to be done to make this report work in the Power BI service?

  • How to improve the overall duration of the query?
  • How to decrease memory consumption?

Tuning 2 – Reduce Memory Consumption

We start with the memory consumption part. First, we need to identify which part of the formula is eating so much memory.

Actually, it is the same part that has the biggest performance impact on the query.

It is the formula for the cumulative count, which is evaluated for each location value multiplied by each temperature value:

MEASURE Senzors[TemperatureRowCountCumul] =
    VAR _currentTemp = MAX ( Senzors[temperature] )
    RETURN
        CALCULATE (
            COUNTROWS ( Senzors ),
            Senzors[temperature] <= _currentTemp
        )

Is there a different way to get a cumulative count without using CALCULATE, one that is more transparent to the Power BI engine? Yes, there is! We can remodel the temperature column and express the cumulative, sorted logic as a many-to-many relationship towards the sensors.

The content of the temperature mapping table looks like this: for each temperature value, the mapping contains one row for every temperature lower than or equal to it, stored in a [LowerTemperature] column. The relationship then goes from TemperatureMapping[LowerTemperature] to Senzors[temperature], so selecting a single TemperatureMapping[temperature] value pulls in all sensor rows at or below that temperature.
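One possible way to build this mapping as a DAX calculated table (a sketch under my assumptions about the model; the original post does not show how the table was created, so it may well have been prepared at the data source instead):

TemperatureMapping =
GENERATE (
    -- one row per distinct temperature
    DISTINCT ( SELECTCOLUMNS ( Senzors, "temperature", Senzors[temperature] ) ),
    -- paired with every distinct temperature at or below it
    FILTER (
        DISTINCT ( SELECTCOLUMNS ( Senzors, "LowerTemperature", Senzors[temperature] ) ),
        [LowerTemperature] <= [temperature]
    )
)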

As a result of this model, when you use the temperature attribute from the TemperatureMapping table, you get:

– cumulative behavior of RowCount;

– the relationship calculated in advance.

For this new model version, we define measures as below:

We already have the RowCount measure; sliced by temperature from the mapping table, it now in fact gives us the cumulative count.

MEASURE Senzors[RowCount] =
    COUNTROWS ( Senzors )

We must create a new measure that gives us the plain, non-cumulative count over the mapping table, so we can calculate the starting position of each temperature value:

MEASURE Senzors[TemperatureMappingRowCount] =
    CALCULATE (
        [RowCount],
        -- keep only the mapping row where the value maps to itself
        FILTER (
            TemperatureMapping,
            TemperatureMapping[LowerTemperature] = TemperatureMapping[temperature]
        )
    )
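Using the sample data from the tables above as an illustration: for temperature 18, the mapping holds the pairs (18, 12), (18, 13), and (18, 18). The filter keeps only (18, 18), so [RowCount] counts just the sensor rows with temperature 18, which is the per-value count of 500.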

New median definition:

MEASURE Senzors[Calc_MedTempMap] =
    -- get the two possible positions of the median
    VAR _mpe = [MedianPositionEven]
    VAR _mpeOdd = [MedianPositionOdd]
    -- temperature values with the positions where each value starts and ends;
    -- [RowCount] is cumulative here thanks to the many-to-many mapping
    VAR _TempMedianTable =
        ADDCOLUMNS (
            VALUES ( TemperatureMapping[temperature] ),
            "MMIN", [RowCount] - [TemperatureMappingRowCount] + 1,
            "MMAX", [RowCount]
        )
    -- keep only the value(s) whose interval contains a median position
    VAR _T_MedianVals =
        FILTER (
            _TempMedianTable,
            ( _mpe >= [MMIN] && _mpe <= [MMAX] )
                || ( _mpeOdd >= [MMIN] && _mpeOdd <= [MMAX] )
        )
    -- return the average of the filtered rows (one or two rows)
    RETURN
        AVERAGEX ( _T_MedianVals, [temperature] )
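To test the new measure, a query sketch along these lines (location column name assumed) groups it by location:

EVALUATE
SUMMARIZECOLUMNS (
    Senzors[location], -- assumed column name
    "Median temperature", [Calc_MedTempMap]
)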

Alright, let's check the performance. The memory consumption is now just tens of megabytes!

Performance of many-to-many median | Duration (s) | Memory consumed (GB)
Used with hour                     | 2.2          | 0.02
Used with location                 | 41.1         | 0.08

I think we can be happy with that; the memory puzzle seems to be solved.

You can download a sample PBI file (I reduced the data to just one month, but you can download the whole dataset).

Below is the statistics summary so far:

Performance for hour (24 values) | Duration (s) | Memory consumed (GB)
Native median function           | 71.00        | 8.00
Custom implementation            | 6.30         | 0.20
Many-to-many median              | 2.20         | 0.02

Performance for location (422 values) | Duration (s) | Memory consumed (GB)
Native median function                | 81.00        | 8.00
Custom implementation                 | 107.00       | 2.50
Many-to-many median                   | 41.10        | 0.08

I'll stop this blog post here, as it is already long enough. Next week I'll bring the second part, covering how to improve performance so that users have a better experience with this report.
