What can be done in SAS must be possible in SQL
A friend of mine is a data professional in marketing data analysis with expertise in SAS and pretty good working knowledge of SQL. When he sent me questions about “how to do in SQL…”, I often sent back a query with either a sub query, or a JOIN, or a CTE, or with all of them.
My friend has been always happy to take my suggestions, until one day he sent me a code snippet and insisted that what can be done in SAS must be possible in SQL too.
Getting the MAX row (by date) in each ID
In the following example, each ID can have multiple rows. What he wants to do is to get the MAX row (by date) in each ID, and then get the status. He also wanted to get a row count for each ID.
The 3 rows highlighted in red should be the output.
id | date | status |
101 | 1/3/2013 | 1 |
101 | 4/6/2012 | 3 |
101 | 11/23/2014 | 2 |
108 | 6/16/2007 | 3 |
108 | 1/24/2003 | 1 |
109 | 5/21/2014 | 1 |
A simple aggregate query with a GROUP BY will not work
A simple aggregate query with a GROUP BY, such as the one below, is not going to work, because each aggregate function MAX are independent of each other.
SELECT EventID , MAX(EventDate) AS max_EventDate , MAX(EventStatus) AS max_EventStatus , COUNT(*) AS row_count FROM #test_max_row GROUP BY EventID
SAS has solved the classic MAX VALUE vs. MAX ROW problem
This is a classic MAX VALUE vs. MAX ROW problem in SQL.
It turned out that SAS has solved this classic problem long time ago, with no sub query whatsoever.
Here is the code in SAS my friend sent to me.
data test;
input id date mmddyy10. status;
cards;
101 01/03/2013 1
101 04/06/2012 3
101 11/23/2014 2
108 06/16/2007 3
108 01/24/2003 1
109 05/21/2014 1
;
run;
* In SAS;
proc sql;
create table results as
select id, date format=mmddyy10., status, count(id) as cnt_id
from test
group by id
having date=max(date);
quit;
The trick in the above proc SQL with SAS-flavor is in the in the HAVING clause, date=max(date).
In standard SQL, the HAVING clause cannot have any column by itself and any columns in the HAVING clause must be companied by an aggregate function. In another word, the following is illegal in SQL:
having date=max(date)
I have programmed in SAS before for 2 years and have some basic understanding of proc SQL in SAS. So it’s not a surprise to me that SAS has its own implementation of SQL.
It’s easy in SQL too but…
To find the max row in SQL is not hard either, but most would require a JOIN or sub query. Most popular ways are:
- Use Correlated query
- Find the MAX(date) with GROUP BY, then JOIN to the original table
- Use a sub query that uses ROW_NUMBER() OVER with PARTITION BY and ORDER BY.
ROW_NUMBER() OVER can be used to find the MAX row, and also to de-duplicate data
My favorite is the last one that uses the ROW_NUMBER() OVER. It’s very flexible, and can be used to not only find the MAX row, but also to de-duplicate data.
But none of the above is a one-liner query, meaning with single SELECT.
A one-liner query is possible in SQL
I agree with my friend that a one-liner query is possible in SQL. In the end, this is what I came out with.
Here is my first try.
I used the OVER() clause three times. This OVER() with PARTITION BY and ORDER BY sorted the rows by date in descending order in each ID. The I used the FIRST_VALUE() function to get the MAX date row. The first one gets the first value of the date, and the second one gets the first value of the status.
The last OVER() uses only a PARTITION by not the ORDER BY because the count() aggregate function does not care about any particular ordering.
The result has the same number of rows as the original table with each ID being repeated. By adding a DISTINCT word, the result is reduced to one ID per row.
Here is the final query:
Here is the query in text.
SELECT DISTINCT EventID , first_value(EventDate) OVER(PARTITION BY EventID ORDER BY EventDate DESC) AS max_EventDate , first_value(EventStatus) OVER(PARTITION BY EventID ORDER BY EventDate DESC) AS max_row_EventStatus , count(EventID) OVER(PARTITION BY EventID) AS row_count_for_the_EventID FROM #test_max_row
Cannot recommend this
I didn’t have time to experiment with a large table to test the execution plan and query time against the other more conventional ways mentioned above, so I cannot recommend this yet. But I am not very optimistic about this query by just looking at it. The PARTITION BY was repeated three times. If you have more columns in your output, the PARTITION BY will be repeated even more times. SQL Server might have done some optimization in this kind of scenario, but I cannot be sure without further experimenting.
Another special thing about this query is that it didn’t use GROUP BY. PARTITION BY is essentially the same as GROUP BY, but without reducing the results. That’s why I had to use the DISTINCT in the SELECT. DISTINCT itself is also a GROUP BY in disguise. This is another reason I am not very optimistic about this query.
It’s certainly fun to experiment!