Question

What is the best idiomatic way to use .filter() in .agg()?

There are certain contexts in Polars that call on a column, not the entire dataframe, like agg, groupby, with_columns, select, and more. In these contexts filter sometimes needs to be used. What is the idiomatic way to use filter in this situation?

E.g.:

import polars as pl
from polars import col

confidence_level = 0.95
lookback_days = 200

percentile_lower = col.Excess_Return.quantile(quantile=1-confidence_level)
percentile_upper = col.Excess_Return.quantile(quantile=confidence_level)

rachev_ratio = excess_return.with_row_index().rolling('index', period=str(lookback_days)+'i').agg(
    Date = col.Date.last(),
    Price = col.Price.last(),
    PnL = col.PnL.last(),
    TBill_Return = col.TBill_Return.last(),
    Excess_Return = col.Excess_Return.last(),
    Percentile_Lower = percentile_lower,
    Percentile_Upper = percentile_upper,
    Lower_CVaR = col.Excess_Return.filter(col.Excess_Return <= percentile_lower).mean(),
    Upper_CVaR = col.Excess_Return.filter(col.Excess_Return >= percentile_upper).mean(),
.with_columns(
    Rachev_Ratio = (col.Lower_CVaR / col.Upper_CVaR).abs(),
).with_columns(
    Rachev_Ratio = pl.when(col.Rachev_Ratio > 3).then(3).otherwise(col.Rachev_Ratio)
)[lookback_days:].drop('index')

Specifically the lines:

Lower_CVaR = col.Excess_Return.filter(col.Excess_Return <= percentile_lower).mean(),
Upper_CVaR = col.Excess_Return.filter(col.Excess_Return >= percentile_upper).mean(),`

To use filter on col.Excess_Return it needs to be called twice, first so .filter can be called and second in the condition statement itself.

There doesn't seem to be pl.filter as a shorthand. df.filter doesn't work in this scenario, neither does pl.all.filter nor pl.all().filter or pl.col('*').filter.

Does anyone know the idiomatic way to use filter in .agg or in similar contexts? Should the column name be written twice in every if statement?

3 82 3

1 Jan 1970

Solution

Least hacky

You can use the walrus operator to keep from repeating the column directly.

Lower_CVaR = (x:=col.Excess_Return).filter(x<= percentile_lower).mean(),
Upper_CVaR = (y:=col.Excess_Return).filter(y>= percentile_upper).mean(),`

It's not ideal but it's not too hacky.

Really hacky

Here's something, let's make a Class where we'll monkey path an object into your polars namespace and call it "elem" as in pl.elem and we'll make it pl.col('terkcvzlkfgrt4598') where that name is just me mashing the keyboard. That class will have a filter method where you can use pl.elem as a placeholder and then it'll swap out the placeholder for whatever you put into the initializer of the class. Here it is:

from io import StringIO
class CF:
    def __init__(self, col_name):
        import polars as pl
        self.col=pl.col(col_name)

        if 'pl' in globals() and globals()['pl']==pl:
            global_pl='pl'
        else:
            global_keys=globals().keys()
            for k in global_keys:
                if k=="__": continue
                try:
                    if globals()[k]==pl:
                        global_pl=k
                        break
                except:
                    pass
        setattr(globals()[global_pl], 'elem', pl.col('terkcvzlkfgrt4598'))
    def filter(self, filt_expr):
        serialized = (
            filt_expr.meta.serialize()
            .replace(
                """{"Column":"terkcvzlkfgrt4598"}""", 
                self.col.meta.serialize()
                )
        )
        new_filt_expr=pl.Expr.deserialize(StringIO(serialized))
        return new_filt_expr

With that setup, you can then do

Lower_CVaR = CF('Excess_Return').filter(pl.elem<= percentile_lower).mean(),
Upper_CVaR = CF('Excess_Return').filter(pl.elem>= percentile_upper).mean(),

2024-07-11

Dean MacGregor

Solution

As mentioned in the comments, at the moment there's no shortcut for that, but there's an open issue to allow usage of pl.element() inside filter() method.

If your expression is particularly complicated, you could probably use inline anonymous function to shorten it.

In your cause it might look like

Lower_CVaR = (lambda x: x.filter(x <= percentile_lower).mean())(col.Excess_Return),
Upper_CVaR = (lambda x: x.filter(x >= percentile_upper).mean())(col.Excess_Return)

It's not particularly shorter, but at least col.Excess_Return is only mentioned once which means it's a bit less error-prone.

2024-07-11

Roman Pekar