Question
How to take the average of all previous entries in a group?
I'd like to do the following in python using the polars library:
Input:
df = pl.from_repr("""
┌──────┬────────┐
│ Name ┆ Number │
│ --- ┆ --- │
│ str ┆ i64 │
╞══════╪════════╡
│ Mr.A ┆ 1 │
│ Mr.A ┆ 4 │
│ Mr.A ┆ 5 │
│ Mr.B ┆ 3 │
│ Mr.B ┆ 5 │
│ Mr.B ┆ 6 │
│ Mr.B ┆ 10 │
└──────┴────────┘
""")
Output:
shape: (7, 3)
┌──────┬────────┬──────────┐
│ Name ┆ Number ┆ average │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞══════╪════════╪══════════╡
│ Mr.A ┆ 1 ┆ 0.0 │
│ Mr.A ┆ 4 ┆ 1.0 │
│ Mr.A ┆ 5 ┆ 2.5 │
│ Mr.B ┆ 3 ┆ 0.0 │
│ Mr.B ┆ 5 ┆ 3.0 │
│ Mr.B ┆ 6 ┆ 4.0 │
│ Mr.B ┆ 10 ┆ 4.666667 │
└──────┴────────┴──────────┘
That is to say:
- For every first entry of a person, set the average to zero.
- For every subsequent entry, calculate the average based on the previous entries
Example:
Mr. A started off with average=0 and the Number=1.
Then, Mr. A has the Number=4, thus it took the average of the previous entry (1/1 data=1)
Then, Mr. A has the Number=5, thus the previous average was: (1+4) / (2 data) = 5/2 = 2.5
And so on
I've tried the rolling mean function (using a Polars Dataframe, df), however, I'm restricted by rolling_mean's window size (i.e. it calculates only the past 2 entries, plus it averages the current entry as well; I want to average only the previous entries)
Does anyone have an idea? Much appreciated!:
df.group_by("Name").agg(pl.col("Number").rolling_mean(window_size=2))
4 61
4