Question

Polars - groupby mean on list

I want to compute the element-wise mean of embedding vectors within each group. For example:

import polars as pl

pl.DataFrame({
    "id": [1, 1, 2, 2],
    "values": [
        [1,1,1], [3, 3, 3],
        [1,1,1], [2, 2, 2]
    ]
})
shape: (4, 2)
id  values
i64 list[i64]
1   [1, 1, 1]
1   [3, 3, 3]
2   [1, 1, 1]
2   [2, 2, 2]

Expected result:

import numpy as np

pl.DataFrame({
    "id":[1,2],
    "values": np.array([
        [[1,1,1], [3, 3, 3]],
        [[1,1,1], [2, 2, 2]]
    ]).mean(axis=1)
})
shape: (2, 2)
id  values
i64 list[f64]
1   [2.0, 2.0, 2.0]
2   [1.5, 1.5, 1.5]
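
To make the intended semantics explicit, here is a minimal plain-Python sketch of the same computation, with no Polars or NumPy involved. The names `ids`, `vectors`, `grouped` and `expected` are only illustrative, and the sketch assumes all vectors sharing an id have the same length.

```python
from collections import defaultdict
from statistics import fmean

ids = [1, 1, 2, 2]
vectors = [[1, 1, 1], [3, 3, 3], [1, 1, 1], [2, 2, 2]]

# Collect the vectors that belong to each id.
grouped = defaultdict(list)
for key, vec in zip(ids, vectors):
    grouped[key].append(vec)

# zip(*vecs) transposes a group's vectors, so each tuple holds the values
# found at one position; fmean then averages that position.
expected = {
    key: [fmean(position) for position in zip(*vecs)]
    for key, vecs in grouped.items()
}

print(expected)
```

Each output list has one mean per vector position, which is what I would like to get from a Polars group-by.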
1 Jan 1970

Solution

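# df is the DataFrame from the question
(
    df
    # tag every list element with its position inside the list
    .with_columns(i = pl.int_ranges(pl.col.values.list.len()))
    # one row per (id, position) element
    .explode('values', 'i')
    # average each position within each id
    .group_by('id', 'i', maintain_order = True)
    .mean()
    # collect the per-position means back into one list per id
    .group_by('id', maintain_order = True)
    .agg('values')
)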

┌─────┬─────────────────┐
│ id  ┆ values          │
│ --- ┆ ---             │
│ i64 ┆ list[f64]       │
╞═════╪═════════════════╡
│ 1   ┆ [2.0, 2.0, 2.0] │
│ 2   ┆ [1.5, 1.5, 1.5] │
└─────┴─────────────────┘
2024-07-03
Roman Pekar