Question
Distinct elements across subgroups for each group in polars
Consider this data frame:
df = (polars
.DataFrame(
dict(
j=[1,1,1,1,2,2,3,3,3,3,3,3,3],
k=[1,1,2,2,3,3,4,4,5,5,6,6,6],
l=[1,2,1,2,2,2,3,4,3,3,3,4,3],
u=[1,1,1,1,2,2,3,3,3,3,3,3,3],
)
)
)
j k l u
i64 i64 i64 i64
1 1 1 1
1 1 2 1
1 2 1 1
1 2 2 1
2 3 2 2
2 3 2 2
3 4 3 3
3 4 4 3
3 5 3 3
3 5 3 3
3 6 3 3
3 6 4 3
3 6 3 3
shape: (13, 4)
I can do this:
dfj = (df
.group_by('j', 'k', maintain_order=True)
.agg(
i=polars.struct('l', 'u').unique()
)
)
j k i
i64 i64 list[struct[2]]
1 1 [{2,1}, {1,1}]
1 2 [{2,1}, {1,1}]
2 3 [{2,2}]
3 4 [{3,3}, {4,3}]
3 5 [{3,3}]
3 6 [{3,3}, {4,3}]
shape: (6, 3)
Question 1: Why did agg
not aggregate rows in order? E.g. I'd expect the first row to be this instead:
j k i
1 1 [{1,1}, {2,1}]
Can this be resolved somehow?
I can then do this:
dfk = (dfj
.group_by('j', maintain_order=True)
.agg(
o=polars.col('i').unique()
)
)
polars.exceptions.InvalidOperationError: `unique` operation not supported for dtype `list[struct[2]]`
Question 2: Why is unique not working for lists (of structs)?
The above is my XY problem.
Now consider j
a group, k
a subgroup within j
and let's call i
a value of a subgroup. Consider the first group_by
above: after fixing the ordering of structs within i
, it logically breaks into:
j k i
i64 i64 list[struct[2]]
---
1 1 [{1,1}, {2,1}]
1 2 [{1,1}, {2,1}]
---
2 3 [{2,2}]
---
3 4 [{3,3}, {4,3}]
3 5 [{3,3}]
3 6 [{3,3}, {4,3}]
shape: (6, 3)
If you look at all groups:
- Group 1 has two subgroups 1 and 2 and both their values are the same
[{1,1}, {2,1}]
- Group 2 has one subgroup 3 and a unique value [{2,2}]
- Group 3 has 3 subgroups 4, 5 and 6 and they have two distinct values
[{3,3}, {4,3}]
and[{3,3}]
Question 3: (My second-level XY problem) Given a starting data frame with at least 3 columns, where j
is a group and k
is a subgroup, get a list of all unique values in all the remaining columns, per group. For the above, I'd expect this output (or something similar - e.g. the order within the list may not matter necessarily):
j i
i64 list[list[struct[2]]]
1 [[{1,1}, {2,1}]]
2 [[{2,2}]]
3 [[{3,3}, {4,3}], [{3,3}]]
shape: (3, 2)
I'm only sligtly interested in both XY problems, as they require some data wrangling that might be useful in general / in the future.
Question 4: (This is as far from an XY problem as I managed to get) Is there a way to list all subgroups that have different set of values in the remaining columns? For example, the output could be:
j k l u
i64 i64 i64 i64
3 4 3 3
3 4 4 3
3 5 3 3
3 5 3 3
3 6 3 3
3 6 4 3
3 6 3 3
shape: (7, 4)
or any other structural variation of that that's more convenient - as long as it lists either j
or k
in some way (e.g. as separate rows or packed into a struct or a list) and at least one row / list element for each of the distinct (l
, u
) values. This would be another acceptable output:
k l u
i64 i64 i64
4 3 3
4 4 3
5 3 3
6 3 3
6 4 3
shape: (5, 3)