Question

Distinct elements across subgroups for each group in polars

Consider this data frame:

df = (polars
  .DataFrame(
    dict(
      j=[1,1,1,1,2,2,3,3,3,3,3,3,3],
      k=[1,1,2,2,3,3,4,4,5,5,6,6,6],
      l=[1,2,1,2,2,2,3,4,3,3,3,4,3],
      u=[1,1,1,1,2,2,3,3,3,3,3,3,3],
      )
    )
  )

 j    k    l    u
 i64  i64  i64  i64
 1    1    1    1
 1    1    2    1
 1    2    1    1
 1    2    2    1
 2    3    2    2
 2    3    2    2
 3    4    3    3
 3    4    4    3
 3    5    3    3
 3    5    3    3
 3    6    3    3
 3    6    4    3
 3    6    3    3
shape: (13, 4)

I can do this:

dfj = (df
  .group_by('j', 'k', maintain_order=True)
  .agg(
    i=polars.struct('l', 'u').unique()
    )
  )

 j    k    i
 i64  i64  list[struct[2]]
 1    1    [{2,1}, {1,1}]
 1    2    [{2,1}, {1,1}]
 2    3    [{2,2}]
 3    4    [{3,3}, {4,3}]
 3    5    [{3,3}]
 3    6    [{3,3}, {4,3}]
shape: (6, 3)

Question 1: Why did agg not aggregate rows in order? E.g. I'd expect the first row to be this instead:

 j    k    i
 1    1    [{1,1}, {2,1}]

Can this be resolved somehow?

I can then do this:

dfk = (dfj
  .group_by('j', maintain_order=True)
  .agg(
    o=polars.col('i').unique()
    )
  )

polars.exceptions.InvalidOperationError: `unique` operation not supported for dtype `list[struct[2]]`

Question 2: Why is unique not working for lists (of structs)?

The above is my XY problem.

Now consider j a group, k a subgroup within j and let's call i a value of a subgroup. Consider the first group_by above: after fixing the ordering of structs within i, it logically breaks into:

 j    k    i
 i64  i64  list[struct[2]]
---
 1    1    [{1,1}, {2,1}]
 1    2    [{1,1}, {2,1}]
--- 
 2    3    [{2,2}]
--- 
 3    4    [{3,3}, {4,3}]
 3    5    [{3,3}]
 3    6    [{3,3}, {4,3}]
shape: (6, 3)

If you look at all groups:

  • Group 1 has two subgroups 1 and 2 and both their values are the same [{1,1}, {2,1}]
  • Group 2 has one subgroup 3 and a unique value [{2,2}]
  • Group 3 has 3 subgroups 4, 5 and 6 and they have two distinct values [{3,3}, {4,3}] and [{3,3}]

Question 3: (My second-level XY problem) Given a starting data frame with at least 3 columns, where j is a group and k is a subgroup, get a list of all unique values in all the remaining columns, per group. For the above, I'd expect this output (or something similar - e.g. the order within the list may not matter necessarily):

 j    i
 i64  list[list[struct[2]]]
 1    [[{1,1}, {2,1}]]
 2    [[{2,2}]]
 3    [[{3,3}, {4,3}], [{3,3}]]
shape: (3, 2)

I'm only sligtly interested in both XY problems, as they require some data wrangling that might be useful in general / in the future.

Question 4: (This is as far from an XY problem as I managed to get) Is there a way to list all subgroups that have different set of values in the remaining columns? For example, the output could be:

 j    k    l    u
 i64  i64  i64  i64
 3    4    3    3
 3    4    4    3
 3    5    3    3
 3    5    3    3
 3    6    3    3
 3    6    4    3
 3    6    3    3
shape: (7, 4)

or any other structural variation of that that's more convenient - as long as it lists either j or k in some way (e.g. as separate rows or packed into a struct or a list) and at least one row / list element for each of the distinct (l, u) values. This would be another acceptable output:

 k    l    u
 i64  i64  i64
 4    3    3
 4    4    3
 5    3    3
 6    3    3
 6    4    3
shape: (5, 3)
 3  71  3
1 Jan 1970

Solution

 3

For question one - there's maintain_order parameter of unique() method:

(
    df
    .group_by('j', 'k', maintain_order=True)
    .agg(i = pl.struct('l', 'u').unique(maintain_order=True))
)

┌─────┬─────┬─────────────────┐
│ j   ┆ k   ┆ i               │
│ --- ┆ --- ┆ ---             │
│ i64 ┆ i64 ┆ list[struct[2]] │
╞═════╪═════╪═════════════════╡
│ 1   ┆ 1   ┆ [{1,1}, {2,1}]  │
│ 1   ┆ 2   ┆ [{1,1}, {2,1}]  │
│ 2   ┆ 3   ┆ [{2,2}]         │
│ 3   ┆ 4   ┆ [{3,3}, {4,3}]  │
│ 3   ┆ 5   ┆ [{3,3}]         │
│ 3   ┆ 6   ┆ [{3,3}, {4,3}]  │
└─────┴─────┴─────────────────┘

For question 2 - you could probably convert your data to string and use list.join() so you can aggregate unique strings into list.

(
    df
    .group_by('j', 'k', maintain_order=True)
    .agg(i = pl.struct('l', 'u').unique(maintain_order=True).cast(pl.String))
    .with_columns(pl.col.i.list.join(','))
    .group_by('j').agg(pl.col.i.unique())
)

┌─────┬──────────────────────────┐
│ j   ┆ i                        │
│ --- ┆ ---                      │
│ i64 ┆ list[str]                │
╞═════╪══════════════════════════╡
│ 3   ┆ ["{3,3},{4,3}", "{3,3}"] │
│ 2   ┆ ["{2,2}"]                │
│ 1   ┆ ["{1,1},{2,1}"]          │
└─────┴──────────────────────────┘

edit Ok, after some additional comments I think the idea is to get all the rows which contain different subgroups within j.

It would've been helpful to check unique list of structs (or lists) but as it doesn't work, maybe we can turn it around? we can check that for each subgroup of l, u we have the same list of k values within j.

(
    df
    .unique(maintain_order=True)
    .group_by('j','l','u', maintain_order=True).agg(pl.col.k)
)

┌─────┬─────┬─────┬───────────┐
│ j   ┆ l   ┆ u   ┆ k         │
│ --- ┆ --- ┆ --- ┆ ---       │
│ i64 ┆ i64 ┆ i64 ┆ list[i64] │
╞═════╪═════╪═════╪═══════════╡
│ 1   ┆ 1   ┆ 1   ┆ [1, 2]    │
│ 1   ┆ 2   ┆ 1   ┆ [1, 2]    │
│ 2   ┆ 2   ┆ 2   ┆ [3]       │
│ 3   ┆ 3   ┆ 3   ┆ [4, 5, 6] │
│ 3   ┆ 4   ┆ 3   ┆ [4, 6]    │
└─────┴─────┴─────┴───────────┘

And now filter out rows where we have the same list of k for all the subgroups.

(
    df
    .unique(maintain_order=True)
    .group_by('j','l','u').agg(pl.col.k)
    .filter(pl.col.k.unique().len().over('j') != 1)
    .explode('k')
)

┌─────┬─────┬─────┬─────┐
│ j   ┆ l   ┆ u   ┆ k   │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ 3   ┆ 3   ┆ 3   ┆ 4   │
│ 3   ┆ 3   ┆ 3   ┆ 5   │
│ 3   ┆ 3   ┆ 3   ┆ 6   │
│ 3   ┆ 4   ┆ 3   ┆ 4   │
│ 3   ┆ 4   ┆ 3   ┆ 6   │
└─────┴─────┴─────┴─────┘
2024-07-09
Roman Pekar