Question
Efficiently reparsing string series (in a dataframe) into a struct, recasting the fields of the struct and then unnesting it
Consider the following toy example:
import polars as pl

xs = pl.DataFrame(
    [
        pl.Series(
            "date",
            ["2024 Jan", "2024 Feb", "2024 Jan", "2024 Jan"],
            dtype=pl.String,
        )
    ]
)

ys = (
    xs.with_columns(
        pl.col("date").str.split(" ").list.to_struct(fields=["year", "month"]),
    )
    .with_columns(
        pl.col("date").struct.with_fields(pl.field("year").cast(pl.Int16()))
    )
    .unnest("date")
)
ys
shape: (4, 2)
┌──────┬───────┐
│ year ┆ month │
│ ---  ┆ ---   │
│ i16  ┆ str   │
╞══════╪═══════╡
│ 2024 ┆ Jan   │
│ 2024 ┆ Feb   │
│ 2024 ┆ Jan   │
│ 2024 ┆ Jan   │
└──────┴───────┘
I think it would be more efficient to do the operations on a unique series of date data (I could use replace, but I have opted for join for no good reason):
unique_dates = (
    pl.DataFrame([xs["date"].unique()])
    .with_columns(
        pl.col("date")
        .str.split(" ")
        .list.to_struct(fields=["year", "month"])
        .alias("struct_date")
    )
    .with_columns(
        pl.col("struct_date").struct.with_fields(
            pl.field("year").cast(pl.Int16())
        )
    )
)
unique_dates
shape: (2, 2)
┌──────────┬──────────────┐
│ date     ┆ struct_date  │
│ ---      ┆ ---          │
│ str      ┆ struct[2]    │
╞══════════╪══════════════╡
│ 2024 Jan ┆ {2024,"Jan"} │
│ 2024 Feb ┆ {2024,"Feb"} │
└──────────┴──────────────┘
zs = (
    # Join on the shared string key; passing `on` together with
    # `left_on`/`right_on` is not accepted by `join`.
    xs.join(unique_dates, on="date")
    .drop("date")
    .rename({"struct_date": "date"})
    .unnest("date")
)
zs
shape: (4, 2)
┌──────┬───────┐
│ year ┆ month │
│ ---  ┆ ---   │
│ i16  ┆ str   │
╞══════╪═══════╡
│ 2024 ┆ Jan   │
│ 2024 ┆ Feb   │
│ 2024 ┆ Jan   │
│ 2024 ┆ Jan   │
└──────┴───────┘
What can I do to improve the efficiency of this operation even further? Am I using polars idiomatically enough?