Question
Summing Values Based on Date Ranges in a DataFrame using Polars
I have a DataFrame (df) that contains the columns ID, Initial Date, Final Date, and Value, and another DataFrame (dates) that contains all the days for each ID from df.
For each day in the dates DataFrame, I want to sum the Value of every ID whose date range contains that day.
Here is my code:
import polars as pl
from datetime import datetime

data = {
    "ID": [1, 2, 3, 4, 5],
    "Initial Date": ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05"],
    "Final Date": ["2022-01-03", "2022-01-06", "2022-01-07", "2022-01-09", "2022-01-07"],
    "Value": [10, 20, 30, 40, 50],
}
df = pl.DataFrame(data)

dates = pl.datetime_range(
    start=datetime(2022, 1, 1),
    end=datetime(2022, 1, 7),
    interval="1d",
    eager=True,
    closed="both",
).to_frame("date")
shape: (5, 4)
┌─────┬──────────────┬────────────┬───────┐
│ ID ┆ Initial Date ┆ Final Date ┆ Value │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════╪══════════════╪════════════╪═══════╡
│ 1 ┆ 2022-01-01 ┆ 2022-01-03 ┆ 10 │
│ 2 ┆ 2022-01-02 ┆ 2022-01-06 ┆ 20 │
│ 3 ┆ 2022-01-03 ┆ 2022-01-07 ┆ 30 │
│ 4 ┆ 2022-01-04 ┆ 2022-01-09 ┆ 40 │
│ 5 ┆ 2022-01-05 ┆ 2022-01-07 ┆ 50 │
└─────┴──────────────┴────────────┴───────┘
shape: (7, 1)
┌─────────────────────┐
│ date │
│ --- │
│ datetime[μs] │
╞═════════════════════╡
│ 2022-01-01 00:00:00 │
│ 2022-01-02 00:00:00 │
│ 2022-01-03 00:00:00 │
│ 2022-01-04 00:00:00 │
│ 2022-01-05 00:00:00 │
│ 2022-01-06 00:00:00 │
│ 2022-01-07 00:00:00 │
└─────────────────────┘
In this case, on 2022-01-01 the value would be 10. On 2022-01-02, it would be 10 + 20, and on 2022-01-03, it would be 10 + 20 + 30, and so on. In other words, I want to check whether each date falls within the range of each row in the DataFrame (df), and if it does, sum the values.
I think the approach is something like this:
(
    dates.with_columns(
        pl.sum(
            pl.when(
                (df["Initial Date"] <= pl.col("date")) & (df["Final Date"] >= pl.col("date"))
            ).then(df["Value"]).otherwise(0)
        ).alias("Summed Value")
    )
)