Question

Summing Values Based on Date Ranges in a DataFrame using Polars

I have a DataFrame (df) that contains columns: ID, Initial Date, Final Date, and Value, and another DataFrame (dates) that contains all the days for each ID from df.

On the dates dataframe i want to sum the values if exist on the range of each ID

Here is my code

import polars as pl
from datetime import datetime

data = {
    "ID" : [1, 2, 3, 4, 5],
    "Initial Date" : ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05"],
    "Final Date" : ["2022-01-03", "2022-01-06", "2022-01-07", "2022-01-09", "2022-01-07"],
    "Value" : [10, 20, 30, 40, 50]


}

df = pl.DataFrame(data)

dates = pl.datetime_range(
    start=datetime(2022,1,1),
    end=datetime(2022,1,7),
    interval="1d",
    eager = True,
    closed = "both"
    ).to_frame("date")
shape: (5, 4)
┌─────┬──────────────┬────────────┬───────┐
│ ID  ┆ Initial Date ┆ Final Date ┆ Value │
│ --- ┆ ---          ┆ ---        ┆ ---   │
│ i64 ┆ str          ┆ str        ┆ i64   │
╞═════╪══════════════╪════════════╪═══════╡
│ 1   ┆ 2022-01-01   ┆ 2022-01-03 ┆ 10    │
│ 2   ┆ 2022-01-02   ┆ 2022-01-06 ┆ 20    │
│ 3   ┆ 2022-01-03   ┆ 2022-01-07 ┆ 30    │
│ 4   ┆ 2022-01-04   ┆ 2022-01-09 ┆ 40    │
│ 5   ┆ 2022-01-05   ┆ 2022-01-07 ┆ 50    │
└─────┴──────────────┴────────────┴───────┘
shape: (7, 1)
┌─────────────────────┐
│ date                │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2022-01-01 00:00:00 │
│ 2022-01-02 00:00:00 │
│ 2022-01-03 00:00:00 │
│ 2022-01-04 00:00:00 │
│ 2022-01-05 00:00:00 │
│ 2022-01-06 00:00:00 │
│ 2022-01-07 00:00:00 │
└─────────────────────┘

In this case, on 2022-01-01 the value would be 10. On 2022-01-02, it would be 10 + 20, and on 2022-01-03, it would be 10 + 20 + 30, and so on. In other words, I want to check if the date exists within the range of each row in the DataFrame (df), and if it does, sum the values.

I think the aproach for this is like this:

(
    dates.with_columns(
        pl.sum(
            pl.when(
                (df["Initial Date"] <= pl.col("date")) & (df["Final Date"] >= pl.col("date"))
            ).then(df["Value"]).otherwise(0)
        ).alias("Summed Value")
    )
    
)
 6  95  6
1 Jan 1970

Solution

 3

If you just want to know sum of values on each date within ranges in df, you don't even need dates dataframe.

(
    df
    .with_columns(date = pl.date_ranges("Initial Date", "Final Date"))
    .explode("date")
    .group_by("date", maintain_order = True)
    .agg(pl.col.Value.sum())
)

┌────────────┬───────┐
│ date       ┆ Value │
│ ---        ┆ ---   │
│ date       ┆ i64   │
╞════════════╪═══════╡
│ 2022-01-01 ┆ 10    │
│ 2022-01-02 ┆ 30    │
│ 2022-01-03 ┆ 60    │
│ 2022-01-04 ┆ 90    │
│ 2022-01-05 ┆ 140   │
│ 2022-01-06 ┆ 140   │
│ 2022-01-07 ┆ 120   │
│ 2022-01-08 ┆ 40    │
│ 2022-01-09 ┆ 40    │
└────────────┴───────┘

If you really want to use dates then you can join() result on dates dataframe:

(
    df
    .with_columns(date = pl.date_ranges("Initial Date", "Final Date"))
    .explode("date")
    .group_by("date", maintain_order = True)
    .agg(pl.col.Value.sum())
    .join(dates, on="date", how="semi")
)

┌────────────┬───────┐
│ date       ┆ Value │
│ ---        ┆ ---   │
│ date       ┆ i64   │
╞════════════╪═══════╡
│ 2022-01-01 ┆ 10    │
│ 2022-01-02 ┆ 30    │
│ 2022-01-03 ┆ 60    │
│ 2022-01-04 ┆ 90    │
│ 2022-01-05 ┆ 140   │
│ 2022-01-06 ┆ 140   │
│ 2022-01-07 ┆ 120   │
└────────────┴───────┘

Or just filter() the result:

(
    df
    .with_columns(date = pl.date_ranges("Initial Date", "Final Date"))
    .explode("date")
    .group_by("date", maintain_order = True)
    .agg(pl.col.Value.sum())
    .filter(pl.col.date.is_between(datetime(2022,1,1), datetime(2022,1,7)))
)

┌────────────┬───────┐
│ date       ┆ Value │
│ ---        ┆ ---   │
│ date       ┆ i64   │
╞════════════╪═══════╡
│ 2022-01-01 ┆ 10    │
│ 2022-01-02 ┆ 30    │
│ 2022-01-03 ┆ 60    │
│ 2022-01-04 ┆ 90    │
│ 2022-01-05 ┆ 140   │
│ 2022-01-06 ┆ 140   │
│ 2022-01-07 ┆ 120   │
└────────────┴───────┘

Alternative solution would be to use join on inequality, but polars is not great on it (yet). But in this case you can use DuckDB integration with Polars.

duckdb.sql("""
    select
        d.date,
        sum(df.value) as value
    from df
        inner join dates as d on
            d.date between df."Initial Date" and df."Final Date"
    group by
        d.date
    order by
        d.date
""").pl()

┌─────────────────────┬───────────────┐
│ date                ┆ value         │
│ ---                 ┆ ---           │
│ datetime[μs]        ┆ decimal[38,0] │
╞═════════════════════╪═══════════════╡
│ 2022-01-01 00:00:00 ┆ 10            │
│ 2022-01-02 00:00:00 ┆ 30            │
│ 2022-01-03 00:00:00 ┆ 60            │
│ 2022-01-04 00:00:00 ┆ 90            │
│ 2022-01-05 00:00:00 ┆ 140           │
│ 2022-01-06 00:00:00 ┆ 140           │
│ 2022-01-07 00:00:00 ┆ 120           │
└─────────────────────┴───────────────┘
2024-07-04
Roman Pekar