Question

Create date ranges from a sample rate and number of samples using Polars

I have a time-series dataframe in Polars.

import polars as pl
from datetime import datetime

df = pl.DataFrame(
    {
        "sample_started_at": [
            datetime(2022, 1, 1, hour=1, minute=1, second=1),
            datetime(2022, 1, 1, hour=2, minute=1, second=1),
            datetime(2022, 1, 1, hour=3, minute=1, second=1),
        ],
        "sample_rate": [25600, 25600, 51200],
        "sample_size": [100, 200, 100],
    }
)

With columns:

  • sample_started_at: when the measurement started.
  • sample_rate: how many samples were taken per second.
  • sample_size: number of samples in the measurement.

I want to add a list column holding the time at which each sample was taken. The only way I have managed to do this is with pl.datetime_ranges and hard-coded SAMPLE_SIZE and SAMPLE_RATE values.

import polars as pl
from datetime import datetime, timedelta

SAMPLE_SIZE = 100
SAMPLE_RATE = 25600

df.with_columns(
    ranges=pl.datetime_ranges(
        start=pl.col("sample_started_at"),
        end=pl.col("sample_started_at") + timedelta(seconds=1/SAMPLE_RATE * (SAMPLE_SIZE -1)),
        interval=timedelta(seconds=1/SAMPLE_RATE),
    ),
).select(
    pl.col("sample_started_at"),
    pl.col("ranges"),
    ranges_len=pl.col("ranges").list.len()
)

But since those values can differ from row to row, I need to take them from the sample_rate and sample_size columns instead.

Is there another way?

Thanks


Solution


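Currently, pl.datetime_ranges does not support dynamic values for the interval parameter. Per-row date ranges can still be created using pl.Expr.map_elements.

However, there seem to be rounding issues for large sample rates: for sample_rate = 51200 the result contains only 97 timestamps instead of 100. A likely cause is that timedelta has microsecond resolution, so the interval 1/51200 s (19.53125 µs) is rounded to 20 µs.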
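To see where the shortfall for sample_rate = 51200 comes from (the answer's output reports 97 timestamps for that row), the arithmetic can be reproduced with the standard library alone. This is only a sketch: expected_len is a hypothetical helper, and it assumes the range includes both endpoints, so the number of points is floor(span / step) + 1.

```python
from datetime import timedelta


def expected_len(sample_rate: int, sample_size: int) -> int:
    """Count the points a both-ends-inclusive range would contain.

    Hypothetical helper for illustration only. timedelta stores whole
    microseconds, so both the step and the span are rounded on creation.
    """
    step = timedelta(seconds=1 / sample_rate)
    span = timedelta(seconds=(sample_size - 1) / sample_rate)
    # timedelta // timedelta yields an int: how many whole steps fit in the span.
    return span // step + 1


# 1/51200 s is 19.53125 µs, which timedelta rounds to 20 µs.
print(timedelta(seconds=1 / 51200) == timedelta(microseconds=20))
print(expected_len(25600, 100), expected_len(25600, 200), expected_len(51200, 100))
```

At 25600 Hz the step is rounded down from 39.0625 µs to 39 µs, so the counts happen to come out right there, although the timestamps still drift slightly from the ideal grid.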

def create_range(x):
    sample_started_at, sample_rate, sample_size = x.values()
    return pl.datetime_range(
        start=sample_started_at,
        end=sample_started_at + timedelta(seconds=(sample_size-1) / sample_rate),
        interval=timedelta(seconds=1.0 / sample_rate),
        eager=True,
    )

(
    df
    .with_columns(
        pl.struct("sample_started_at", "sample_rate", "sample_size")
        .map_elements(create_range, return_dtype=pl.List(pl.Datetime))
        .alias("ranges")
    )
    .with_columns(
        ranges_len=pl.col("ranges").list.len(),
    )
)

Output:

shape: (3, 5)
┌─────────────────────┬─────────────┬─────────────┬─────────────────────────────────┬────────────┐
│ sample_started_at   ┆ sample_rate ┆ sample_size ┆ ranges                          ┆ ranges_len │
│ ---                 ┆ ---         ┆ ---         ┆ ---                             ┆ ---        │
│ datetime[μs]        ┆ i64         ┆ i64         ┆ list[datetime[μs]]              ┆ u32        │
╞═════════════════════╪═════════════╪═════════════╪═════════════════════════════════╪════════════╡
│ 2022-01-01 01:01:01 ┆ 25600       ┆ 100         ┆ [2022-01-01 01:01:01, 2022-01-… ┆ 100        │
│ 2022-01-01 02:01:01 ┆ 25600       ┆ 200         ┆ [2022-01-01 02:01:01, 2022-01-… ┆ 200        │
│ 2022-01-01 03:01:01 ┆ 51200       ┆ 100         ┆ [2022-01-01 03:01:01, 2022-01-… ┆ 97         │
└─────────────────────┴─────────────┴─────────────┴─────────────────────────────────┴────────────┘
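One way to avoid the short last row is to compute each timestamp from its sample index instead of stepping by a rounded interval. The sketch below is not part of the original answer: sample_times is a hypothetical replacement for create_range that takes the same struct dictionary and rounds each offset to the nearest microsecond independently, so the result always has exactly sample_size entries and the rounding error does not accumulate.

```python
from datetime import datetime, timedelta


def sample_times(x: dict) -> list[datetime]:
    """Return one timestamp per sample, derived from the sample index.

    Hypothetical helper; expects the keys used in the question's frame.
    """
    start = x["sample_started_at"]
    rate = x["sample_rate"]
    size = x["sample_size"]
    # Offset of sample i is i / rate seconds, rounded to whole microseconds.
    return [
        start + timedelta(microseconds=round(i * 1_000_000 / rate))
        for i in range(size)
    ]


row = {
    "sample_started_at": datetime(2022, 1, 1, 3, 1, 1),
    "sample_rate": 51200,
    "sample_size": 100,
}
times = sample_times(row)
print(len(times), times[0], times[-1])
```

Passing sample_times instead of create_range to map_elements in the snippet above, with the same return_dtype, should yield a ranges_len equal to sample_size for every row, though this has not been checked against every Polars version. Spacing between consecutive timestamps is then no longer constant (for example 20 µs followed by 19 µs at 51200 Hz), which is the price of staying close to the true sample times at microsecond resolution.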
2024-06-30
Hericks