Slicing multiple chunks in a polars dataframe

Consider the following dataframe.

import polars as pl

df = pl.DataFrame(data={"col1": range(10)})

┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 0    │
│ 1    │
│ 2    │
│ 3    │
│ 4    │
│ 5    │
│ 6    │
│ 7    │
│ 8    │
│ 9    │
└──────┘

Let's say I have a list of tuples, where the first value is the start index and the second is the length (as used in pl.DataFrame.slice). This might look like this:

slices = [(1,2), (5,3)]

Now, what's a good way to extract two chunks from df, where the first chunk starts at row 1 and has a length of 2, and the second starts at row 5 and has a length of 3?

Here's what I am looking for:

┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 5    │
│ 6    │
│ 7    │
└──────┘

Solution

You could use pl.DataFrame.slice to obtain each slice separately and then use pl.concat to concatenate all slices.

pl.concat(df.slice(offset, length) for offset, length in slices)

shape: (5, 1)
┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 5    │
│ 6    │
│ 7    │
└──────┘

Edit. As an attempt at a vectorized approach, you could first use the list of slice parameters to create a dataframe of indices (using pl.int_ranges and pl.DataFrame.explode). This dataframe of indices can then be joined with df to select the matching rows.

indices = (
    pl.DataFrame(slices, orient="row", schema=["offset", "length"])
    .select(
        index=pl.int_ranges("offset", pl.col("offset") + pl.col("length"))
    )
    .explode("index")
)

shape: (5, 1)
┌───────┐
│ index │
│ ---   │
│ i64   │
╞═══════╡
│ 1     │
│ 2     │
│ 5     │
│ 6     │
│ 7     │
└───────┘
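
To make the index construction concrete, here is a plain-Python sketch of what the int_ranges and explode steps compute: each (offset, length) pair becomes the half-open range from offset up to, but not including, offset + length, and the ranges are then flattened into one list. The function name expand_slices is hypothetical and used only for illustration.

```python
def expand_slices(slices: list[tuple[int, int]]) -> list[int]:
    """Flatten (offset, length) pairs into the row indices they cover."""
    return [
        index
        for offset, length in slices
        # range() excludes its end point, so exactly `length` indices
        # are produced for each pair.
        for index in range(offset, offset + length)
    ]


print(expand_slices([(1, 2), (5, 3)]))  # [1, 2, 5, 6, 7]
```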

(
    indices
    .join(
        df,
        left_on="index",
        right_on=pl.int_range(pl.len()),
        how="left",
        coalesce=True,
    )
    .drop("index")
)

shape: (5, 1)
┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 5    │
│ 6    │
│ 7    │
└──────┘

2024-07-05

Hericks