Question

How to compute a column in a polars DataFrame using np.linspace

Consider the following pl.DataFrame:

df = pl.DataFrame(
    data={
        "np_linspace_start": [0, 0, 0], 
        "np_linspace_stop": [8, 6, 7],
        "np_linspace_num": [5, 4, 4]
    }
)

shape: (3, 3)
┌───────────────────┬──────────────────┬─────────────────┐
│ np_linspace_start ┆ np_linspace_stop ┆ np_linspace_num │
│ ---               ┆ ---              ┆ ---             │
│ i64               ┆ i64              ┆ i64             │
╞═══════════════════╪══════════════════╪═════════════════╡
│ 0                 ┆ 8                ┆ 5               │
│ 0                 ┆ 6                ┆ 4               │
│ 0                 ┆ 7                ┆ 4               │
└───────────────────┴──────────────────┴─────────────────┘

How can I create a new column ls that is the result of the np.linspace function? This column would hold an np.array per row.

I was looking for something along those lines:

df.with_columns(
    ls=np.linspace(
        start=pl.col("np_linspace_start"),
        stop=pl.col("np_linspace_stop"),
        num=pl.col("np_linspace_num")
    )
)

Is there a polars equivalent to np.linspace?


Solution


As mentioned in the comments, adding an np.linspace-style function to polars is an open feature request. Until it is implemented, a simple version built from polars' native expression API could look as follows.

First, we use pl.int_range (thanks to @Dean MacGregor) to create a range of integers from 0 to num (exclusive). Next, we rescale and shift the range according to start, stop, and num. Finally, we collect the range into a list per row with pl.Expr.implode, using a window over the row index so that each row gets its own range.

def pl_linspace(start: pl.Expr, stop: pl.Expr, num: pl.Expr) -> pl.Expr:
    grid = pl.int_range(num)             # 0, 1, ..., num - 1
    _scale = (stop - start) / (num - 1)  # step size between samples
    _offset = start
    # evaluate per row (window over the row index) and collect into a list
    return (grid * _scale + _offset).implode().over(pl.int_range(pl.len()))

df.with_columns(
    pl_linspace(
        start=pl.col("np_linspace_start"),
        stop=pl.col("np_linspace_stop"),
        num=pl.col("np_linspace_num"),
    ).alias("pl_linspace")
)
shape: (3, 4)
┌───────────────────┬──────────────────┬─────────────────┬────────────────────────────────┐
│ np_linspace_start ┆ np_linspace_stop ┆ np_linspace_num ┆ pl_linspace                    │
│ ---               ┆ ---              ┆ ---             ┆ ---                            │
│ i64               ┆ i64              ┆ i64             ┆ list[f64]                      │
╞═══════════════════╪══════════════════╪═════════════════╪════════════════════════════════╡
│ 0                 ┆ 8                ┆ 5               ┆ [0.0, 2.0, 4.0, 6.0, 8.0]      │
│ 0                 ┆ 6                ┆ 4               ┆ [0.0, 2.0, 4.0, 6.0]           │
│ 0                 ┆ 7                ┆ 4               ┆ [0.0, 2.333333, 4.666667, 7.0] │
└───────────────────┴──────────────────┴─────────────────┴────────────────────────────────┘

Note. If num is 1, the division when computing _scale will result in infinite values. This can be avoided by adding the following line to pl_linspace, right after _scale is first computed.

_scale = pl.when(_scale.is_infinite()).then(pl.lit(0)).otherwise(_scale)
2024-07-11
Hericks

Solution


I really like @Hericks's answer, but it brings up an issue that comes up from time to time: lists can't (yet) do element-wise operations against other lists, so we work around it by exploding and then imploding with an over. That over is really expensive, and if we bring in our old pal pyarrow to help us out, we can avoid it.

Here's an alternative to pl_linspace that uses pyarrow.

import pyarrow as pa
def pals_linspace(
    start: pl.Expr | str, stop: pl.Expr | str, num: pl.Expr | str
) -> pl.Expr:
    if isinstance(start, str):
        start = pl.col(start)
    if isinstance(stop, str):
        stop = pl.col(stop)
    if isinstance(num, str):
        num = pl.col(num)
        
    grid = pl.int_ranges(num)
    # Note the repeat_by here to broadcast the scalars to the same shape as grid
    scale = ((stop - start) / (num - 1)).repeat_by(grid.list.len())
    offset = start.repeat_by(grid.list.len())
    
    # Need to make a struct so all the columns we need are in a Series
    all_struct = pl.struct(
        grid.alias("grid"), scale.alias("scale"), offset.alias("offset")
    )
    return all_struct.map_batches(
        lambda s: (
            pl.from_arrow(
                # This is the key to avoiding over, its first argument is the offsets
                # of the list we want to end up with. Since we want to end up with the 
                # same shape as `grid` we can just use that as-is
                
                # The second argument is just @Hericks's formula
                pa.LargeListArray.from_arrays(
                    s.struct.field("grid").to_arrow().offsets,
                    (
                        s.struct.field("grid").explode()
                        * s.struct.field("scale").explode()
                        + s.struct.field("offset").explode()
                    ).to_arrow(),
                )
            )
        )
    )

Performance

If I run %%timeit on pl_linspace and pals_linspace with the given df, then pl_linspace is faster. However, if we do

df = pl.DataFrame(
    data={
        "np_linspace_start": [0, 0, 0] * 100,
        "np_linspace_stop": [8, 6, 7] * 100,
        "np_linspace_num": [5, 4, 4] * 100,
    }
)

then

%%timeit 
df.with_columns(
    ls=pals_linspace("np_linspace_start", "np_linspace_stop", "np_linspace_num")
)
2.95 ms ± 259 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.with_columns(
    pl_linspace(
        start=pl.col("np_linspace_start"),
        stop=pl.col("np_linspace_stop"),
        num=pl.col("np_linspace_num"),
    ).alias("pl_linspace")
)
9.9 ms ± 722 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2024-07-11
Dean MacGregor

Solution


I think both answers are good, but as I was reading @Dean MacGregor's answer I got curious whether we can do it faster.

At the moment I don't think a pure polars solution can be faster, but I've found that the combination of Polars + DuckDB usually works really well.

So here's a duckdb solution that uses the generate_series() and list_transform() functions:

duckdb.sql("""
    select
        np_linspace_start,
        np_linspace_stop,
        np_linspace_num,
        list_transform(
            generate_series(0, np_linspace_num - 1),
            x -> x * np_linspace_stop / (np_linspace_num - 1)
        ) as pl_linspace
    from df
""")

┌───────────────────┬──────────────────┬─────────────────┬────────────────────────────────┐
│ np_linspace_start ┆ np_linspace_stop ┆ np_linspace_num ┆ pl_linspace                    │
│ ---               ┆ ---              ┆ ---             ┆ ---                            │
│ i64               ┆ i64              ┆ i64             ┆ list[f64]                      │
╞═══════════════════╪══════════════════╪═════════════════╪════════════════════════════════╡
│ 0                 ┆ 8                ┆ 5               ┆ [0.0, 2.0, 4.0, 6.0, 8.0]      │
│ 0                 ┆ 6                ┆ 4               ┆ [0.0, 2.0, 4.0, 6.0]           │
│ 0                 ┆ 7                ┆ 4               ┆ [0.0, 2.333333, 4.666667, 7.0] │
└───────────────────┴──────────────────┴─────────────────┴────────────────────────────────┘

In my tests this worked ~5 times faster than the polars solution with explode() and ~2 times faster than the pyarrow solution:

%%timeit

duckdb.sql("""
    select
        np_linspace_start,
        np_linspace_stop,
        np_linspace_num,
        list_transform(
            generate_series(0, np_linspace_num - 1),
            x -> x * np_linspace_stop / (np_linspace_num - 1)
        ) as pl_linspace
    from df
""")

472 µs ± 47 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit

df.with_columns(
    pl_linspace(
        start=pl.col("np_linspace_start"),
        stop=pl.col("np_linspace_stop"),
        num=pl.col("np_linspace_num"),
    ).alias("pl_linspace")
)

2.1 ms ± 341 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit

df.with_columns(
    ls=pals_linspace("np_linspace_start", "np_linspace_stop", "np_linspace_num")
)

826 µs ± 69.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Update: As Dean MacGregor pointed out in the comments, the duckdb code doesn't include the .pl() call that would convert the result back to polars. I left it out because, in my experience, the conversion to polars does not scale linearly with the size of the dataframe. Our test dataframe is only 300 rows, so the conversion to polars might take a significant part of the execution time.

I've tried all three methods on a larger dataframe (3M rows); the timings are noisier, of course, because I didn't run them in a loop. It looks like the pyarrow method wins, with duckdb a close second. On 30M rows I didn't wait long enough for the explode() solution to finish, but both the pyarrow and duckdb solutions completed within 20 seconds.

%%time

a = df.with_columns(
    pl_linspace(
        start=pl.col("np_linspace_start"),
        stop=pl.col("np_linspace_stop"),
        num=pl.col("np_linspace_num"),
    ).alias("pl_linspace")
)

CPU times: total: 6.36 s
Wall time: 18 s
%%time

a = duckdb.sql("""
    select
        np_linspace_start,
        np_linspace_stop,
        np_linspace_num,
        list_transform(generate_series(0, np_linspace_num - 1), x -> x * np_linspace_stop / (np_linspace_num - 1)) as pl_linspace
    from df
""").pl()

CPU times: total: 2.08 s
Wall time: 1.03 s
%%time

a = df.with_columns(
    ls=pals_linspace("np_linspace_start", "np_linspace_stop", "np_linspace_num")
)

CPU times: total: 781 ms
Wall time: 1.29 s
2024-07-11
Roman Pekar

Solution


The resulting lists should look something like int_ranges(start, stop + 1, (stop - start) / (num - 1)). The problem we're facing is that (stop - start) / (num - 1) is not an integer, so we can't use int_ranges().

The question is: could we modify the solution so that we create integer ranges with int_ranges() and then convert them into the resulting float ranges?

We can't multiply list elements by another column (yet). But what we can do is multiply or divide the elements of a list by a variable.

So what if we had a number p such that, for every row, the expression p * (stop - start) / (num - 1) results in an integer? Then we could create integer range lists with int_ranges(p * start, p * stop + 1, p * (stop - start) / (num - 1)) and then just divide all the elements by p.

To do that we need the least common multiple of all the (num - 1) values, divided by the greatest common divisor of all the (stop - start) values. I don't think a polars function for that exists, but we can use numpy.lcm() and numpy.gcd().
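For the example frame, num - 1 takes the values {4, 3} and stop - start takes {8, 6, 7}, so p = lcm(4, 3) / gcd(8, 6, 7) = 12 / 1 = 12, and each step p * (stop - start) / (num - 1) indeed comes out integral (a quick check):

```python
import numpy as np

n1 = [5 - 1, 4 - 1]          # unique values of num - 1  -> {4, 3}
n2 = [8 - 0, 6 - 0, 7 - 0]   # unique values of stop - start -> {8, 6, 7}
p = np.lcm.reduce(n1) / np.gcd.reduce(n2)  # lcm(4, 3) / gcd(8, 6, 7)

# p * (stop - start) / (num - 1) per row
steps = [p * s / n for s, n in zip(n2, [4, 3, 3])]
```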

The solution then would look like:

n1 = list(df.select(pl.col.np_linspace_num - 1).unique().to_series())
n2 = list(df.select(pl.col.np_linspace_stop - pl.col.np_linspace_start).unique().to_series())
p = np.lcm.reduce(n1) / np.gcd.reduce(n2)

df.with_columns(
    pl
    .int_ranges(
        p * pl.col.np_linspace_start,
        p * (pl.col.np_linspace_stop + 1),
        p * (pl.col.np_linspace_stop - pl.col.np_linspace_start) / (pl.col.np_linspace_num - 1)
    )
    .list.eval(pl.element() / p)
    .alias("ls")
)

┌───────────────────┬──────────────────┬─────────────────┬────────────────────────────────┐
│ np_linspace_start ┆ np_linspace_stop ┆ np_linspace_num ┆ ls                             │
│ ---               ┆ ---              ┆ ---             ┆ ---                            │
│ i64               ┆ i64              ┆ i64             ┆ list[f64]                      │
╞═══════════════════╪══════════════════╪═════════════════╪════════════════════════════════╡
│ 0                 ┆ 8                ┆ 5               ┆ [0.0, 2.0, 4.0, 6.0, 8.0]      │
│ 0                 ┆ 6                ┆ 4               ┆ [0.0, 2.0, 4.0, 6.0]           │
│ 0                 ┆ 7                ┆ 4               ┆ [0.0, 2.333333, 4.666667, 7.0] │
└───────────────────┴──────────────────┴─────────────────┴────────────────────────────────┘

It also seems to be quite fast; on 3M rows:

%%time

n1 = list(df.select(pl.col.np_linspace_num - 1).unique().to_series())
n2 = list(df.select(pl.col.np_linspace_stop - pl.col.np_linspace_start).unique().to_series())
p = np.lcm.reduce(n1) / np.gcd.reduce(n2)

a = df.with_columns(
    pl
    .int_ranges(
        p * pl.col.np_linspace_start,
        p * (pl.col.np_linspace_stop + 1),
        p * (pl.col.np_linspace_stop - pl.col.np_linspace_start) / (pl.col.np_linspace_num - 1)
    )
    .list.eval(pl.element() / p)
    .alias("ls")
)

CPU times: total: 750 ms
Wall time: 908 ms
2024-07-11
Roman Pekar