I think both answers are good, but reading @Dean MacGregor's answer made me curious whether we can do it faster. At the moment I don't think there is a pure-Polars solution that is faster, but I've found that the combination of Polars + DuckDB usually works really well. So here's a duckdb solution which uses the generate_series() and list_transform() functions:
duckdb.sql("""
    select
        np_linspace_start,
        np_linspace_stop,
        np_linspace_num,
        list_transform(
            generate_series(0, np_linspace_num - 1),
            -- note: this assumes np_linspace_start = 0; the general formula
            -- would be start + x * (stop - start) / (num - 1)
            x -> x * np_linspace_stop / (np_linspace_num - 1)
        ) as pl_linspace
    from df
""")
┌───────────────────┬──────────────────┬─────────────────┬────────────────────────────────┐
│ np_linspace_start ┆ np_linspace_stop ┆ np_linspace_num ┆ pl_linspace │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ list[f64] │
╞═══════════════════╪══════════════════╪═════════════════╪════════════════════════════════╡
│ 0 ┆ 8 ┆ 5 ┆ [0.0, 2.0, 4.0, 6.0, 8.0] │
│ 0 ┆ 6 ┆ 4 ┆ [0.0, 2.0, 4.0, 6.0] │
│ 0 ┆ 7 ┆ 4 ┆ [0.0, 2.333333, 4.666667, 7.0] │
└───────────────────┴──────────────────┴─────────────────┴────────────────────────────────┘
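As a sanity check, the per-row computation inside list_transform() is the standard linspace formula specialized to a zero start. A minimal plain-Python sketch of the same per-row logic (linspace_list is a hypothetical helper, not part of either library):

```python
def linspace_list(start, stop, num):
    """Mirror of the SQL lambda: generate_series(0, num - 1) mapped through
    start + x * (stop - start) / (num - 1); with start = 0 this reduces to
    x * stop / (num - 1), exactly as in the query above."""
    return [start + x * (stop - start) / (num - 1) for x in range(num)]

print(linspace_list(0, 8, 5))  # first row of the output above: [0.0, 2.0, 4.0, 6.0, 8.0]
```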
In my tests this ran ~5 times faster than the Polars solution with explode() and ~2 times faster than the pyarrow solution:
%%timeit
duckdb.sql("""
    select
        np_linspace_start,
        np_linspace_stop,
        np_linspace_num,
        list_transform(
            generate_series(0, np_linspace_num - 1),
            x -> x * np_linspace_stop / (np_linspace_num - 1)
        ) as pl_linspace
    from df
""")
472 µs ± 47 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
df.with_columns(
    pl_linspace(
        start=pl.col("np_linspace_start"),
        stop=pl.col("np_linspace_stop"),
        num=pl.col("np_linspace_num"),
    ).alias("pl_linspace")
)
2.1 ms ± 341 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.with_columns(
    ls=pals_linspace("np_linspace_start", "np_linspace_stop", "np_linspace_num")
)
826 µs ± 69.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Update: As Dean MacGregor pointed out in the comments, the duckdb code above doesn't include the .pl() method, which converts the result back to Polars. I left it out because, in my experience, conversion to Polars does not scale linearly with the size of the dataframe. Our test dataframe is only 300 rows, so the conversion to Polars might take a significant part of the execution time.
I've tried all three methods on a larger dataframe (3M rows). The results are more variable, of course, since I didn't run them in a loop. It looks like the pyarrow method wins, with duckdb a close second. On 30M rows I didn't wait long enough for the explode() solution to finish, but both the pyarrow and duckdb solutions completed within 20 seconds.
%%time
a = df.with_columns(
    pl_linspace(
        start=pl.col("np_linspace_start"),
        stop=pl.col("np_linspace_stop"),
        num=pl.col("np_linspace_num"),
    ).alias("pl_linspace")
)
CPU times: total: 6.36 s
Wall time: 18 s
%%time
a = duckdb.sql("""
    select
        np_linspace_start,
        np_linspace_stop,
        np_linspace_num,
        list_transform(
            generate_series(0, np_linspace_num - 1),
            x -> x * np_linspace_stop / (np_linspace_num - 1)
        ) as pl_linspace
    from df
""").pl()
CPU times: total: 2.08 s
Wall time: 1.03 s
%%time
a = df.with_columns(
    ls=pals_linspace("np_linspace_start", "np_linspace_stop", "np_linspace_num")
)
CPU times: total: 781 ms
Wall time: 1.29 s