Question

ComputeError: dynamic pattern length in 'str.replace' expressions is not supported yet

What is the Polars expression way to achieve this?

import polars as pl

df = pl.from_repr("""
┌───────────────────────────────┬───────────────────────────┐
│ document_url                  ┆ matching_seed_url         │
│ ---                           ┆ ---                       │
│ str                           ┆ str                       │
╞═══════════════════════════════╪═══════════════════════════╡
│ https://document_url.com/1234 ┆ https://document_url.com/ │
│ https://document_url.com/5678 ┆ https://document_url.com/ │
└───────────────────────────────┴───────────────────────────┘""")
df = df.with_columns(
    pl.when(pl.col("matching_seed_url").is_not_null())
    .then(pl.col("document_url").str.replace(pl.col("matching_seed_url"), ""))
    .otherwise(pl.lit(""))
    .alias("extracted_id"))

I get:

ComputeError: dynamic pattern length in 'str.replace' expressions is not supported yet

How do I extract 1234 and 5678 here?


1 Jan 1970

Solution

If document_url always starts with matching_seed_url, you can use .str.strip_prefix(), which accepts an expression as the prefix:

df.with_columns(
    pl.when(pl.col.matching_seed_url.is_not_null())
      .then(
          pl.col.document_url.str.strip_prefix(pl.col.matching_seed_url)
      )
      .alias("extracted_id")
)

2024-07-24

MaxDre

Solution

There is a feature request to allow that, which is yet to be implemented:

https://github.com/pola-rs/polars/issues/14367

It can be emulated using a window function. Example data:

df = pl.DataFrame({
    "document_url": ["https://document_url.com/1234", "yolo", "https://document_url.com/5678"],
    "matching_seed_url": ["https://document_url.com/", None, "https://document_url.com/"]
})

If you group over each pattern with .over(), every group contains a single distinct pattern, so you can pass its .first() item to .str.replace():

df.with_columns(
    pl.when(pl.col.matching_seed_url.is_not_null())
      .then(
          pl.col.document_url.str.replace(pl.col.matching_seed_url.fill_null("").first(), "", literal=True)
      )
      .over("matching_seed_url")
      .alias("extracted_id")
)

Notes:

.str.replace treats the pattern as a regex by default, so we pass literal=True to stop metacharacters in the URLs (such as .) from being interpreted.
Nulls are not allowed in the pattern, so we .fill_null("") to avoid an error.

If you don't need regular expressions, .str.replace_many() could also be an option.

df.with_columns(
    pl.when(pl.col.matching_seed_url.is_not_null())
      .then(
          pl.col.document_url.str.replace_many(pl.col.matching_seed_url.drop_nulls(), "")
      )
      .alias("extracted_id")
)

shape: (3, 3)
┌───────────────────────────────┬───────────────────────────┬──────────────┐
│ document_url                  ┆ matching_seed_url         ┆ extracted_id │
│ ---                           ┆ ---                       ┆ ---          │
│ str                           ┆ str                       ┆ str          │
╞═══════════════════════════════╪═══════════════════════════╪══════════════╡
│ https://document_url.com/1234 ┆ https://document_url.com/ ┆ 1234         │
│ yolo                          ┆ null                      ┆ null         │
│ https://document_url.com/5678 ┆ https://document_url.com/ ┆ 5678         │
└───────────────────────────────┴───────────────────────────┴──────────────┘

replace_many does have different semantics: all patterns will be applied to each "row".

So depending on your actual data, this may or may not be desired:

df = pl.DataFrame({"foo": ["abcdef", "abcdgh"], "bar": ["ab", "cd"]})

df.with_columns(out = pl.col.foo.str.replace_many(pl.col.bar, ""))

shape: (2, 3)
┌────────┬─────┬─────┐
│ foo    ┆ bar ┆ out │
│ ---    ┆ --- ┆ --- │
│ str    ┆ str ┆ str │
╞════════╪═════╪═════╡
│ abcdef ┆ ab  ┆ ef  │
│ abcdgh ┆ cd  ┆ gh  │
└────────┴─────┴─────┘

2024-06-29

jqurious