Question

Polars - Filter DataFrame using another DataFrame's row's

I have two Dataframes - graph and search with the same schema

Schema for graph:

SCHEMA = {
    START_RANGE: pl.Int64,
    END_RANGE: pl.Int64,
}
Schema for search:

SCHEMA = {
    START: pl.Int64,
    END: pl.Int64,
}

I want to search the graph dataframe. For every row in search dataframe, I want to find all the rows in the graph dataframe where the START_RANGE and END_RANGE values are strictly within the range of the search dataframe.

How to achieve this using polars only?

Example -

graph = pl.DataFrame(
    {
        "START_RANGE": [1, 10, 20, 30],
        "END_RANGE": [5, 15, 25, 35],
    },
)

search = pl.DataFrame(
    {
        "START": [2, 11, 7],
        "END": [5, 14, 12], 
    },
)
# Expected output

[2,5] is in range [1,5]      ->    Where 2>=1   and 5<=5
[11,14] is in range [10,15]  ->    Where 11>=10 and 14<=15
[7,12] is not in any range 

output = pl.DataFrame(
    {
        "START_RANGE": [1, 10],
        "END_RANGE": [5, 15],
    },
)
 3  64  3
1 Jan 1970

Solution

 3

Polars doesn't have full range joins but it does have join_asof which you can combine with a filter to get better performance than a full cross join filter. It's a left join so it'll have a return for every row in the left df.

It would look like

(
    search
    .sort('START')
    .join_asof(
        graph.sort('START_RANGE'), 
        right_on='START_RANGE', left_on='START'
        )
    .filter(pl.col('END')<=pl.col('END_RANGE'))
    .select('START_RANGE','END_RANGE')
)
shape: (2, 2)
┌─────────────┬───────────┐
│ START_RANGE ┆ END_RANGE │
│ ---         ┆ ---       │
│ i64         ┆ i64       │
╞═════════════╪═══════════╡
│ 1           ┆ 5         │
│ 10          ┆ 15        │
└─────────────┴───────────┘

The join_asof ensures that START>=START_RANGE and then the post filter ensures that END<=END_RANGE

You can also approach it from the opposite angle like

(
    
    graph.sort('START_RANGE')
    .join_asof(
        search.sort('START'), 
       left_on ='START_RANGE', right_on='START',
       strategy='forward'
        )
    .filter(pl.col('START')>=pl.col('START_RANGE'))
    .select('START_RANGE','END_RANGE')
)

Depending on your data, one of those will be better than the other. For instance if you have

search = pl.DataFrame(
    {
        "START": [2, 3, 11, 7],
        "END": [5, 4, 14, 12], 
    },
)

then the second approach returns the same thing as the original search but the first approach will return an extra row of (1, 5) because of the (3,4) row.

2024-07-25
Dean MacGregor

Solution

 3

As the others have answered, inequality joins are not yet fully supported by Polars. If you have duckdb installed, you can leverage its support for this type of join and transition back to your Polars DataFrame quite readyily:

import polars as pl
import duckdb

graph = pl.DataFrame({
    "START_RANGE": [1, 10, 20, 30],
    "END_RANGE": [5, 15, 25, 35],
})

search = pl.DataFrame({
    "START": [2, 11, 7],
    "END": [5, 14, 12],
})

result = duckdb.sql('''
    select * from
    graph g, search s
    where (
      g.START_RANGE <= s.START
      and s.END <= g.END_RANGE
    );
''').pl()

print(result)
# shape: (2, 4)
# ┌─────────────┬───────────┬───────┬─────┐
# │ START_RANGE ┆ END_RANGE ┆ START ┆ END │
# │ ---         ┆ ---       ┆ ---   ┆ --- │
# │ i64         ┆ i64       ┆ i64   ┆ i64 │
# ╞═════════════╪═══════════╪═══════╪═════╡
# │ 1           ┆ 5         ┆ 2     ┆ 5   │
# │ 10          ┆ 15        ┆ 11    ┆ 14  │
# └─────────────┴───────────┴───────┴─────┘
2024-07-25
Cameron Riddell

Solution

 0

You could perform a cross-join then filter rows for which START_RANGE and END_RANGE is_between START/END. Optionally drop the intermediate columns and potential duplicates with unique if several ranges can be matched:

out = (graph.join(search, how='cross')
       .filter(pl.col('START').is_between(pl.col('START_RANGE'), pl.col('END_RANGE'))
              &pl.col('END').is_between(pl.col('START_RANGE'), pl.col('END_RANGE'))
              )
       .drop(['START', 'END'])
       .unique()
      )

If you are sure that START ≥ END, you could simplify to:

out = (graph.join(search, how='cross')
       .filter(pl.col('START').ge(pl.col('START_RANGE'))
              &pl.col('END').le(pl.col('END_RANGE'))
              )
       .drop(['START', 'END'])
       .unique()
      )

Output:

┌─────────────┬───────────┐
│ START_RANGE ┆ END_RANGE │
│ ---         ┆ ---       │
│ i64         ┆ i64       │
╞═════════════╪═══════════╡
│ 10          ┆ 15        │
│ 1           ┆ 5         │
└─────────────┴───────────┘
2024-07-25
mozway