Question

Impact on Performance When Using SQL-Like Padding in Polars

Description:

I have been using SQL-like padding in my Polars DataFrame queries for better readability and easier commenting of conditions. This approach involves using True (analogous to 1=1 in SQL) to pad the conditional logic.

SQL Example

In SQL, padding with 1=1 makes it easy to add or remove conditions:

SELECT *
FROM employees
WHERE 1=1
  AND department = 'Sales'
  AND salary > 50000
  AND hire_date > '2020-01-01';
SELECT *
FROM employees
WHERE 1=1
  AND department = 'Sales'
  -- AND salary > 50000
  AND hire_date > '2020-01-01';

(>>>> in my opinion <<<<) This makes adding, removing, and replacing operators very easy. Also it simplifies the git diff because adding and removing conditions appear as a single diff.

Polars Example

Here's the Polars equivalent that I have been using:

# Example data
data = {
    "name": ["Alice", "Bob", "Charlie", "David"],
    "department": ["Sales", "HR", "Sales", "IT"],
    "salary": [60000, 45000, 70000, 50000],
    "hire_date": ["2021-06-01", "2019-03-15", "2020-08-20", "2018-11-05"]
}

# Create a DataFrame
df = pl.DataFrame(data)

# Filter with padding
filtered_df = df.filter(
      (
          True
          & pl.col("department").eq("Sales")
          # & pl.col("salary").gt(50000)
          & pl.col("hire_date").gt("2020-01-01")
          & True
      )
)

print(filtered_df)

Concern

I am concerned about the potential performance impact of this approach in Polars. Does padding the filter conditions with True significantly affect the performance of query execution in Polars? Also is there a better way to do this ?

Thank you for your assistance.

 4  58  4
1 Jan 1970

Solution

 3

Personally for SQL i switched to “and at the end of the line” cause i think it’s much more readable like that. It does make commenting out the last line trickier, but i just don’t like 1=1 dummy condition.

Luckily, in modern dataframe library like polars you don’t need to do that anymore. Just pass multiple filter condutions instead of combined one.

filtered_df = df.filter(
    pl.col("department").eq("Sales"),
    # pl.col("salary").gt(50000),
    pl.col("hire_date").gt("2020-01-01"),
)
2024-07-14
Roman Pekar