Question

How to use DataFrameMapper to delete rows with a null value in a specific column?

I am using DataFrameMapper from sklearn-pandas to preprocess my data. I don't want to impute values for a specific column; I just want to drop the row if that column is null. Is there a way to do that?

Solution

I would recommend filtering before the transformation. Otherwise the mapper spends time transforming rows that are discarded afterwards, which is wasteful if your dataset contains many null values:

import pandas as pd

# Keep only the rows where column 'xxx' is not null
df = df.dropna(subset=['xxx'])

Then proceed with DataFrameMapper similar to the following:

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler

mapper = DataFrameMapper([
    ('xxx', StandardScaler()),
], df_out=True)

# Transform the data
new_data = mapper.fit_transform(df.copy())
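The filtering step can be checked on a small frame before wiring it into a mapper. The following is a minimal sketch with made-up data: the column name 'other' and all values are assumptions for illustration, and only 'xxx' comes from the answer above.

```python
import numpy as np
import pandas as pd

# Toy data: 'xxx' is the column whose nulls should remove the row.
# 'other' also contains a null, which must NOT cause a drop.
df = pd.DataFrame({
    "xxx": [1.0, np.nan, 3.0, 4.0],
    "other": [10.0, 20.0, np.nan, 40.0],
})

# Drop rows only where 'xxx' is null; nulls elsewhere are left alone.
filtered = df.dropna(subset=["xxx"])

# Equivalent boolean-mask form of the same filter.
masked = df[df["xxx"].notna()]

print(len(df), len(filtered))      # 4 3
print(filtered.index.tolist())     # [0, 2, 3]
```

The surviving rows keep their original index labels. If a later step relies on positions rather than labels, calling reset_index(drop=True) on the filtered frame avoids surprises.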

2024-07-24

Melanie Shebel

Solution

This can be done very easily before or after using the DataFrameMapper:

df_filtered = df[~df['specific column name'].isnull()]

To do it using the DataFrameMapper itself, you would need to build a custom transformer like so:

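from sklearn.base import BaseEstimator, TransformerMixin

class DropNullTransformer(BaseEstimator, TransformerMixin):
    """Drop rows whose value in `column` is null. Expects a DataFrame."""
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Stateless: nothing to learn from the data
        return self

    def transform(self, X):
        return X.dropna(subset=[self.column])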

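From there, include this transformer when building the DataFrameMapper. Pass input_df=True so that the transformer receives a DataFrame rather than a NumPy array, which has no dropna method:

dfm = DataFrameMapper([
    ([specificColumnName], DropNullTransformer(specificColumnName))
], input_df=True)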

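Calling fit_transform on the mapper will then perform the drop for you. Be aware that only this transformer's output loses rows: other columns in the same mapper are not filtered, so combining it with other transformers will likely fail when the outputs of different lengths are stacked. Filtering the DataFrame beforehand avoids this. To learn more about custom transformers, you can read the scikit-learn guide.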
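The transformer can also be exercised on its own, outside any mapper, to confirm that it drops the expected rows. This is a minimal sketch that repeats the class from this answer. The toy frame and the column names 'age' and 'city' are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DropNullTransformer(BaseEstimator, TransformerMixin):
    """Drop rows whose value in `column` is null (expects a DataFrame)."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn; present for scikit-learn API compatibility.
        return self

    def transform(self, X):
        return X.dropna(subset=[self.column])


# Hypothetical toy frame; 'age' stands in for the specific column.
df = pd.DataFrame({"age": [25.0, np.nan, 40.0], "city": ["A", "B", "C"]})

dropper = DropNullTransformer("age")
result = dropper.fit_transform(df)

# Original index labels of the rows that survived the drop.
kept = result.index.tolist()
print(kept)  # [0, 2]
```

The transformer only sees X, so a separate target vector y is not shortened to match. The kept index labels can be used to select the corresponding targets by hand.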

2024-07-24

Kartik Chugh