Question

Efficiently remove rows from pandas df based on second latest time in column

I have a pandas DataFrame that looks similar to this:

Index   ID               time_1               time_2
0      101  2024-06-20 14:32:22  2024-06-20 14:10:31
1      101  2024-06-20 15:21:31  2024-06-20 14:32:22
2      101  2024-06-20 15:21:31  2024-06-20 15:21:31
3      102  2024-06-20 16:26:51  2024-06-20 15:21:31
4      102  2024-06-20 16:26:51  2024-06-20 16:56:24
5      103  2024-06-20 20:05:44  2024-06-20 21:17:35
6      103  2024-06-20 22:41:22  2024-06-20 22:21:31
7      103  2024-06-20 23:11:56  2024-06-20 23:01:31

For each ID in my df I want to take the second latest time_1 (if it exists). I then want to compare this time with the timestamps in time_2 and remove all rows from my df where time_2 is earlier than this time. My expected output would be:

Index   ID               time_1               time_2
1      101  2024-06-20 15:21:31  2024-06-20 14:32:22
2      101  2024-06-20 15:21:31  2024-06-20 15:21:31
3      102  2024-06-20 16:26:51  2024-06-20 15:21:31
4      102  2024-06-20 16:26:51  2024-06-20 16:56:24
7      103  2024-06-20 23:11:56  2024-06-20 23:01:31

This problem is above my pandas level. I asked ChatGPT, and the solution it gave does in principle what I want:

import pandas as pd

ids = [101, 101, 101, 102, 102, 103, 103, 103]
time_1 = ['2024-06-20 14:32:22', '2024-06-20 15:21:31', '2024-06-20 15:21:31', '2024-06-20 16:26:51', '2024-06-20 16:26:51', '2024-06-20 20:05:44', '2024-06-20 22:41:22', '2024-06-20 23:11:56']
time_2 = ['2024-06-20 14:10:31', '2024-06-20 14:32:22', '2024-06-20 15:21:31', '2024-06-20 15:21:31', '2024-06-20 16:56:24', '2024-06-20 21:17:35', '2024-06-20 22:21:31', '2024-06-20 23:01:31']


df = pd.DataFrame({
    'id': ids,
    'time_1': pd.to_datetime(time_1),
    'time_2': pd.to_datetime(time_2)
})

mask = pd.Series(False, index=df.index)

for id_value, group in df.groupby('id'):
    # Remove duplicates and sort timestamps
    unique_sorted_times = group['time_1'].drop_duplicates().sort_values()

    # Check if there's more than one unique time
    if len(unique_sorted_times) > 1:
        # Select the second last time
        second_last_time = unique_sorted_times.iloc[-2]
        # Update the mask for rows with time_2 greater than or equal to the second last time_1
        mask |= (df['id'] == id_value) & (df['time_2'] >= second_last_time)
    else:
        # If there's only one unique time, keep the row(s)
        mask |= (df['id'] == id_value)

filtered_data = df[mask]

My issue with this solution is the for-loop, which seems rather inefficient, and my real data is quite large. I am also curious whether there is a better, more efficient solution for this.


Solution


IIUC, you could use groupby.transform with drop_duplicates to remove duplicate values, and .iloc[-2:-1].squeeze() to get the second to last value if there is one, else NaT. Then perform boolean indexing:

out = df.loc[df.sort_values(by='time_1') # optional, if not already sorted
               .groupby('id')['time_1']
               .transform(lambda x: x.drop_duplicates().iloc[-2:-1].squeeze())
               .fillna(pd.Timestamp(0))
               .le(df['time_2'])
            ]

Output:

    id              time_1              time_2
1  101 2024-06-20 15:21:31 2024-06-20 14:32:22
2  101 2024-06-20 15:21:31 2024-06-20 15:21:31
3  102 2024-06-20 16:26:51 2024-06-20 15:21:31
4  102 2024-06-20 16:26:51 2024-06-20 16:56:24
7  103 2024-06-20 23:11:56 2024-06-20 23:01:31

Intermediates:

    id              time_1              time_2           transform              fillna
0  101 2024-06-20 14:32:22 2024-06-20 14:10:31 2024-06-20 14:32:22 2024-06-20 14:32:22
1  101 2024-06-20 15:21:31 2024-06-20 14:32:22 2024-06-20 14:32:22 2024-06-20 14:32:22
2  101 2024-06-20 15:21:31 2024-06-20 15:21:31 2024-06-20 14:32:22 2024-06-20 14:32:22
3  102 2024-06-20 16:26:51 2024-06-20 15:21:31                 NaT 1970-01-01 00:00:00
4  102 2024-06-20 16:26:51 2024-06-20 16:56:24                 NaT 1970-01-01 00:00:00
5  103 2024-06-20 20:05:44 2024-06-20 21:17:35 2024-06-20 22:41:22 2024-06-20 22:41:22
6  103 2024-06-20 22:41:22 2024-06-20 22:21:31 2024-06-20 22:41:22 2024-06-20 22:41:22
7  103 2024-06-20 23:11:56 2024-06-20 23:01:31 2024-06-20 22:41:22 2024-06-20 22:41:22
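
The NaT for id 102 comes from the empty slice: with only one unique time_1, .iloc[-2:-1] selects nothing, so .squeeze() has no single element to reduce to a scalar, and the transform fills that group with NaT (as the intermediates show). A quick sketch of both cases, using timestamps from the example data:

import pandas as pd

# One unique value: iloc[-2:-1] is an empty slice, so squeeze() returns an empty Series
s1 = pd.Series(pd.to_datetime(['2024-06-20 16:26:51']))
print(s1.drop_duplicates().iloc[-2:-1].squeeze())   # Series([], dtype: datetime64[ns])

# Two unique values: the same slice isolates the second to last as a scalar
s2 = pd.Series(pd.to_datetime(['2024-06-20 14:32:22', '2024-06-20 15:21:31']))
print(s2.drop_duplicates().iloc[-2:-1].squeeze())   # Timestamp('2024-06-20 14:32:22')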
2024-07-08
mozway

Solution


Here is a possible solution using groupby.

I added an extra ID (104) whose group contains only a single row:

import pandas as pd

ids = [101, 101, 101, 102, 102, 103, 103, 103, 104]
time_1 = [
    '2024-06-20 14:32:22', '2024-06-20 15:21:31', '2024-06-20 15:21:31',
    '2024-06-20 16:26:51', '2024-06-20 16:26:51', '2024-06-20 20:05:44',
    '2024-06-20 22:41:22', '2024-06-20 23:11:56', '2024-06-20 23:11:56']
time_2 = [
    '2024-06-20 14:10:31', '2024-06-20 14:32:22', '2024-06-20 15:21:31',
    '2024-06-20 15:21:31', '2024-06-20 16:56:24', '2024-06-20 21:17:35',
    '2024-06-20 22:21:31', '2024-06-20 23:01:31', '2024-06-20 23:01:31']


df = pd.DataFrame({
    'id': ids,
    'time_1': pd.to_datetime(time_1),
    'time_2': pd.to_datetime(time_2)
})

We define a function that applies the filtering logic within each group. If a group has fewer than two unique time_1 values, the second latest does not exist, so all of its rows are kept:

def fun(x):
    # Sort the unique time_1 values so the second latest is well defined
    unique_times = x['time_1'].drop_duplicates().sort_values()
    if len(unique_times) >= 2:
        second_last_time = unique_times.iloc[-2]
        x = x[x['time_2'].ge(second_last_time)]
    return x

df.groupby('id', group_keys=False).apply(fun).reset_index(drop=True)

Output:

    id              time_1              time_2
0  101 2024-06-20 15:21:31 2024-06-20 14:32:22
1  101 2024-06-20 15:21:31 2024-06-20 15:21:31
2  102 2024-06-20 16:26:51 2024-06-20 15:21:31
3  102 2024-06-20 16:26:51 2024-06-20 16:56:24
4  103 2024-06-20 23:11:56 2024-06-20 23:01:31
5  104 2024-06-20 23:11:56 2024-06-20 23:01:31

The benefit of this approach grows with the size of the DataFrame: on a 90,000-row DataFrame I saw roughly a 25% improvement over the original loop.
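
To reproduce a comparison like this, one could time the function on a larger frame. The snippet below is a sketch under the assumption of a synthetic ~90,000-row frame built by tiling the example data with shifted ids, not the original benchmark setup:

import time

# Tile the 9 example rows 10,000 times (~90,000 rows), shifting ids so groups stay distinct
big = pd.concat(
    [df.assign(id=df['id'] + 1000 * i) for i in range(10_000)],
    ignore_index=True,
)

start = time.perf_counter()
result = big.groupby('id', group_keys=False).apply(fun).reset_index(drop=True)
print(f"groupby.apply on {len(big):,} rows: {time.perf_counter() - start:.3f}s")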

2024-07-08
rpanai

Solution


You can use .transform() to create the mask.

Sorting is not necessary, since you can just use .nlargest() and take the second value if it exists. And if time_1 is already sorted, you can skip .nlargest() (or sorting) entirely.

Then you just need to replace NaT with the smallest possible Timestamp value so that time_2 can't be earlier than it when you do the comparison.

second_last_times = df.groupby('id')['time_1'].transform(
    lambda s: s.drop_duplicates().nlargest(2).iloc[1:].squeeze())
mask = second_last_times.fillna(pd.Timestamp.min).le(df['time_2'])
df[mask]

Result:

    id              time_1              time_2
1  101 2024-06-20 15:21:31 2024-06-20 14:32:22
2  101 2024-06-20 15:21:31 2024-06-20 15:21:31
3  102 2024-06-20 16:26:51 2024-06-20 15:21:31
4  102 2024-06-20 16:26:51 2024-06-20 16:56:24
7  103 2024-06-20 23:11:56 2024-06-20 23:01:31

For reference, second_last_times:

0   2024-06-20 14:32:22
1   2024-06-20 14:32:22
2   2024-06-20 14:32:22
3                   NaT
4                   NaT
5   2024-06-20 22:41:22
6   2024-06-20 22:41:22
7   2024-06-20 22:41:22
Name: time_1, dtype: datetime64[ns]

If you want to generalize this, replace .nlargest(2).iloc[1:] with .nlargest(n).iloc[n-1:].
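
For illustration, that generalization could be wrapped in a small helper; the name filter_by_nth_latest is hypothetical, and df is assumed to be the frame from the question:

import pandas as pd

def filter_by_nth_latest(df, n=2):
    # Keep rows where time_2 is not earlier than the n-th latest unique time_1 per id;
    # groups with fewer than n unique times produce NaT, which Timestamp.min then keeps
    nth_latest = df.groupby('id')['time_1'].transform(
        lambda s: s.drop_duplicates().nlargest(n).iloc[n - 1:].squeeze())
    return df[nth_latest.fillna(pd.Timestamp.min).le(df['time_2'])]

print(filter_by_nth_latest(df, n=2))   # n=2 reproduces the result above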


P.S. This is similar to mozway's solution, but I actually wrote the code before they posted, except for the squeeze technique (thanks for that).

2024-07-08
wjandrea