Question

Efficiently remove rows from pandas df based on second latest time in column

I have a pandas DataFrame that looks similar to this:

Index   ID               time_1               time_2
0      101  2024-06-20 14:32:22  2024-06-20 14:10:31
1      101  2024-06-20 15:21:31  2024-06-20 14:32:22
2      101  2024-06-20 15:21:31  2024-06-20 15:21:31
3      102  2024-06-20 16:26:51  2024-06-20 15:21:31
4      102  2024-06-20 16:26:51  2024-06-20 16:56:24
5      103  2024-06-20 20:05:44  2024-06-20 21:17:35
6      103  2024-06-20 22:41:22  2024-06-20 22:21:31
7      103  2024-06-20 23:11:56  2024-06-20 23:01:31

For each ID in my df I want to take the second latest time_1 (if it exists). I then want to compare this time with the timestamps in time_2 and remove all rows from my df where time_2 is earlier than this time. My expected output would be:

Index   ID               time_1               time_2
1      101  2024-06-20 15:21:31  2024-06-20 14:32:22
2      101  2024-06-20 15:21:31  2024-06-20 15:21:31
3      102  2024-06-20 16:26:51  2024-06-20 15:21:31
4      102  2024-06-20 16:26:51  2024-06-20 16:56:24
7      103  2024-06-20 23:11:56  2024-06-20 23:01:31

This problem is above my pandas level. I asked ChatGPT, and the solution it gave does in principle what I want:

import pandas as pd

ids = [101, 101, 101, 102, 102, 103, 103, 103]
time_1 = ['2024-06-20 14:32:22', '2024-06-20 15:21:31', '2024-06-20 15:21:31', '2024-06-20 16:26:51', '2024-06-20 16:26:51', '2024-06-20 20:05:44', '2024-06-20 22:41:22', '2024-06-20 23:11:56']
time_2 = ['2024-06-20 14:10:31', '2024-06-20 14:32:22', '2024-06-20 15:21:31', '2024-06-20 15:21:31', '2024-06-20 16:56:24', '2024-06-20 21:17:35', '2024-06-20 22:21:31', '2024-06-20 23:01:31']


df = pd.DataFrame({
    'id': ids,
    'time_1': pd.to_datetime(time_1),
    'time_2': pd.to_datetime(time_2)
})

mask = pd.Series(False, index=df.index)

for id_value, group in df.groupby('id'):
    # Remove duplicates and sort timestamps
    unique_sorted_times = group['time_1'].drop_duplicates().sort_values()

    # Check if there's more than one unique time
    if len(unique_sorted_times) > 1:
        # Select the second last time
        second_last_time = unique_sorted_times.iloc[-2]
        # Update the mask for rows with time_2 greater than or equal to the second last time_1
        mask |= (df['id'] == id_value) & (df['time_2'] >= second_last_time)
    else:
        # If there's only one unique time, keep the row(s)
        mask |= (df['id'] == id_value)

filtered_data = df[mask]

My issue with this solution is the for-loop, which seems rather inefficient, and my real data is quite large. I am also curious whether there is a better, more efficient solution for this.


Solution


IIUC, you could use groupby.transform with drop_duplicates to remove duplicate values, and .iloc[-2:-1].squeeze() to get the second to last value if there is one, else NaT. Then perform boolean indexing:

out = df.loc[df.sort_values(by='time_1') # optional, if not already sorted
               .groupby('id')['time_1']
               .transform(lambda x: x.drop_duplicates().iloc[-2:-1].squeeze())
               .fillna(pd.Timestamp(0))
               .le(df['time_2'])
            ]

Output:

    id              time_1              time_2
1  101 2024-06-20 15:21:31 2024-06-20 14:32:22
2  101 2024-06-20 15:21:31 2024-06-20 15:21:31
3  102 2024-06-20 16:26:51 2024-06-20 15:21:31
4  102 2024-06-20 16:26:51 2024-06-20 16:56:24
7  103 2024-06-20 23:11:56 2024-06-20 23:01:31

Intermediates:

    id              time_1              time_2           transform              fillna
0  101 2024-06-20 14:32:22 2024-06-20 14:10:31 2024-06-20 14:32:22 2024-06-20 14:32:22
1  101 2024-06-20 15:21:31 2024-06-20 14:32:22 2024-06-20 14:32:22 2024-06-20 14:32:22
2  101 2024-06-20 15:21:31 2024-06-20 15:21:31 2024-06-20 14:32:22 2024-06-20 14:32:22
3  102 2024-06-20 16:26:51 2024-06-20 15:21:31                 NaT 1970-01-01 00:00:00
4  102 2024-06-20 16:26:51 2024-06-20 16:56:24                 NaT 1970-01-01 00:00:00
5  103 2024-06-20 20:05:44 2024-06-20 21:17:35 2024-06-20 22:41:22 2024-06-20 22:41:22
6  103 2024-06-20 22:41:22 2024-06-20 22:21:31 2024-06-20 22:41:22 2024-06-20 22:41:22
7  103 2024-06-20 23:11:56 2024-06-20 23:01:31 2024-06-20 22:41:22 2024-06-20 22:41:22
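
The NaT for id 102 comes from the empty slice: with only one unique time_1, .iloc[-2:-1] selects nothing, so .squeeze() has no single element to reduce to a scalar, and the transform fills that group with NaT (as the intermediates show). A quick sketch of both cases, using timestamps from the example data:

import pandas as pd

# One unique value: iloc[-2:-1] is an empty slice, so squeeze() returns an empty Series
s1 = pd.Series(pd.to_datetime(['2024-06-20 16:26:51']))
print(s1.drop_duplicates().iloc[-2:-1].squeeze())   # Series([], dtype: datetime64[ns])

# Two unique values: the same slice isolates the second to last as a scalar
s2 = pd.Series(pd.to_datetime(['2024-06-20 14:32:22', '2024-06-20 15:21:31']))
print(s2.drop_duplicates().iloc[-2:-1].squeeze())   # Timestamp('2024-06-20 14:32:22')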
2024-07-08
mozway

Solution


Here is a possible solution using groupby.

I added an extra ID (104) whose group contains only a single row:

import pandas as pd

ids = [101, 101, 101, 102, 102, 103, 103, 103, 104]
time_1 = [
    '2024-06-20 14:32:22', '2024-06-20 15:21:31', '2024-06-20 15:21:31',
    '2024-06-20 16:26:51', '2024-06-20 16:26:51', '2024-06-20 20:05:44',
    '2024-06-20 22:41:22', '2024-06-20 23:11:56', '2024-06-20 23:11:56']
time_2 = [
    '2024-06-20 14:10:31', '2024-06-20 14:32:22', '2024-06-20 15:21:31',
    '2024-06-20 15:21:31', '2024-06-20 16:56:24', '2024-06-20 21:17:35',
    '2024-06-20 22:21:31', '2024-06-20 23:01:31', '2024-06-20 23:01:31']


df = pd.DataFrame({
    'id': ids,
    'time_1': pd.to_datetime(time_1),
    'time_2': pd.to_datetime(time_2)
})

We define a function that applies the filtering logic within each group. If a group has fewer than two unique time_1 values, the second latest does not exist, so all of its rows are kept:

def fun(x):
    # Sort the unique time_1 values so the second latest is well defined
    unique_times = x['time_1'].drop_duplicates().sort_values()
    if len(unique_times) >= 2:
        second_last_time = unique_times.iloc[-2]
        x = x[x['time_2'].ge(second_last_time)]
    return x

df.groupby('id', group_keys=False).apply(fun).reset_index(drop=True)

Output:

    id              time_1              time_2
0  101 2024-06-20 15:21:31 2024-06-20 14:32:22
1  101 2024-06-20 15:21:31 2024-06-20 15:21:31
2  102 2024-06-20 16:26:51 2024-06-20 15:21:31
3  102 2024-06-20 16:26:51 2024-06-20 16:56:24
4  103 2024-06-20 23:11:56 2024-06-20 23:01:31
5  104 2024-06-20 23:11:56 2024-06-20 23:01:31

The benefit of this approach grows with the size of the DataFrame: on a 90,000-row DataFrame I saw roughly a 25% improvement over the original loop.
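
To reproduce a comparison like this, one could time the function on a larger frame. The snippet below is a sketch under the assumption of a synthetic ~90,000-row frame built by tiling the example data with shifted ids, not the original benchmark setup:

import time

# Tile the 9 example rows 10,000 times (~90,000 rows), shifting ids so groups stay distinct
big = pd.concat(
    [df.assign(id=df['id'] + 1000 * i) for i in range(10_000)],
    ignore_index=True,
)

start = time.perf_counter()
result = big.groupby('id', group_keys=False).apply(fun).reset_index(drop=True)
print(f"groupby.apply on {len(big):,} rows: {time.perf_counter() - start:.3f}s")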

2024-07-08
rpanai

Solution


You can use .transform() to create the mask.

Sorting is not necessary, since you can just use .nlargest() and take the second value if it exists. And if time_1 is already sorted, you can skip .nlargest() (or sorting) entirely.

Then you just need to replace NaT with the smallest possible Timestamp value so that time_2 can't be earlier than it when you do the comparison.

second_last_times = df.groupby('id')['time_1'].transform(
    lambda s: s.drop_duplicates().nlargest(2).iloc[1:].squeeze())
mask = second_last_times.fillna(pd.Timestamp.min).le(df['time_2'])
df[mask]

Result:

    id              time_1              time_2
1  101 2024-06-20 15:21:31 2024-06-20 14:32:22
2  101 2024-06-20 15:21:31 2024-06-20 15:21:31
3  102 2024-06-20 16:26:51 2024-06-20 15:21:31
4  102 2024-06-20 16:26:51 2024-06-20 16:56:24
7  103 2024-06-20 23:11:56 2024-06-20 23:01:31

For reference, second_last_times:

0   2024-06-20 14:32:22
1   2024-06-20 14:32:22
2   2024-06-20 14:32:22
3                   NaT
4                   NaT
5   2024-06-20 22:41:22
6   2024-06-20 22:41:22
7   2024-06-20 22:41:22
Name: time_1, dtype: datetime64[ns]

If you want to generalize this, replace .nlargest(2).iloc[1:] with .nlargest(n).iloc[n-1:].
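
For illustration, that generalization could be wrapped in a small helper; the name filter_by_nth_latest is hypothetical, and df is assumed to be the frame from the question:

import pandas as pd

def filter_by_nth_latest(df, n=2):
    # Keep rows where time_2 is not earlier than the n-th latest unique time_1 per id;
    # groups with fewer than n unique times produce NaT, which Timestamp.min then keeps
    nth_latest = df.groupby('id')['time_1'].transform(
        lambda s: s.drop_duplicates().nlargest(n).iloc[n - 1:].squeeze())
    return df[nth_latest.fillna(pd.Timestamp.min).le(df['time_2'])]

print(filter_by_nth_latest(df, n=2))   # n=2 reproduces the result above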


P.S. This is similar to mozway's solution, but I actually wrote the code before they posted, except for the squeeze technique (thanks for that).

2024-07-08
wjandrea