Question
Efficiently remove rows from pandas df based on second latest time in column
I have a pandas DataFrame that looks similar to this:
Index | ID | time_1 | time_2 |
---|---|---|---|
0 | 101 | 2024-06-20 14:32:22 | 2024-06-20 14:10:31 |
1 | 101 | 2024-06-20 15:21:31 | 2024-06-20 14:32:22 |
2 | 101 | 2024-06-20 15:21:31 | 2024-06-20 15:21:31 |
3 | 102 | 2024-06-20 16:26:51 | 2024-06-20 15:21:31 |
4 | 102 | 2024-06-20 16:26:51 | 2024-06-20 16:56:24 |
5 | 103 | 2024-06-20 20:05:44 | 2024-06-20 21:17:35 |
6 | 103 | 2024-06-20 22:41:22 | 2024-06-20 22:21:31 |
7 | 103 | 2024-06-20 23:11:56 | 2024-06-20 23:01:31 |
For each ID in my df I want to take the second latest distinct time_1 (if it exists). I then want to compare this time with the timestamps in time_2 and remove all rows for that ID where time_2 is earlier than this time. For example, for ID 103 the second latest distinct time_1 is 2024-06-20 22:41:22, so rows 5 and 6 are dropped; ID 102 has only one distinct time_1, so both of its rows are kept. My expected output would be:
Index | ID | time_1 | time_2 |
---|---|---|---|
1 | 101 | 2024-06-20 15:21:31 | 2024-06-20 14:32:22 |
2 | 101 | 2024-06-20 15:21:31 | 2024-06-20 15:21:31 |
3 | 102 | 2024-06-20 16:26:51 | 2024-06-20 15:21:31 |
4 | 102 | 2024-06-20 16:26:51 | 2024-06-20 16:56:24 |
7 | 103 | 2024-06-20 23:11:56 | 2024-06-20 23:01:31 |
This problem is above my pandas level. I asked ChatGPT, and this is the solution I got, which in principle does what I want:
import pandas as pd

ids = [101, 101, 101, 102, 102, 103, 103, 103]
time_1 = ['2024-06-20 14:32:22', '2024-06-20 15:21:31', '2024-06-20 15:21:31', '2024-06-20 16:26:51', '2024-06-20 16:26:51', '2024-06-20 20:05:44', '2024-06-20 22:41:22', '2024-06-20 23:11:56']
time_2 = ['2024-06-20 14:10:31', '2024-06-20 14:32:22', '2024-06-20 15:21:31', '2024-06-20 15:21:31', '2024-06-20 16:56:24', '2024-06-20 21:17:35', '2024-06-20 22:21:31', '2024-06-20 23:01:31']

df = pd.DataFrame({
    'id': ids,
    'time_1': pd.to_datetime(time_1),
    'time_2': pd.to_datetime(time_2)
})

mask = pd.Series(False, index=df.index)
for id_value, group in df.groupby('id'):
    # Remove duplicates and sort the timestamps
    unique_sorted_times = group['time_1'].drop_duplicates().sort_values()
    # Check if there is more than one distinct time
    if len(unique_sorted_times) > 1:
        # Select the second last distinct time
        second_last_time = unique_sorted_times.iloc[-2]
        # Mark rows whose time_2 is greater than or equal to that time
        mask |= (df['id'] == id_value) & (df['time_2'] >= second_last_time)
    else:
        # Only one distinct time: keep all rows for this id
        mask |= (df['id'] == id_value)

filtered_data = df[mask]
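On the example data this returns exactly the rows I expect (index values 1, 2, 3, 4 and 7).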
My issue with this solution is the for-loop: looping over the groups in Python seems rather inefficient, and my real data is quite large. Is there a better, more efficient (ideally vectorized) solution for this?
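One loop-free idea I have sketched out (only tested against the example data above, so I am not sure it is the idiomatic approach) is to precompute one cutoff per id by deduplicating, sorting, and numbering the distinct times from the end of each group with cumcount, then filtering with a single boolean mask:

```python
# Sketch of a loop-free variant: one cutoff per id, then a single mask.
uniq = (df[['id', 'time_1']]
        .drop_duplicates()
        .sort_values(['id', 'time_1']))

# Number the distinct times from the end of each id group:
# 0 = latest, 1 = second latest.
pos_from_end = uniq.groupby('id').cumcount(ascending=False)

# Second latest distinct time_1 per id; ids with only one distinct
# time_1 simply get no entry here.
cutoff = uniq[pos_from_end == 1].set_index('id')['time_1']

# Map the cutoffs back onto the full frame (NaT where an id has none)
# and keep rows with no cutoff or with time_2 >= cutoff.
row_cutoff = df['id'].map(cutoff)
filtered_data = df[row_cutoff.isna() | (df['time_2'] >= row_cutoff)]
```

This reproduces my expected output on the example data, but whether it actually beats the explicit loop on large frames is exactly what I would like to know.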