Question

How can I preserve the previous value to find the row that is greater than it?

This is my DataFrame:

import pandas as pd
df = pd.DataFrame(
    {
        'start': [3, 11, 9, 19, 22],
        'end': [10, 17, 10, 25, 30]
    }
)

The expected output is a new column x:

   start  end     x
0      3   10  10.0
1     11   17  17.0
2      9   10   NaN
3     19   25  25.0
4     22   30   NaN

Logic:

I will explain it row by row. For row 0, x is df.end.iloc[0]. This value now needs to be preserved until a later row is found whose start is greater than it.

So 10 is saved and the process moves to row 1. Is 11 > 10? Yes, so x for row 1 is 17, and 17 becomes the new preserved value. For the next row: is 9 > 17? No, so x is NaN.

The process moves to the next row. Since no start greater than 17 has been found yet, 17 is still the preserved value. Is 19 > 17? Yes, so x is set to 25. For the last row, since 22 < 25, x is NaN.

Here are additional examples with different DataFrames and their desired outputs:

df = pd.DataFrame({'start': [3, 20, 11, 19, 22],'end': [10, 17, 21, 25, 30]})
   start  end     x
0      3   10  10.0
1     20   17  17.0
2     11   21   NaN
3     19   25  25.0
4     22   30   NaN

df = pd.DataFrame({'start': [3, 9, 11, 19, 22],'end': [10, 17, 21, 25, 30]})
   start  end     x
0      3   10  10.0
1      9   17   NaN
2     11   21  21.0
3     19   25   NaN
4     22   30  30.0

df = pd.DataFrame({'start': [3, 11, 9, 19, 22],'end': [10, 17, 21, 25, 30]})    
   start  end     x
0      3   10  10.0
1     11   17  17.0
2      9   21   NaN
3     19   25  25.0
4     22   30   NaN

This gives me the result. Is there a vectorized way to do this?

import numpy as np

l = []
for ind, row in df.iterrows():
    if ind == 0:
        x = row['end']
        l.append(x)
        continue
    if row['start'] > x:
        x = row['end']
        l.append(x)
    else:
        l.append(np.nan)

df['x'] = l

Solution


updated answer

If the previous end must be propagated, then the logic cannot be fully vectorized. However, it can still be made much faster than iterrows using numba:

import numpy as np
from numba import jit

@jit(nopython=True)
def f(start, end):
    prev_e = -np.inf              # last preserved "end" value
    out = []
    for s, e in zip(start, end):
        if s > prev_e:            # start beats the preserved value -> keep this end
            out.append(float(e))
            prev_e = e
        else:                     # otherwise emit a missing value
            out.append(np.nan)
    return out

df['x'] = f(df['start'].to_numpy(), df['end'].to_numpy())

Output:

# example 1
   start  end     x
0      3   10  10.0
1     11   17  17.0
2      9   10   NaN
3     19   25  25.0
4     22   30   NaN

# example 2
   start  end     x
0      3   10  10.0
1     20   17  17.0
2     11   21   NaN
3     19   25  25.0
4     22   30   NaN

# example 3
   start  end     x
0      3   10  10.0
1      9   17   NaN
2     11   21  21.0
3     19   25   NaN
4     22   30  30.0

# example 4
   start  end     x
0      3   10  10.0
1     11   17  17.0
2      9   21   NaN
3     19   25  25.0
4     22   30   NaN
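
For a rough sense of the speedup over iterrows, here is a minimal timing sketch (assuming the imports above; the frame size and value range are arbitrary placeholders, and the first call is made separately so that JIT compilation is not included in the timing):

import time

rng = np.random.default_rng(0)
big = pd.DataFrame({'start': rng.integers(0, 1_000, 500_000),
                    'end': rng.integers(0, 1_000, 500_000)})

f(big['start'].to_numpy(), big['end'].to_numpy())  # warm-up call triggers compilation

t0 = time.perf_counter()
big['x'] = f(big['start'].to_numpy(), big['end'].to_numpy())
print(f"numba loop: {time.perf_counter() - t0:.3f}s")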

original answer

IIUC, you could use shift to form a boolean condition and mask to hide the non-valid values:

df['x'] = df['end'].mask(df['start'].le(df['end'].shift()))

The trick here is to compare start <= end.shift(), which results in False for the first row because of the NaN, so the first end is kept. If you instead want to exclude the first row, you can use df['end'].where(df['start'].gt(df['end'].shift())) (a sketch of this variant follows the intermediates below).

Output:

   start  end     x
0      3   10  10.0
1     11   17  17.0
2      9   10   NaN
3     19   25  25.0
4     22   30   NaN

Intermediates:

   start  end     x  end.shift  start<=end.shift
0      3   10  10.0        NaN             False
1     11   17  17.0       10.0             False
2      9   10   NaN       17.0              True
3     19   25  25.0       10.0             False
4     22   30   NaN       25.0              True
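
For comparison, a quick sketch of the where variant mentioned above (x_where is just an illustrative column name); since 3 > NaN evaluates to False, the first row becomes NaN as well:

df['x_where'] = df['end'].where(df['start'].gt(df['end'].shift()))

   start  end     x  x_where
0      3   10  10.0      NaN
1     11   17  17.0     17.0
2      9   10   NaN      NaN
3     19   25  25.0     25.0
4     22   30   NaN      NaN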
2024-06-27
mozway

Solution


You can use .shift() to shift your end column by 1 and avoid iterating over each row individually.

First, create your empty 'x' column, and assign the first entry to be the first entry of your 'end' column, like this:

import numpy as np

df['x'] = np.nan
df.loc[0, 'x'] = df.loc[0, 'end']

   start  end     x
0      3   10  10.0
1     11   17   NaN
2      9   10   NaN
3     19   25   NaN
4     22   30   NaN

Then, you can shift the 'end' column forwards by 1, and filter to the rows where this shifted value is less than the start value. You can put this in a .loc to assign the values of 'end' to your 'x' column only when that requirement is satisfied:

df.loc[(df['end'].shift(1) < df['start']), 'x'] = df['end']

Giving you:

   start  end     x
0      3   10  10.0
1     11   17  17.0
2      9   10   NaN
3     19   25  25.0
4     22   30   NaN
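
If you prefer a single expression, the two assignments can also be collapsed into one where call (a sketch assuming the default RangeIndex, so df.index == 0 picks out the first row):

df['x'] = df['end'].where((df['end'].shift(1) < df['start']) | (df.index == 0))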
2024-06-27
Emi OB

Solution


The answers above should work, but in case you would like to use something based on iterrows:

import numpy as np

x = df['end'].iloc[0]
final_values = []
for index, row in df.iterrows():
    if index == 0:
        final_values.append(x)
    else:
        if row['start'] > x:
            x = row['end']
            final_values.append(x)
        else:  # row['start'] <= x
            final_values.append(np.nan)

df['x'] = final_values
2024-06-27
Pawan Tolani

Solution


Use numba.

import pandas as pd
import numpy as np
from numba import njit


@njit
def g(start, end):
    x = np.full(len(start), np.nan)  # output column, NaN by default
    last_valid_end = end[0]
    x[0] = last_valid_end

    for i in range(1, len(start)):
        if start[i] > last_valid_end:  # start beats the preserved end -> keep this end
            last_valid_end = end[i]
            x[i] = last_valid_end

    return x


def f(df):
    start = df['start'].values
    end = df['end'].values
    df['x'] = g(start, end)
    return df


dfs = [
    pd.DataFrame({'start': [3, 20, 11, 19, 22], 'end': [10, 17, 21, 25, 30]}),
    pd.DataFrame({'start': [3, 9, 11, 19, 22], 'end': [10, 17, 21, 25, 30]}),
    pd.DataFrame({'start': [3, 11, 9, 19, 22], 'end': [10, 17, 21, 25, 30]}),
    pd.DataFrame({'start': [3, 11, 9, 19, 22], 'end': [10, 17, 10, 25, 30]})
]

for df in dfs:
    df = f(df)
    print(df)
    print()
'''
   start  end     x
0      3   10  10.0
1     20   17  17.0
2     11   21   NaN
3     19   25  25.0
4     22   30   NaN

   start  end     x
0      3   10  10.0
1      9   17   NaN
2     11   21  21.0
3     19   25   NaN
4     22   30  30.0

   start  end     x
0      3   10  10.0
1     11   17  17.0
2      9   21   NaN
3     19   25  25.0
4     22   30   NaN

   start  end     x
0      3   10  10.0
1     11   17  17.0
2      9   10   NaN
3     19   25  25.0
4     22   30   NaN
'''

The following method works only for this particular dataset, because it compares start against a shifted cumulative maximum of end rather than against the preserved value, as the sketch after the output below illustrates.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'start': [3, 11, 9, 19, 22],
    'end': [10, 17, 10, 25, 30]
})
print(df)

start = df.start.values
end = df.end.values

cumulative_max_end = np.maximum.accumulate(end)  # running maximum of end
df['cumulative_max_end'] = cumulative_max_end

npc = np.concatenate(([end[0]], cumulative_max_end[:-1]))  # running maximum shifted by one row
df['npc'] = npc

mask = start > npc  # rows whose start beats the shifted running maximum
df['mask'] = mask

x = np.full_like(start, np.nan, dtype=np.float64)
x[mask] = end[mask]
x[0] = end[0]  # first row always keeps its end

df['res'] = x
print(df)
'''
   start  end  cumulative_max_end  npc   mask   res
0      3   10                  10   10  False  10.0
1     11   17                  17   10   True  17.0
2      9   10                  17   17  False   NaN
3     19   25                  25   17   True  25.0
4     22   30                  30   25  False   NaN
'''
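
To illustrate why this shortcut does not generalize, here is a quick check (df2, end2 and npc2 are just illustrative names) using the question's example with start = [3, 9, 11, 19, 22]: the shifted cumulative maximum keeps growing even after a row is rejected, while the expected logic preserves the last accepted end.

df2 = pd.DataFrame({'start': [3, 9, 11, 19, 22], 'end': [10, 17, 21, 25, 30]})
end2 = df2['end'].to_numpy()
npc2 = np.concatenate(([end2[0]], np.maximum.accumulate(end2)[:-1]))
print(npc2)                            # [10 10 17 21 25]
print(df2['start'].to_numpy() > npc2)  # [False False False False False]
# The expected x for this frame is [10, NaN, 21, NaN, 30], so comparing
# against the running maximum wrongly rejects rows 2 and 4.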
2024-06-30
Soudipta Dutta