Question

What is the fastest way to calculate a daily balance with compound interest in Pandas or Spark?

I have a DataFrame (DF) with deposits and withdrawals aggregated by day, and I want to know what is the fastest way to calculate the balance for each day. Because it must be able to scale. Answers in both Pandas and Spark are welcome! Here is an example of how the input DF looks like:

Input

date deposit withdrawal
2024-01-01 100.00 0.00
2024-01-02 0.00 0.00
2024-01-03 50.00 30.00
2024-01-04 0.00 0.00
2024-01-05 0.00 200.00
2024-01-06 20.00 0.00
2024-01-07 20.00 0.00
2024-01-08 0.00 0.00

These deposits and withdrawals are from an investment account that yields 10% per day. Unless the balance is negative. In this case, the daily return must be zero. The pseudo-code calculations to get the daily_return and balance columns are:

Movements = Previous day balance + Deposit - Withdrawal
Interest = 0.1 if Movements > 0 else 0
Daily return = Movements * Interest
Balance = Movements + Daily return

And below is an example of the desired output DF:

Desired output

date deposit withdrawal daily_return balance
2024-01-01 100.00 0.00 10.00 110.00
2024-01-02 0.00 0.00 11.00 121.00
2024-01-03 50.00 30.00 14.10 155.10
2024-01-04 0.00 0.00 15.51 170.61
2024-01-05 0.00 200.00 0.00 -29.39
2024-01-06 20.00 0.00 0.00 -9.39
2024-01-07 20.00 0.00 1.06 11.67
2024-01-08 0.00 0.00 1.17 12.84

What I have

I have a solution in Pandas that achieves the desired output, however it iterates over every line of the DF, i.e. it's slow. Is there a way to vectorize this calculation to speed it up? Or maybe another approach? Here is my implementation:

import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range(start="2024-01-01", end="2024-01-08"),
    "deposit": [100.0, 0.0, 50.0, 0.0, 0.0, 20.0, 20.0, 0.0],
    "withdrawal": [0.0, 0.0, 30.0, 0.0, 200.0, 0.0, 0.0, 0.0]
})

daily_returns = []
balances = []
prev_balance = 0

for _, row in df.iterrows():

    movements = prev_balance + row["deposit"] - row["withdrawal"] 
    interest = 0.1 if movements > 0 else 0
    daily_return = movements * interest
    balance = movements + daily_return
    
    daily_returns.append(daily_return)
    balances.append(balance)
    
    prev_balance = balance

df["daily_return"] = daily_returns
df["balance"] = balances
 3  74  3
1 Jan 1970

Solution

 4

For this type of computations I'd use , e.g.:

from numba import njit


@njit
def calculate(deposits, withdrawals, out_daily_return, out_balance):
    prev_balance = 0

    for i, (deposit, withdrawal) in enumerate(zip(deposits, withdrawals)):
        movements = prev_balance + deposit - withdrawal
        interest = 0.1 if movements > 0 else 0
        daily_return = movements * interest
        balance = movements + daily_return

        out_daily_return[i] = daily_return
        out_balance[i] = balance

        prev_balance = balance


df["daily_return"] = 0.0
df["balance"] = 0.0

calculate(
    df["deposit"].values,
    df["withdrawal"].values,
    df["daily_return"].values,
    df["balance"].values,
)

print(df)

Prints:

        date  deposit  withdrawal  daily_return   balance
0 2024-01-01    100.0         0.0       10.0000  110.0000
1 2024-01-02      0.0         0.0       11.0000  121.0000
2 2024-01-03     50.0        30.0       14.1000  155.1000
3 2024-01-04      0.0         0.0       15.5100  170.6100
4 2024-01-05      0.0       200.0       -0.0000  -29.3900
5 2024-01-06     20.0         0.0       -0.0000   -9.3900
6 2024-01-07     20.0         0.0        1.0610   11.6710
7 2024-01-08      0.0         0.0        1.1671   12.8381

Quick benchmark:

from time import monotonic

df = pd.DataFrame(
    {
        "date": pd.date_range(start="2024-01-01", end="2024-01-08"),
        "deposit": [100.0, 0.0, 50.0, 0.0, 0.0, 20.0, 20.0, 0.0],
        "withdrawal": [0.0, 0.0, 30.0, 0.0, 200.0, 0.0, 0.0, 0.0],
    }
)

df = pd.concat([df] * 1_000_000)
print(f"{len(df)=}")

start_time = monotonic()

df["daily_return"] = 0.0
df["balance"] = 0.0

calculate(
    df["deposit"].values,
    df["withdrawal"].values,
    df["daily_return"].values,
    df["balance"].values,
)

print("Time =", monotonic() - start_time)

Prints on my AMD 5700x:

len(df)=8000000
Time = 0.11215395800536498
2024-07-02
Andrej Kesely