Question
What is the fastest way to calculate a daily balance with compound interest in Pandas or Spark?
I have a DataFrame (DF) with deposits and withdrawals aggregated by day, and I want to know what is the fastest way to calculate the balance for each day. Because it must be able to scale. Answers in both Pandas and Spark are welcome! Here is an example of how the input DF looks like:
Input
date | deposit | withdrawal |
---|---|---|
2024-01-01 | 100.00 | 0.00 |
2024-01-02 | 0.00 | 0.00 |
2024-01-03 | 50.00 | 30.00 |
2024-01-04 | 0.00 | 0.00 |
2024-01-05 | 0.00 | 200.00 |
2024-01-06 | 20.00 | 0.00 |
2024-01-07 | 20.00 | 0.00 |
2024-01-08 | 0.00 | 0.00 |
These deposits and withdrawals are from an investment account that yields 10% per day. Unless the balance is negative. In this case, the daily return must be zero. The pseudo-code calculations to get the daily_return
and balance
columns are:
Movements = Previous day balance + Deposit - Withdrawal
Interest = 0.1 if Movements > 0 else 0
Daily return = Movements * Interest
Balance = Movements + Daily return
And below is an example of the desired output DF:
Desired output
date | deposit | withdrawal | daily_return | balance |
---|---|---|---|---|
2024-01-01 | 100.00 | 0.00 | 10.00 | 110.00 |
2024-01-02 | 0.00 | 0.00 | 11.00 | 121.00 |
2024-01-03 | 50.00 | 30.00 | 14.10 | 155.10 |
2024-01-04 | 0.00 | 0.00 | 15.51 | 170.61 |
2024-01-05 | 0.00 | 200.00 | 0.00 | -29.39 |
2024-01-06 | 20.00 | 0.00 | 0.00 | -9.39 |
2024-01-07 | 20.00 | 0.00 | 1.06 | 11.67 |
2024-01-08 | 0.00 | 0.00 | 1.17 | 12.84 |
What I have
I have a solution in Pandas that achieves the desired output, however it iterates over every line of the DF, i.e. it's slow. Is there a way to vectorize this calculation to speed it up? Or maybe another approach? Here is my implementation:
import pandas as pd
df = pd.DataFrame({
"date": pd.date_range(start="2024-01-01", end="2024-01-08"),
"deposit": [100.0, 0.0, 50.0, 0.0, 0.0, 20.0, 20.0, 0.0],
"withdrawal": [0.0, 0.0, 30.0, 0.0, 200.0, 0.0, 0.0, 0.0]
})
daily_returns = []
balances = []
prev_balance = 0
for _, row in df.iterrows():
movements = prev_balance + row["deposit"] - row["withdrawal"]
interest = 0.1 if movements > 0 else 0
daily_return = movements * interest
balance = movements + daily_return
daily_returns.append(daily_return)
balances.append(balance)
prev_balance = balance
df["daily_return"] = daily_returns
df["balance"] = balances