Question

Interpolate dataframe in R subject to totals

I have the following dataset which has the number of A, B and C both in 2000 and 2005. I now need to inpolate the dataframe subject to the constraint that the sum of A, B and C must be consistent with the Total amount, which is available for every year. Since I am dealing with population data, the values have to be integers.

Picture of data frame

data <- data.frame( Year = c(2000, 2001, 2002, 2003, 2004, 2005),  
                   Total = c(50, 52, 53, 57, 60, 61),   
                       A = c(12, NA, NA, NA, NA, 17),  
                       B = c(22, NA, NA, NA, NA, 24),   
                       C = c(16, NA, NA, NA, NA, 20) )

So far, I tried using na.approx() from the zoo package. But I was unable to make the "Total"-constraint work.

Thank you very much for your help.

 2  146  2
1 Jan 1970

Solution

 4

You could first linearly interpolate columns using approx, then use proportions(., margin=1), multiply with the totals and round to integers.

> (proportions(sapply(lapply(data[, 3:5], approx, xout=seq_len(nrow(data))), 
+                    `[[`, "y"), margin=1)*data$Total) |> 
+   round() |> 
+   `mode<-`('integer')  ## optional
      A  B  C
[1,] 12 22 16
[2,] 13 22 17
[3,] 14 22 17
[4,] 15 23 19
[5,] 16 24 20
[6,] 17 24 20
2024-07-19
jay.sf

Solution

 3

You could interpolate each one individually, as you did using zoo::na.approx, and then use iterative proportional fitting (library ipfr) followed by optimized rounding using the roundfixS function from the sfsmisc package.

Interpolation

nr <- nrow(data) # here it is 6, but may differ for the real data
mtx <- matrix(data = sapply(data[,3:5], na.approx), 
              nrow = nr, ncol = 3)[-c(1, nr),]

addmargins(mtx)

                   Sum
    13 22.4 16.8  52.2
    14 22.8 17.6  54.4
    15 23.2 18.4  56.6
    16 23.6 19.2  58.8
Sum 58 92.0 72.0 222.0

Note the row sums don't agree with your constraint.

Iterative proportional fitting

library(ipfr)

row_targets <- data$Total[2:5]
column_targets <- colSums(mtx)
X <- ipu_matrix(mtx, row_targets, column_targets)
addmargins(X)

                               Sum
    12.94506 22.32052 16.73442  52
    13.63439 22.21969 17.14592  53
    15.10020 23.37088 18.52891  57
    16.32035 24.08890 19.59075  60
Sum 58.00000 92.00000 72.00000 222

Row totals now agree with your constraints, but the numbers inside are not integers.

Optimized rounding

library(sfsmisc)
X_round <- roundfixS(X)
addmargins(X_round)

             Sum
    13 22 17  52
    14 22 17  53
    15 23 19  57
    16 24 20  60
Sum 58 91 73 222

data <- data.frame( Year = c(2000, 2001, 2002, 2003, 2004, 2005),  
                   Total = c(50, 52, 53, 57, 60, 61),   
                       A = c(12, NA, NA, NA, NA, 17),  
                       B = c(22, NA, NA, NA, NA, 24),   
                       C = c(16, NA, NA, NA, NA, 20) )
2024-07-19
Edward

Solution

 2

1) We can solve this without rounding (as rounding may not guarantee that the constraints continue to hold) by defining the following quadratic integer programming problem with linear constraints. Define a least squares optimization problem based on this pseudo-code. The diff constraint ensures that each column is non-decreasing and the last constraint ensures that the first and last rows are not changed. (A variation of would be to use the sum of the absolute differences in the objective instead of the squared differences.)

ABC <- na.approx(data[-(1:2)])
Minimize sum((ABC - X)^2) over X
s.t. X is integer matrix and
     rowSums(X) == data$Total and
     diff(X) >= 0 and
     X[c(1, nr), ] == ABC[c(1, nr), ]

Function opt below implements this optimization. It has arguments ABC which is the A, B and C columns of na.approx(data[-(1:2)]) and Total which is the Total column of data. Using CVXR (since the problem is convex) define the X matrix to be solved for, the objective, the constraints and the problem (which is the objective plus the constraints) and then use CVXR's solve to solve that problem. If you prefer to use absolute differences rather than squared differences in the objective replace (ABC - X)^2 below with abs(ABC - X).

library(CVXR)
library(zoo)

opt <- function(ABC, Total) {
  nr <- nrow(ABC); nc <- ncol(ABC)
  X <- Variable(rows = nr, cols = nc, integer = TRUE)
  objective <- Minimize(sum((ABC - X)^2))
  constraints <- list(sum_entries(X, 1) == Total, diff(X) >= 0, 
    X[c(1, nr), ] == ABC[c(1, nr), ])
  problem <- Problem(objective, constraints)
  res <- solve(problem)
  print(res$status)
  res$getValue(X) 

}

ABC <- na.approx(data[-(1:2)])
replace(data, colnames(ABC), opt(ABC, data$Total))

giving

[1] "optimal"

  Year Total  A  B  C
1 2000    50 12 22 16
2 2001    52 13 22 17
3 2002    53 14 22 17
4 2003    57 15 23 19
5 2004    60 16 24 20
6 2005    61 17 24 20

2) We could combine the other two answers with some elements of this one to get this approach. Unlike (1) we are not guaranteed a solution even if a feasible solution exists since the approach does not handle the diff constraint other than to check afterwards whether it happened to be satisfied.

library(sfsmisc)
library(zoo)

opt2 <- function(ABC, Total, method = "offset-round") {
  rowRoundfixS <- function(x) t(apply(x, 1, roundfixS, method = method))
  ABC2 <- rowRoundfixS(proportions(ABC, 1) * Total)
  nr <- nrow(ABC)
  ok <- all(rowSums(ABC2) == Total) && all(diff(ABC2) >= 0) &&
     all(ABC2[c(1, nr), ] == ABC[c(1, nr), colnames(ABC)])
  print(if (ok) "ok" else "failed")
  ABC2
}

ABC <- na.approx(data[-(1:2)])
replace(data, colnames(ABC), opt(ABC, data$Total))

giving

[1] ok

  Year Total  A  B  C
1 2000    50 12 22 16
2 2001    52 13 22 17
3 2002    53 14 22 17
4 2003    57 15 23 19
5 2004    60 16 24 20
6 2005    61 17 24 20

which at least for this data gives the same result as (1).

Note

Input

data <- data.frame(
  Year = seq(2000, 2005, by = 1),
  Total = c(50, 52, 53, 57, 60, 61),
  A = c(12, NA, NA, NA, NA, 17),
  B = c(22, NA, NA, NA, NA, 24),
  C = c(16, NA, NA, NA, NA, 20)
)
2024-07-20
G. Grothendieck