Question
Improve processing time of applying a function over a vector and grouping by columns
I am trying to apply a function over data.table columns, grouped by the values of other columns.
I am using the lapply function, but my script is quite slow.
To give some context, I am working with probability values:
- First, I multiply each of the 5 probability values for each "id" by a random value, capping the result at 1
- Then, I do the following calculation, grouping by the variables "group_1" and "group_2": PD_3_N = 1 - PROD(1 - PD_2_N)
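To make the two steps concrete, here is a minimal sketch on a tiny table with a single probability column. The table `toy` and its values are made up for illustration only; the column names follow the same pattern as in my real data.

```r
library(data.table)

# Hypothetical toy data: three rows in group (1, 1), one row in group (2, 1).
toy <- data.table(
  group_1 = c(1, 1, 1, 2),
  group_2 = c(1, 1, 1, 1),
  factor  = c(2, 1, 1, 3),
  PD_1_N0 = c(0.10, 0.20, 0.30, 0.40)
)

# Step 1: scale each probability by its row's factor, capped at 1.
toy[, PD_2_N0 := pmin(1, factor * PD_1_N0)]
# PD_2_N0 is now 0.2, 0.2, 0.3, 1 (the last value was capped: 3 * 0.4 = 1.2)

# Step 2: per (group_1, group_2), compute 1 - prod(1 - PD_2_N0).
toy[, PD_3_N0 := 1 - prod(1 - PD_2_N0), by = c("group_1", "group_2")]
# Group (1, 1): 1 - 0.8 * 0.8 * 0.7 = 0.552
# Group (2, 1): 1 - (1 - 1)         = 1

print(toy)
```

Every row in a group receives the same PD_3_N0 value, since `:=` with `by` recycles the group result over the rows of that group.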
Here is a reproducible example with dummy values:
###########
# Dummy data
library(data.table)
set.seed(99)
n_col <- 4
size <- 3e6
num_group2 <- 10
vec_1 <- paste0("PD_1_N", (0:n_col))
vec_2 <- paste0("PD_2_N", (0:n_col))
vec_3 <- paste0("PD_3_N", (0:n_col))
id <- rep(seq(1, size, 1), num_group2)
group_1 <- rep(sample(seq(1, size, 1), size=size, replace=TRUE), num_group2)
group_2 <- sort(rep(seq(1, num_group2, 1), size))
factor <- runif(size*num_group2, 0.5, 4)
data <- data.table(id, group_1, group_2, factor)
# Add the five PD_1_N* columns by reference
data[, (vec_1) := list(rep(runif(size, 0, 0.5), num_group2),
                       rep(runif(size, 0, 0.5), num_group2),
                       rep(runif(size, 0, 0.5), num_group2),
                       rep(runif(size, 0, 0.5), num_group2),
                       rep(runif(size, 0, 0.5), num_group2))]
###############
# lapply step 1
t <- Sys.time()
data[, (vec_2) := lapply(.SD, function(x) pmin(1, factor*x)), .SDcols=vec_1]
Sys.time() - t
###############
# lapply step 2
t <- Sys.time()
data[, (vec_3) := lapply(.SD, function(x) 1 - prod(1 - x)),
     by = c("group_1", "group_2"), .SDcols = vec_2]
Sys.time() - t
######################
# test: 2 steps in one
t <- Sys.time()
data[, (vec_3) := lapply(.SD, function(x) 1 - prod(1 - pmin(1, factor * x))),
     by = c("group_1", "group_2"), .SDcols = vec_1]
Sys.time() - t
# end test
- Step 1 is quite fast: around 1 second
- Step 2 is quite slow: around 1.9 minutes
Is there a way to improve the processing time of step 2? I am also surprised that, when I combine the 2 steps into a single line of code, it is actually much slower, around 10 minutes (see "test: 2 steps in one" in the code above).