Question

Improve processing time of applying a function over a vector and grouping by columns

I am trying to apply a function over data.table columns, grouping by the values of other columns.

I am using the lapply function, but my script is quite slow.

To give some context, I am working with probability values:

  • First, I multiply each set of 5 probability values for each "id" by a random factor, capping the result at 1
  • Then, grouping by the variables "group_1" and "group_2", I compute: PD_3_N = 1 - prod(1 - PD_2_N)
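For example, for a hypothetical group containing two rows with PD_2_N values 0.1 and 0.2:

p <- c(0.1, 0.2)
1 - prod(1 - p)  # 0.28: one minus the product of the complements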

Here is a reproducible example with dummy values:

###########
# Dummy data
library(data.table)
set.seed(99)
n_col <- 4
size <- 3e6
num_group2 <- 10
vec_1 <- paste0("PD_1_N", (0:n_col))
vec_2 <- paste0("PD_2_N", (0:n_col))
vec_3 <- paste0("PD_3_N", (0:n_col))
id <- rep(seq(1, size, 1), num_group2)
group_1 <- rep(sample(seq(1, size, 1), size=size, replace=TRUE), num_group2)
group_2 <- sort(rep(seq(1, num_group2, 1), size))
factor <- runif(size*num_group2, 0.5, 4)
data <- data.table(id, group_1, group_2, factor)
data[, vec_1] <- data.table(rep(runif(size, 0, 0.5), num_group2), 
                            rep(runif(size, 0, 0.5), num_group2), 
                            rep(runif(size, 0, 0.5), num_group2), 
                            rep(runif(size, 0, 0.5), num_group2), 
                            rep(runif(size, 0, 0.5), num_group2))
###############
# lapply step 1
t <- Sys.time()
data[, (vec_2) := lapply(.SD, function(x) pmin(1, factor*x)), .SDcols=vec_1]
Sys.time() - t

###############
# lapply step 2
t <- Sys.time()
data[, (vec_3) := lapply(.SD, function(x) 1 - prod((1 - x))), 
     by=c("group_1", "group_2"), .SDcols=vec_2]
Sys.time() - t

######################
# test: 2 steps in one
t <- Sys.time()
data[, (vec_3) := lapply(.SD, function(x) 1 - prod((1 - pmin(1, factor*x)))), 
     by=c("group_1", "group_2"), .SDcols=vec_1]
Sys.time() - t
# end test
  • Step 1 is quite fast: around 1 second
  • Step 2 is quite slow: around 1.9 minutes

Is there a way to improve the processing time of step 2? I am also surprised that when I combine the two steps into a single expression, it is actually much slower: around 10 minutes (see "test: 2 steps in one" in the code above).


Solution


data.table has an optimized version of prod. The data.table authors call this “GForce” optimization. See the "What is this sorcery" section here.

The issue is that GForce will not kick in when other operations are combined with, or nested inside, the prod call.
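For example (a minimal sketch with placeholder names dt and g): a bare call to prod in j is GForce-eligible, while a compound expression around it is not:

library(data.table)
dt <- data.table(g = rep(1:2, each = 3), x = runif(6))

# GForce-eligible: j is a plain call to an optimized function
dt[, prod(x), by = g]

# Not GForce-eligible: prod is nested inside extra arithmetic,
# so eval(j) runs once per group instead
dt[, 1 - prod(1 - x), by = g]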

We have to construct it step by step:

# First, what is inside the prod. No need for grouping
data[, (vec_3) := lapply(.SD, function(x) 1 - x), .SDcols = vec_2]

# Now the prod alone, by groups; note that .SDcols now points at vec_3
data[, (vec_3) := lapply(.SD, prod), 
     by = c("group_1", "group_2"), .SDcols = vec_3]

# Now, the outer operation, no need for grouping
data[, (vec_3) := lapply(.SD, function(x) 1 - x), .SDcols = vec_3]
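As a sanity check (assuming enough memory for a copy; the three passes leave vec_2 untouched), you can verify that the step-by-step version matches the original grouped expression:

# Compare against the one-shot grouped version on a copy
chk <- copy(data)
chk[, (vec_3) := lapply(.SD, function(x) 1 - prod(1 - x)),
    by = c("group_1", "group_2"), .SDcols = vec_2]
all.equal(data[, ..vec_3], chk[, ..vec_3])  # should be TRUE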


My results comparing both approaches:

> data.table::setkey(data, group_1, group_2)
> # lapply step 2
> t <- Sys.time()
> data[, (vec_3) := lapply(.SD, function(x) 1 - prod((1 - x))), 
+      by=c("group_1", "group_2"), .SDcols=vec_2]
> Sys.time() - t
Time difference of 1.847846 mins

> # Optimized
> t <- Sys.time()
> # First, what is inside the prod. No need for grouping
> data[, (vec_3) := lapply(.SD, function(x) 1 - x), .SDcols = vec_2]
> 
> # Now, the prod alone, by groups. Now we take vec_3 in .SDcols
> 
> data[, (vec_3) := lapply(.SD, prod), 
+      by = c("group_1", "group_2"), .SDcols = vec_3]
> 
> # Now, the outer operation, no need for grouping
> data[, (vec_3) := lapply(.SD, function(x) 1 - x), .SDcols = vec_3]
> Sys.time() - t
Time difference of 4.312201 secs

So: 1.9 minutes versus 4.31 seconds.

You can see when GForce is applied (and when it is not) by enabling verbose output, which adds some overhead:

options(datatable.verbose = TRUE)

Re-running the grouped prod step then prints:

Finding groups using uniqlist on key ... 0.540s elapsed (0.390s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.110s elapsed (0.080s cpu) 
lapply optimization changed j from 'lapply(.SD, prod)' to 'list(prod(PD_3_N0), prod(PD_3_N1), prod(PD_3_N2), prod(PD_3_N3), prod(PD_3_N4))'
GForce optimized j to 'list(gprod(PD_3_N0), gprod(PD_3_N1), gprod(PD_3_N2), gprod(PD_3_N3), gprod(PD_3_N4))' (see ?GForce)
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.028
gforce assign high and low took 0.025
gforce eval took 1.122
1.310s elapsed (0.970s cpu) 
Assigning to 30000000 row subset of 30000000 rows
RHS_list_of_columns == true

Wrapping the call as function(x) +prod(x) deactivates GForce in this example; the comparison below shows that most of the speedup comes from GForce:

> data[, (vec_3) := lapply(.SD, prod), .SDcols = vec_3, by = c("group_1", "group_2")]
Finding groups using uniqlist on key ... 0.510s elapsed (0.420s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.110s elapsed (0.090s cpu) 
lapply optimization changed j from 'lapply(.SD, prod)' to 'list(prod(PD_3_N0), prod(PD_3_N1), prod(PD_3_N2), prod(PD_3_N3), prod(PD_3_N4))'
GForce optimized j to 'list(gprod(PD_3_N0), gprod(PD_3_N1), gprod(PD_3_N2), gprod(PD_3_N3), gprod(PD_3_N4))' (see ?GForce)
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.028
gforce assign high and low took 0.030
gforce eval took 1.212
1.390s elapsed (0.750s cpu) 
Assigning to 30000000 row subset of 30000000 rows
RHS_list_of_columns == true
> Sys.time() - t
Time difference of 3.031504 secs
> t <- Sys.time()
> data[, (vec_3) := lapply(.SD, function(x) +prod(x)), .SDcols = vec_3, by = c("group_1", "group_2")]
Finding groups using uniqlist on key ... 0.530s elapsed (0.480s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.110s elapsed (0.090s cpu) 
lapply optimization changed j from 'lapply(.SD, function(x) +prod(x))' to 'list(..FUN1(PD_3_N0), ..FUN1(PD_3_N1), ..FUN1(PD_3_N2), ..FUN1(PD_3_N3), ..FUN1(PD_3_N4))'
GForce is on, but not activated for this query; left j unchanged (see ?GForce)
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... 
  memcpy contiguous groups took 7.584s for 18964290 groups
  eval(j) took 63.648s for 18964290 calls
00:01:30 elapsed (00:01:22 cpu) 
> Sys.time() - t
Time difference of 1.514286 mins
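Remember to switch verbose output back off once you are done profiling:

options(datatable.verbose = FALSE)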
2024-07-12
Telaroz

Solution

 0

The slow part seems to be the 1 - x arithmetic inside the grouped call. Try moving it out of the grouped computation:

t <- Sys.time()
data[, (vec_3) := lapply(.SD, function(x) 1 - x), .SDcols = vec_2]
data[, (vec_3) := lapply(.SD, prod), by = c("group_1", "group_2"), .SDcols = vec_3]
data[, (vec_3) := lapply(.SD, function(x) 1 - x), .SDcols = vec_3]
Sys.time() - t
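If you use this pattern in several places, one option (a sketch; the helper name and arguments are my own) is to wrap the three GForce-friendly passes in a function:

# Hypothetical helper: writes 1 - prod(1 - x) of in_cols, per group, to out_cols
grouped_one_minus_prod <- function(dt, in_cols, out_cols, by_cols) {
  dt[, (out_cols) := lapply(.SD, function(x) 1 - x), .SDcols = in_cols]
  dt[, (out_cols) := lapply(.SD, prod), by = by_cols, .SDcols = out_cols]
  dt[, (out_cols) := lapply(.SD, function(x) 1 - x), .SDcols = out_cols]
  invisible(dt)
}

grouped_one_minus_prod(data, vec_2, vec_3, c("group_1", "group_2"))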
2024-07-12
Mikko Marttila