In this post I write down my notes from studying parallel programming. Everything below comes from Grant McDermott's lecture, so none of it should be credited as my own work; it is simply a reminder post on applying parallel programming.

Load packages

```r
## Load and install the packages that we'll be using today
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tictoc, parallel, pbapply, future, future.apply, tidyverse,
               hrbrthemes, furrr, RhpcBLASctl, memoise, here)

## My preferred ggplot2 plotting theme (optional)
theme_set(hrbrthemes::theme_ipsum())

## Set future::plan() resolution strategy
plan(multisession)
```
Example 1

```r
# library(tidyverse) ## Already loaded

## Emulate slow function
slow_square =
  function(x = 1) {
    x_sq = x^2
    d = tibble(value = x, value_squared = x_sq)
    Sys.sleep(2)
    return(d)
  }
```

```r
# library(tictoc) ## Already loaded
tic()
serial_ex = lapply(1:12, slow_square) %>% bind_rows()
toc()
```

```
24.084 sec elapsed
```

```r
# future::availableCores() ## Another option
detectCores()
```

```
[1] 10
```
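As an aside, the same pattern can be run with nothing but base R's built-in `parallel` package, using a PSOCK cluster so it also works on Windows. This is just an illustrative sketch, not part of the original lecture: the helper name and the shortened sleep are mine, and `data.frame` stands in for `tibble` to keep it dependency-free.

```r
library(parallel)

## Base-R stand-in for slow_square(), with a shorter pause for illustration
slow_square_base = function(x = 1) {
  Sys.sleep(0.1)
  data.frame(value = x, value_squared = x^2)
}

cl = makeCluster(2)                  # two PSOCK worker processes
res = parLapply(cl, 1:4, slow_square_base)
stopCluster(cl)                      # always release the workers

out = do.call(rbind, res)
out$value_squared                    # 1 4 9 16
```

Unlike forking, each PSOCK worker is a fresh R session, so anything the workers need must be shipped to them; here `slow_square_base` travels along automatically because it is passed as the `FUN` argument.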
Use future.apply

```r
# library(future.apply) ## Already loaded
# plan(multisession) ## Already set above
tic()
future_ex = future_lapply(1:12, slow_square) %>% bind_rows()
toc(log = TRUE)
```

```
6.711 sec elapsed
```

Execution time is greatly reduced! The results are also equivalent:

```r
all.equal(serial_ex, future_ex)
```

```
[1] TRUE
```
The `purrr` package also has something similar, via its parallel counterpart `furrr`:

```r
# library(furrr) ## Already loaded
# plan(multisession) ## Already set above
tic()
furrr_ex = future_map(1:12, slow_square) |> list_rbind()
toc()
```

```
5.098 sec elapsed
```
Example 2

This is another example. I will not run it here, to save rendering time.

```r
## Set seed (for reproducibility)
set.seed(1234)
# Set sample size
n = 1e6

## Generate a large data frame of fake data for a regression
our_data =
  tibble(x = rnorm(n), e = rnorm(n)) %>%
  mutate(y = 3 + 2*x + e)

## Function that draws a sample of 10,000 observations, runs a regression and
## extracts the coefficient value on the x variable (should be around 2).
bootstrp =
  function(i) {
    ## Sample the data
    sample_data = sample_n(our_data, size = 1e4, replace = TRUE)
    ## Run the regression on our sampled data and extract the x coefficient.
    x_coef = lm(y ~ x, data = sample_data)$coef[2]
    ## Return value
    return(tibble(x_coef = x_coef))
  }

set.seed(123L) ## Optional to ensure that the results are the same

## 10,000-iteration simulation
tic()
sim_serial = lapply(1:1e4, bootstrp) %>% bind_rows()
toc(log = TRUE)
# Takes about 36 seconds.
```
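For what it's worth, this simulation can also be parallelized with base R alone. The sketch below is my own adaptation, not the lecture's code: `tibble`/`sample_n` are swapped for base equivalents, and the run is shrunk to 100 draws from 10^5 rows purely so it finishes quickly.

```r
library(parallel)

set.seed(1234)
n = 1e5
our_data = data.frame(x = rnorm(n), e = rnorm(n))
our_data$y = 3 + 2 * our_data$x + our_data$e

## Base-R version of bootstrp(): resample rows, regress, keep the x coefficient
bootstrp_base = function(i) {
  rows = sample(nrow(our_data), size = 1e4, replace = TRUE)
  unname(coef(lm(y ~ x, data = our_data[rows, ]))["x"])
}

cl = makeCluster(2)
clusterSetRNGStream(cl, 123)         # reproducible RNG streams across workers
clusterExport(cl, "our_data")        # PSOCK workers start empty, so export data
x_coefs = unlist(parLapply(cl, 1:100, bootstrp_base))
stopCluster(cl)

mean(x_coefs)                        # should be close to 2
```

Note the `clusterExport()` call: with sockets, `our_data` must be copied to every worker, which is exactly the memory overhead the forking comparison below is about.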
Summary of parallel programming packages in R

The `future` ecosystem is very useful: it provides a simple and unified approach to parallel programming. You can usually tap into this ecosystem through the `future.apply` or `furrr` packages.
If you're on Linux or Mac, try forking!

There are two different ways to run code in parallel:

| Forking | Parallel sockets (PSOCK) |
|---|---|
| Fast and memory efficient. | Slower and more memory-intensive (than forking). |
| Only available for Unix-based systems. | Works on every operating system, incl. Windows. |
| Potentially unstable in an IDE like RStudio. | Fine to use in an IDE like RStudio. |

How to do this:

- Change your resolution plan to `plan(multicore)`, and
- Run your R script from the terminal using, say, `$ Rscript -e 'rmarkdown::render("mydoc.Rmd", "all")'` or `$ Rscript myfile.R`.
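If you'd rather fork without going through `future`, base R's `parallel::mclapply()` does it directly. A minimal sketch of my own (the helper function is illustrative; on Windows `mclapply()` only supports `mc.cores = 1`, hence the guard):

```r
library(parallel)

slow_id = function(x) {
  Sys.sleep(0.1)
  x
}

## Fork into 2 processes on Unix-alikes; fall back to serial on Windows
cores = if (.Platform$OS.type == "unix") 2L else 1L
res = mclapply(1:4, slow_id, mc.cores = cores)
unlist(res)                          # 1 2 3 4
```

Because forked workers share the parent's memory, no export step is needed: `slow_id` and any data in your session are immediately visible to the children.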
Implicit parallel programming

Some packages actually apply parallel programming implicitly (e.g. `data.table`). In those cases, you may not need to resort to explicit parallel programming as much.
Setting how many cores to use

`plan(multisession)` and `plan(multicore)` automatically default to using all of your cores. You can change that by running, say, `plan(multisession(workers = detectCores()-1))`.
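The same "all cores minus one" idea can be sketched in base R, if you aren't using `future` (the worker-count choice here is purely illustrative):

```r
library(parallel)

## Leave one core free so the OS/IDE stays responsive; detectCores() can
## return NA on some platforms, so na.rm guards the fallback to 1 worker
n_workers = max(1L, detectCores() - 1L, na.rm = TRUE)
cl = makeCluster(n_workers)
length(cl)                           # number of worker processes
stopCluster(cl)
```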