12  Scalability

Important

Effective Data Science is still a work-in-progress. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it.

If you would like to contribute to the development of EDS, you may do so at https://github.com/zakvarty/data_science_notes.

12.1 Scalability and Production

When put into production, code gets used more often and on more data. We will likely have to consider the scalability of our methods in terms of:

  • Computation time

  • Memory requirements

When doing so, we have to balance a trade-off between development costs and usage costs.

12.1.1 Example: Bayesian Inference

  • An MCMC sampler originally took ~24 hours to run.

  • Identifying and amending bottlenecks in the code reduced this to ~24 minutes.

Is this actually better? That will depend on a number of factors, including:

  • the human hours invested in making the change
  • how frequently the code will be used
  • whether safety, stability, generality or readability were traded away for scalability

12.1.2 Knowing when to worry

Sub-optimal optimisation can be worse than doing nothing. As Donald Knuth put it:

… programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimisation is the root of all evil (or at least most of it) in programming.

[Figure: computation time against coding time, with feasible solutions shown as crosses and the Pareto frontier as an orange curve.]

12.1.3 Our Focus

Writing code that scales well in terms of computation time or memory use is a huge topic. In this section we restrict our aims to:

  • Basic profiling to find bottlenecks.
  • Strategies for writing scalable (R) code.
  • Signposting advanced methods and further reading.

12.2 Basics of Code Profiling

12.2.1 R as a stopwatch

The simplest way to profile your code is to time how long it takes to run. There are three common ways to do this.

Firstly, you could record the time just before your code starts executing and the time at which it completes, then look at the difference between the two.

t_start <- Sys.time()
Sys.sleep(0.5) # YOUR CODE
t_end <- Sys.time()

t_end - t_start
#> Time difference of 0.5026569 secs

The system.time() function provides a shorthand for this pattern, and by reporting user and system CPU time separately from elapsed wall-clock time it extends the functionality to parallel code too.

system.time(
  Sys.sleep(0.5)
)
#>    user  system elapsed 
#>   0.000   0.001   0.503

The tictoc package has similar features, but also allows you to add intermediate timers to better understand which parts of your code take the most time to run.

library(tictoc)

tic() 
Sys.sleep(0.5) # YOUR CODE 
toc()
#> 0.504 sec elapsed

With tictoc we can get fancy, naming our timers and nesting them:

tic("total")
tic("first, easy part")
Sys.sleep(0.5)
toc(log = TRUE)
#> first, easy part: 0.505 sec elapsed
tic("second, hard part")
Sys.sleep(3)
toc(log = TRUE)
#> second, hard part: 3.01 sec elapsed
toc()
#> total: 3.52 sec elapsed

If your code is already very fast (but will be run very many times, so further efficiency gains are still worthwhile) then these methods may fail, because they cannot measure execution time at a high enough resolution. In those cases you might want to explore the {microbenchmark} package.
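For example, a minimal sketch, assuming the {microbenchmark} package is installed: each expression is evaluated many times and the distribution of timings is reported at sub-millisecond resolution.

x <- runif(100)
microbenchmark::microbenchmark(
  loop = {z <- numeric(100); for (i in seq_along(x)) z[i] <- sqrt(x[i])},
  vectorised = sqrt(x),
  times = 100
)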

12.3 Profiling Your Code

To diagnose scaling issues you have to understand what your code is doing.

  • Stop the code at time \(\tau\) and examine the call stack.

    • This records the function currently being evaluated, the function that called it, the function that called that, …, up to the top-level function call.
  • Do this many times and you can estimate the proportion of time spent evaluating each function and the working memory (RAM) used over time.

12.3.1 Profiling: Toy Example

Suppose we have the following code in a file called prof-vis-example.R.

h <- function() {
  profvis::pause(1)
}

g <- function() {
  profvis::pause(1)
  h()
}

f <- function() {
  profvis::pause(1)
  g()
  profvis::pause(1)
  h()
}

Then the call stack for f() would look something like this.

[Figure: call stack for f(). The x-axis shows time; nested function calls are shown as stacked blocks of varying widths, with the deepest call at the top. At time \(\tau = 2.5\), pause() is being evaluated within h(), which was called by g(), which was in turn called by f().]

We can examine the true call stack using the profvis() function from the profvis package. If we save the code in a separate file and source it into our session, profvis() will also give us line-by-line information about the time and memory demands of our code.

source("prof-vis-example.R")
profvis::profvis(f())

[Figure: RStudio window showing an interactive profile of f(). A vertical bar chart shows how much time is spent evaluating the functions f(), g(), h() and pause().]

In both the upper histogram and the lower flame plot we can see that the majority of time is spent in pause() and h(). We have to be careful here: the upper plot shows the total amount of time spent in each function across all of its calls, so h() appears to take longer than g() only because it is called more often in the code snippet we are profiling.

12.4 Notes on Time Profiling

You will get slightly different results each time you run the profiler.

  • These variations are due to changes in the internal state of your computer.
  • This is usually not a big deal, and mainly affects the fastest parts of your code.
  • Be careful with stochastic simulations: use set.seed() to make a fair comparison over many runs, as in the sketch below.
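A minimal sketch of this (the simulated computation is just an illustrative choice): resetting the seed before each run means that both implementations are timed on identical random data.

set.seed(1234)
system.time(median(rnorm(1e6)))

set.seed(1234)
system.time(quantile(rnorm(1e6), probs = 0.5))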

12.4.1 Source code and compiled functions

If you write a function, you can see the source of that function by calling its name.

pad_with_NAs <- function(x, n_left, n_right){
  c(rep(NA, n_left), x, rep(NA, n_right))
}
pad_with_NAs
#> function(x, n_left, n_right){
#>   c(rep(NA, n_left), x, rep(NA, n_right))
#> }

This is equally true for functions within packages.

eds::pad_with_NAs
#> function (x, n_left, n_right) 
#> {
#>     stopifnot(n_left >= 0)
#>     stopifnot(n_right >= 0)
#>     stopifnot(class(x) %in% c("character", "complex", "integer", 
#>         "logical", "numeric", "factor"))
#>     c(rep(NA, n_left), x, rep(NA, n_right))
#> }
#> <bytecode: 0x7fb6b1b18550>
#> <environment: namespace:eds>

Some functions use compiled code that is written in another language. This is the case for dplyr’s arrange(), which calls some compiled C++ code.

dplyr::arrange
#> function (.data, ..., .by_group = FALSE) 
#> {
#>     UseMethod("arrange")
#> }
#> <bytecode: 0x7fb6b607ffc8>
#> <environment: namespace:dplyr>

It is also true for many functions in base R, for which there is (for obvious reasons) no R source code.

mean
#> function (x, ...) 
#> UseMethod("mean")
#> <bytecode: 0x7fb6b4270080>
#> <environment: namespace:base>


These compiled functions have no R source code, and the profiling methods we have used here don’t extend into compiled code. See {jointprof} if you really need this profiling functionality.

12.5 Memory Profiling

profvis() can similarly measure the memory usage of your code.

x <- integer()
for (i in 1:1e4) {
  x <- c(x, i)
}

[Figure: RStudio interactive code profiler. The code that iteratively extends a vector uses and clears a lot of working memory, and the garbage collector (<GC>) accounts for most of the run time.]

  • Copy-on-modify behaviour makes growing objects slow.
  • Pre-allocate storage where possible, as in the sketch below.
  • For further strategies and structures, see the R Inferno and Efficient R.
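A minimal sketch of pre-allocation for the loop profiled above: the full-length vector is created once and filled in place, so no copies are made as it grows.

x <- integer(1e4) # allocate the full length up front
for (i in 1:1e4) {
  x[i] <- i       # fill in place, no copying
}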

12.6 Tips to work at scale

TL;DR: pick your object types carefully, vectorise your code, and as a last resort implement your code in a faster language.

12.6.1 Vectorise

These two pieces of code perform the same task, but the second is much faster because it involves far fewer function calls.

x <- 1:10
y <- 11:20 
z <- rep(NA, length(x))

# Loop: one R-level function call per element
for (i in seq_along(x)) {
  z[i] <- x[i] * y[i]
}

# Vectorised: a single call to compiled code
x <- 1:10
y <- 11:20 
z <- x * y

Where possible, write and use functions that take advantage of vectorised inputs. E.g.

rnorm(n = 100, mean = 1:10, sd = rep(1, 10))

Be careful of recycling! When vectors of different lengths are combined, the shorter one is silently reused to match the longer one, as in the sketch below.
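A minimal sketch of recycling: the two-element vector is repeated, without any warning, to match the length of the longer vector.

1:6 + c(0, 100)
#> [1]   1 102   3 104   5 106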

12.6.2 Linear Algebra

Matrix operations provide another route to vectorisation: they are carried out by optimised, compiled linear algebra routines.

X <- diag(x = c(2, 0.5))
y <- matrix(data = c(1, 1), ncol = 1)

X %*% y
#>      [,1]
#> [1,]  2.0
#> [2,]  0.5
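As an illustration of why this helps, a minimal sketch comparing a row-by-row loop against a single matrix-vector product:

X <- matrix(rnorm(3000), nrow = 1000, ncol = 3)
beta <- c(0.5, -1, 2)

# Loop: one R-level iteration per row
yhat_loop <- numeric(nrow(X))
for (i in seq_len(nrow(X))) {
  yhat_loop[i] <- sum(X[i, ] * beta)
}

# Linear algebra: a single call to compiled routines
yhat_mat <- drop(X %*% beta)

all.equal(yhat_loop, yhat_mat)
#> [1] TRUE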

More on vectorising: Noam Ross Blog Post

12.7 For loops in disguise

12.7.1 The apply family

The apply family [apply(), lapply(), mapply(), …] is the functional programming equivalent of a for loop.

Apply a function to each element of a list-like object.

A <- matrix(data = 1:12, nrow = 3, ncol = 4)
A
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    4    7   10
#> [2,]    2    5    8   11
#> [3,]    3    6    9   12
# MARGIN = 1 => rows,  MARGIN = 2 => columns
apply(X = A, MARGIN = 1, FUN = sum)
#> [1] 22 26 30

This generalises functions like rowSums() and those from {matrixStats}, where for some special operations all of the necessary calculation is done in compiled code.

rowSums(A)
#> [1] 22 26 30
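The list-based members of the family work in the same way. For example, lapply() applies a function to each element of a list and always returns a list, while sapply() simplifies the result where it can:

scores <- list(a = 1:3, b = 4:6)
lapply(X = scores, FUN = sum)
#> $a
#> [1] 6
#> 
#> $b
#> [1] 15
sapply(X = scores, FUN = sum)
#>  a  b 
#>  6 15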

12.7.2 {purrr}

Iterate over a single object with map().

mu <- c(-10, 0, 10)
purrr::map(.x = mu, .f = rnorm, n = 5)
#> [[1]]
#> [1]  -9.880129 -10.451713  -9.567073 -11.239899 -10.617120
#> 
#> [[2]]
#> [1] 1.4725695 1.3105922 1.6106729 1.6752833 0.1976809
#> 
#> [[3]]
#> [1]  9.715314 10.107474  9.699145  9.101771  9.297176

Iterate over multiple objects with map2() and pmap().

mu <- c(-10, 0, 10)
sigma <- c(0, 0.1, 0)
purrr::map2(.x = mu, .y = sigma, .f = rnorm, n = 5)
#> [[1]]
#> [1] -10 -10 -10 -10 -10
#> 
#> [[2]]
#> [1] -0.08854929  0.06603302 -0.07120175 -0.10550448  0.08062672
#> 
#> [[3]]
#> [1] 10 10 10 10 10
mu <- c(-10, 0, 10)
sigma <- c(0, 0.1, 0)

purrr::pmap(
  .f = rnorm, 
  n = 5,
  .l = list(
    mean = mu, 
    sd = sigma))
#> [[1]]
#> [1] -10 -10 -10 -10 -10
#> 
#> [[2]]
#> [1]  0.01120288  0.10556350 -0.10836403 -0.05334010  0.07257224
#> 
#> [[3]]
#> [1] 10 10 10 10 10

For more details and variants see Advanced R chapters 9-11 on functional programming.

12.8 Easy parallelisation with furrr

  • {parallel} and {future} allow parallel computation over multiple cores.

  • Powerful, but steep learning curve.

  • {furrr} makes this very easy: just add a future_ prefix to {purrr} verbs.
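One detail: {furrr} runs sequentially unless a parallel execution plan has been declared, so something like the following (the worker count here is an illustrative choice) is needed before the example below will actually use multiple cores.

future::plan(future::multisession, workers = 2)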

mu <- c(-10, 0, 10)
furrr::future_map(
  .x = mu, 
  .f = rnorm,
  .options = furrr::furrr_options(seed = TRUE),
  n = 5) 
#> [[1]]
#> [1]  -9.983428 -12.750038 -10.880379  -8.767607  -9.385472
#> 
#> [[2]]
#> [1]  0.04170641  1.04998508 -0.09269608 -0.08873426 -0.71685557
#> 
#> [[3]]
#> [1]  8.588356 10.094437 11.217455  9.735384  9.868978

This is, of course, excessive for such a small example!

One thing to be aware of is that random number generation needs very careful handling when parallelising, so that each worker gets its own reproducible stream; the furrr_options(seed = TRUE) argument above does this. There are many options for how you might want to set this up; see R-bloggers for more details.

12.9 Sometimes R doesn’t cut it


{Rcpp}: an API for running C++ code in R. Useful when you need:

  • loops that must be run in order
  • lots of function calls (e.g. deep recursion)
  • optimised data structures
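As a flavour of what this looks like, a minimal sketch assuming {Rcpp} is installed (the function name cumsum_cpp is our own illustrative choice): an order-dependent loop, the cumulative sum, written in C++, compiled, and called from R.

Rcpp::cppFunction("
NumericVector cumsum_cpp(NumericVector x) {
  NumericVector out(x.size());
  double total = 0;
  for (int i = 0; i < x.size(); ++i) {
    total += x[i];  // each iteration depends on the previous one
    out[i] = total;
  }
  return out;
}")

cumsum_cpp(c(1, 2, 3))
#> [1] 1 3 6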

Rewriting R code in C++ and other low-level programming languages is beyond our scope, but it is good to know that the option exists. A starting point is Advanced R, Chapter 25.

12.10 Wrapping up

Summary

  1. Pick your battles wisely.
  2. Target your energy with profiling.
  3. Scale loops with vectorisation.
  4. Scale further with parallel processing.
  5. Scale in another language.
