CodeDepends: Static analysis and dependency detection for R code

Gabriel Becker

Introduction

The CodeDepends package provides a flexible framework for statically analyzing R code (i.e., without evaluating it). It also contains higher-level functionality for: detecting dependencies between R code blocks or expressions, “tree-shaking” (pruning a script down to only the expressions necessary to evaluate a given expression), plotting variable usage timelines, and more.

The workhorses: readScript and getInputs

The primary functions for basic code analysis are readScript, which reads in R scripts of various forms (including .R and .Rmd files), and getInputs, which performs the low-level code analysis.

The readScript function returns a Script object (essentially a list of ScriptNode objects representing the top-level expressions in the script). This can then be passed to getInputs, which in that case returns a ScriptInfo object. A ScriptInfo object can be thought of as a list of ScriptNodeInfo objects, one for each of those top-level expressions.

R expressions can also be passed directly to getInputs, which returns a single ScriptNodeInfo object in that case. While in practice users will generally call getInputs on entire scripts, passing expressions directly is useful for testing and illustration.
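As a minimal sketch of the script-level workflow, the following writes a two-line script to a temporary file, reads it with readScript, and analyzes it with getInputs. The script contents and variable names are invented for illustration, and the type is passed explicitly rather than relying on detection from the file extension.

```r
library(CodeDepends)

# Hypothetical two-line script, written to a temporary .R file
tmp <- tempfile(fileext = ".R")
writeLines(c("x <- 1", "y <- x + 2"), tmp)

sc <- readScript(tmp, type = "R")  # Script: one node per top-level expression
info <- getInputs(sc)              # ScriptInfo: one ScriptNodeInfo per node

length(info)        # number of top-level expressions analyzed
info[[2]]@inputs    # variables read by the second expression
info[[2]]@outputs   # variables created by the second expression
```

Each element of the result is a ScriptNodeInfo object, so its slots can be read with `@` as above.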

As stated above, ScriptNodeInfo objects are the units of information about single expressions being analyzed, and collect various information extracted from examining the expression itself:

library(CodeDepends)
getInputs(quote(x <- y + rnorm(10, sd = z)))
## An object of class "ScriptNodeInfo"
## Slot "files":
## character(0)
## 
## Slot "strings":
## character(0)
## 
## Slot "libraries":
## character(0)
## 
## Slot "inputs":
## [1] "y" "z"
## 
## Slot "outputs":
## [1] "x"
## 
## Slot "updates":
## character(0)
## 
## Slot "functions":
##     + rnorm 
##    NA    NA 
## 
## Slot "removes":
## character(0)
## 
## Slot "nsevalVars":
## character(0)
## 
## Slot "sideEffects":
## character(0)
## 
## Slot "code":
## x <- y + rnorm(10, sd = z)

As we can see, the information includes any string literals used in the expression, split into file and non-file strings based on whether the string appears to point to an existing path at analysis time with respect to the basedir argument (which defaults to the current directory). It also contains any libraries loaded by the code (via library, require, or requireNamespace calls).

Next are the inputs and outputs of the expression, which are the variables used by the expression and created by the expression (via assignment), respectively. By default, these lists will not include symbols that are used in a non-standardly evaluated way (e.g., within the construction of a ggplot2 plot object). These non-standard evaluation variables are collected separately (as nsevalVars).

Variables whose values are updated (i.e., ones which are assigned new values that depend on their existing value) are collected separately. These updates can take a large number of forms, including:

x = x + 5
rownames(x) = 5
x[1:3] = 5
x  = lapply(1:5, function(i) x[i]^2)
x$y = 5

In all of the above cases, the variable x will be listed in both the updates and inputs categories, but NOT in the outputs category.
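To check this classification for a single case, one of the update forms can be passed to getInputs directly. This is only a sketch: the expression is written with `<-` rather than `=` so that it can be wrapped in quote().

```r
library(CodeDepends)

# One of the update forms listed above, quoted so it is analyzed, not evaluated
info <- getInputs(quote(x[1:3] <- 5))

info@updates   # variables modified in place
info@inputs    # variables read by the expression
info@outputs   # variables newly created by the expression
```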

Next are the functions which were called by the expression. These include those invoked as functionals, e.g. via the apply family or the mutate_* and summarize_* families. We note here that the functions list is actually a logical vector, indicating whether the function was locally defined within the script (TRUE), defined within a package (FALSE), or unknown (NA). The names of the vector are the names of the functions. Currently, functions will always be unknown if a single expression is analyzed directly; function provenance detection is only applied to full scripts.

Finally, the list of removed variables, any side effects CodeDepends is able to detect, and a copy of the code itself complete the information extracted.

Symbols within formulas

Symbols within formulas are treated specially when analyzing code, based on the formulaInputs argument to getInputs. If it is FALSE (the default), they are assumed to be evaluated non-standardly (e.g., in the context of a data.frame). If it is TRUE, they are counted as standard inputs. Currently there is no capacity for mixing these behaviors within a single call to getInputs.
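The difference can be seen by analyzing the same formula under both settings. This is a small sketch using a bare formula assignment; the variable names are hypothetical.

```r
library(CodeDepends)

e <- quote(form <- y ~ x)   # hypothetical formula assignment

nse <- getInputs(e)                         # formulaInputs = FALSE (default)
std <- getInputs(e, formulaInputs = TRUE)   # formula symbols as standard inputs

nse@inputs       # standard inputs under the default setting
nse@nsevalVars   # symbols treated as non-standardly evaluated
std@inputs       # standard inputs when formulaInputs = TRUE
```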

Input collectors, function handlers, and customization

The getInputs function accepts a collector argument, which essentially specifies a state tracker to be used when walking the code to collect inputs, functions called, etc.

For largely historical reasons, input collectors are roughly defined as the output from the inputCollector constructor, rather than as a more formal class.

When creating an input collector, various behavior can be customized, primarily in the form of handlers which specify behavior when analyzing calls to specific functions. This is, for example, how CodeDepends knows that some arguments within certain functions are non-standardly evaluated. CodeDepends ships with a robust set of default handlers, but these can be overridden or supplemented with custom handlers by specifying them when constructing the collector, either via the ... arguments or as a list. In both cases, the names are the names of the functions the handlers should be used on.

col = inputCollector(library = function(e, collector, ...) {
    print(paste("Hello", asVarName(e)))
    defaultFuncHandlers$library(e, collector, ...)
})
getInputs(quote(library(CodeDepends)), collector = col)
## [1] "Hello CodeDepends"
## An object of class "ScriptNodeInfo"
## Slot "files":
## character(0)
## 
## Slot "strings":
## character(0)
## 
## Slot "libraries":
## [1] "CodeDepends"
## 
## Slot "inputs":
## character(0)
## 
## Slot "outputs":
## character(0)
## 
## Slot "updates":
## character(0)
## 
## Slot "functions":
## named logical(0)
## 
## Slot "removes":
## character(0)
## 
## Slot "nsevalVars":
## character(0)
## 
## Slot "sideEffects":
## character(0)
## 
## Slot "code":
## library(CodeDepends)

inputCollector also accepts arguments which control what is counted as an input when processing expressions. The inclPrevOutput argument specifies whether output variables should be included as inputs to subsequent expressions when processing multiple expressions as a single block (e.g., when they are wrapped in {}). The checkLibrarySymbols argument controls how symbols which appear to be resolved within libraries are handled, and the funcsAsInputs argument controls how functions which are called are handled. All three default to FALSE.
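As an illustration, the sketch below analyzes the same expression with the default collector and with one constructed using funcsAsInputs = TRUE. The expectation that called functions then also appear among the inputs is an assumption based on the argument's name and description, so compare the two results rather than relying on it.

```r
library(CodeDepends)

e <- quote(x <- rnorm(10))

default_info <- getInputs(e)
# Assumption: with funcsAsInputs = TRUE, called functions are also counted as inputs
funcs_info <- getInputs(e, collector = inputCollector(funcsAsInputs = TRUE))

default_info@inputs
funcs_info@inputs
```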

Dependency detection and script visualization

CodeDepends can visualize code in various ways.
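The dependency information behind these visualizations can also be queried directly, which is the basis of the tree-shaking mentioned in the introduction. The sketch below uses getVariableDepends with asIndex = TRUE to ask which top-level expressions are needed to compute one variable. The script contents and variable names are invented for illustration.

```r
library(CodeDepends)

# Hypothetical script: `total` is computed from `a` only; `b` and `other` are unrelated
tmp <- tempfile(fileext = ".R")
writeLines(c("a <- 1",
             "b <- 2",
             "total <- a + 1",
             "other <- b * 2"), tmp)
sc <- readScript(tmp, type = "R")

# Indices of the top-level expressions needed to compute `total`
idx <- getVariableDepends("total", sc, asIndex = TRUE)
idx
```

Calling the function without asIndex = TRUE should return the corresponding expressions rather than their positions.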

Variable dependency graphs

We can create a graph of the dependencies between variables via the makeVariableGraph function:

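# Locate a sample script shipped with the package and analyze it
f = system.file("samples", "results-multi.R", package = "CodeDepends")
sc = readScript(f)
g = makeVariableGraph(info = getInputs(sc))
# Plotting the graph requires the Rgraphviz package
if(require(Rgraphviz))
  plot(g)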

Call graphs

We can also create call graphs for functions or entire packages:

gg = makeCallGraph("package:CodeDepends")
if(require(Rgraphviz)) {
  gg = layoutGraph(gg, layoutType = "circo")
  graph.par(list(nodes = list(fontsize = 55)))
  renderGraph(gg) ## could also call plot directly
}

Variable definition timelines

Finally, we can display timelines showing when variables are defined, redefined, and used:

f = system.file("samples", "results-multi.R", package = "CodeDepends")
sc = readScript(f)
dtm = getDetailedTimelines(sc, getInputs(sc))
plot(dtm)

## [1] TRUE
# A big/long function
info = getInputs(arima0)
dtm = getDetailedTimelines(info = info)
plot(dtm, var.cex = .7, mar = 4, srt = 30)

## [1] TRUE