fplyr 1.3.0

Apply Functions to Blocks of Files

2023-08-23

After briefly describing the problem that fplyr tries to solve, this vignette will go through all the functions in the package, explaining their usage. In order to make the most of this package, a certain degree of familiarity with the data.table package is suggested. Often, if one has trouble understanding an option, it will be possible to find detailed help in the manual of data.table’s fread() function. Furthermore, basic acquaintance with the *ply family of functions in R, especially lapply(), will also be helpful. You are encouraged to run the code of this vignette on your own and explore the output of the commands.

Introduction

A very common operation when analyzing data is to split the observations into groups and apply a function to each group separately. This operation is so common that R has at least two functions that implement it: by() and aggregate(). However, these functions require the data to be loaded into RAM, and files are often too big to fit in memory. fplyr was born to solve this problem: it makes it possible to perform split-apply-combine operations on very big files. Because the files are read chunk by chunk, only a limited number of rows is stored in memory at any given time.
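
As a point of reference, the in-memory version of this operation can be sketched with base R on the built-in iris data. This is only an illustration of split-apply-combine when the data fit in RAM; it does not use fplyr.

```r
# Mean sepal length per species, computed entirely in memory
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
means
#>      Species Sepal.Length
#> 1     setosa        5.006
#> 2 versicolor        5.936
#> 3  virginica        6.588

# by() applies a function to the rows of each group
by_means <- by(iris[, 1:4], iris$Species, colMeans)
```

Both calls need the whole data set as an R object, which is exactly the requirement that becomes a problem with very large files.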

fplyr combines the strengths of two other packages: iotools and data.table. While iotools has some functions, such as chunk.apply(), to apply a function to chunks of files, the chunks may not reflect the actual groups into which the data are partitioned. In particular, a ‘chunk’ may contain observations pertaining to several different groups, and the task of further splitting them is left to the user. In fplyr, on the other hand, the further splitting is done automatically (thanks also to the data.table package), so the user need not worry about it.
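
The mismatch between chunks and groups can be illustrated in memory with base R. In this sketch, fixed-size pieces of 40 rows stand in for chunks; the size is an arbitrary assumption chosen for illustration and says nothing about how iotools actually sizes its chunks.

```r
# Cut the 150 iris rows into pieces of 40 rows, ignoring the groups
chunk_id <- ceiling(seq_len(nrow(iris)) / 40)
chunks <- split(iris, chunk_id)

# Number of distinct species found in each piece
n_species <- vapply(chunks, function(ch) length(unique(ch$Species)), integer(1))
n_species
#> 1 2 3 4 
#> 1 2 2 1 
```

The second and third pieces each straddle two species, so a function applied to them as they are would mix observations from different groups.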

Preconditions

Before using fplyr you need to ensure that the input file is in the correct format. First and foremost, the data must be amenable to the split-apply-combine paradigm, so the observations must be grouped according to the value of a certain field. We refer to the values of the ‘groupby’ field as the subjects. Thus, for instance, in the famous iris data set, each species would be a different subject. All the observations pertaining to the same subject constitute a block.

In fplyr the input file must be formatted in such a way that the first field contains all the subject IDs. If the IDs are not in the first field, the functions will not work. Moreover, all the observations referring to the same subject must be consecutive; in other words, the file must be sorted on the first field, the reason being that the file is read block by block. Indeed, the subject ID of one line is compared with that of the previous line, and the reading goes on as long as the IDs are the same.¹ Note that fplyr always ensures that all the rows with the same subject ID are read together in the same batch, but only if the rows are consecutive. To make sure that a file complies with these specifications, it is possible to use *nix command-line tools such as awk and sort.
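
For files small enough to have their first field read into memory, the same requirements can also be prepared and checked from R. The following base R sketch assumes a tab-delimited file with a header row; the file it builds from the built-in iris data is purely illustrative.

```r
# Build a small compliant file from the built-in iris data (illustration only)
tmp <- tempfile(fileext = ".tsv")
d <- iris[, c("Species", setdiff(names(iris), "Species"))] # subject IDs first
d <- d[order(d$Species), ]                                  # consecutive blocks
write.table(d, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Read only the first field and check that every ID forms a single run
ids <- read.delim(tmp, colClasses = c("character", rep("NULL", 4)))[[1]]
runs <- rle(ids)
is_blockwise <- anyDuplicated(runs$values) == 0
is_blockwise
#> [1] TRUE
```

If an ID appeared in two separate runs, `is_blockwise` would be FALSE and the file would need to be sorted before use. For files whose ID column does not fit in memory, the command-line tools remain the safer choice.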

As an example file, in this vignette we will use a modified version of the iris dataset where the species has been relocated to the first column. This file is very small and would probably be accommodated even in the RAM of old hardware, so fplyr would not be necessary. Nevertheless, this file is attached to the package, meaning that it will be immediately available to all users, and despite its having only three blocks, it will still illustrate the most important features of fplyr. We begin by storing the path to this file into the variable f:

library(fplyr) # loading fplyr also loads data.table, which provides fread()

f <- system.file("extdata", "dt_iris.csv", package = "fplyr")

# Let's have a look at the first four lines of the file
fread(f, nrows = 4)
#>    Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1:  setosa          5.1         3.5          1.4         0.2
#> 2:  setosa          4.9         3.0          1.4         0.2
#> 3:  setosa          4.7         3.2          1.3         0.2
#> 4:  setosa          4.6         3.1          1.5         0.2

flply

Use flply() when you want to obtain a list where each element corresponds to a subject and contains the result of the processing of the corresponding block. In our examples, the output of flply() will contain three elements, one for each Iris species. The elements of the list will be conveniently named after the subject IDs.

fplyr allows you to apply a function to each block of the file. For the sake of distinguishing the user-specified function to be applied to each block from other functions, we shall refer to it as FUN. In the first example we will obtain the summary() of each species. In general, all the functions in the package support two fundamental arguments: the path to the input file, and FUN.

species_summ <- flply(input = f, FUN = summary)

# Now `species_summ` is a list of three elements; let's show the 'versicolor' element
species_summ$versicolor
#>    Species           Sepal.Length    Sepal.Width     Petal.Length 
#>  Length:50          Min.   :4.900   Min.   :2.000   Min.   :3.00  
#>  Class :character   1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00  
#>  Mode  :character   Median :5.900   Median :2.800   Median :4.35  
#>                     Mean   :5.936   Mean   :2.770   Mean   :4.26  
#>                     3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60  
#>                     Max.   :7.000   Max.   :3.400   Max.   :5.10  
#>   Petal.Width   
#>  Min.   :1.000  
#>  1st Qu.:1.200  
#>  Median :1.300  
#>  Mean   :1.326  
#>  3rd Qu.:1.500  
#>  Max.   :1.800

For flply(), FUN can be any function that takes as input a “data.frame”; summary() was just an example, but other appropriate functions are str(), as.matrix(), and so on. Of course, if you cannot find a function that does what you want, you can write your own FUN, as we shall see in the next example, where we’ll perform hierarchical clustering within each species.² Note that this is also how functions like lapply() work.

clusters <- flply(f, FUN = function(d) {
  dm <- dist(d[, -1]) # Compute the distance matrix, excluding the first field
  hclust(dm) # Perform the clustering and return the object
})

# The `clusters` variable contains one "hclust" object for each species.
# Let's plot the 'setosa' dendrogram
plot(clusters$setosa)
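
For comparison, here is a base R sketch of the in-memory counterpart of the call above, using split() and lapply() on the built-in iris data with the species moved to the first column. It illustrates the analogy with lapply() only; it is not how fplyr is implemented.

```r
# One data.frame per species, with the subject ID in the first column
blocks <- split(iris[, c(5, 1:4)], iris$Species)

# Apply the same FUN to each block; the result is a named list
clusters_mem <- lapply(blocks, function(d) {
  hclust(dist(d[, -1]))
})

names(clusters_mem)
#> [1] "setosa"     "versicolor" "virginica"
```

The difference is that split() needs the whole data set in memory, whereas flply() works from the file path.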

If FUN takes more than one argument, it is possible to pass any additional argument directly to flply(): they will be passed, in turn, to FUN. For instance, suppose that we want to use kmeans() instead of hclust(), and we want to specify the number of centroids as an additional parameter. In the next example we will also define FUN as a separate function before using it, rather than writing an anonymous function like in the previous example. The output will be a “kmeans” object for each species.

kmeans_FUN <- function(d, my_centers) {
  kmeans(d[, -1], centers = my_centers)
}

my_centers <- 2
# We pass `my_centers` to flply(), and flply() passes it to kmeans_FUN
clusters <- flply(f, FUN = kmeans_FUN, my_centers)
# Let's display the centers of the 'setosa' clusters
clusters$setosa$centers
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     4.818182    3.236364     1.433333   0.2303030
#> 2     5.370588    3.800000     1.517647   0.2764706

# Now let's do the same thing, but with three centers for each species
my_centers <- 3
clusters <- flply(f, FUN = kmeans_FUN, my_centers)
clusters$setosa$centers
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     5.512500    4.000000     1.475000    0.275000
#> 2     5.100000    3.513043     1.526087    0.273913
#> 3     4.678947    3.084211     1.378947    0.200000
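
The forwarding of additional arguments follows the same pattern as lapply(), which can be sketched in memory with base R. Because kmeans() starts from randomly chosen centers, the numbers it returns can differ from run to run; calling set.seed() first (the seed value here is arbitrary) makes a given run repeatable.

```r
kmeans_FUN <- function(d, my_centers) {
  kmeans(d[, -1], centers = my_centers)
}

blocks <- split(iris[, c(5, 1:4)], iris$Species)

set.seed(1) # arbitrary seed, only for repeatability
# `my_centers = 2` is not used by lapply() itself: it is handed on to kmeans_FUN
res <- lapply(blocks, kmeans_FUN, my_centers = 2)

dim(res$setosa$centers)
#> [1] 2 4
```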

The last example of this section may be a bit surprising. Since, in R, [[ is a function, nothing prevents us from using it as FUN to select only, say, the second column of each block. Admittedly, however, in this case it would be better to use the select option (see ?flply and ?fread, or wait for the Other options subsection).

sepal_length <- flply(f, `[[`, 2)

# Now `sepal_length` contains all the sepal lengths, divided by species
sepal_length
#> $setosa
#>  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7
#> [20] 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9
#> [39] 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0
#> 
#> $versicolor
#>  [1] 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2
#> [20] 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3
#> [39] 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7
#> 
#> $virginica
#>  [1] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
#> [20] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4
#> [39] 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9

Naming convention

We followed the same convention as the plyr package. The name of each function consists of two letters followed by ‘ply’: the first letter represents the type of input, the second letter characterizes the type of output, and the final ‘ply’ clinches the relation with the existing ‘apply’ family of functions. The first letter is usually ‘f’, because the input is the path to a file. The second letter is ‘l’ if the output is a list, as in flply(); it is ‘t’ if the output is a “data.table”, ‘f’ if the output is another file, and ‘m’ if the output can be multiple things.

ftply

Use ftply() to return a “data.table”; the rows corresponding to the different subjects will be bound together by rows. Needless to say, in this case FUN must return a “data.frame” or a “data.table”, while in flply() there was no such restriction. (When fplyr is loaded, the data.table package is loaded as well.) Moreover, in this case FUN has to take at least two arguments: the first is a “data.table” corresponding to the current block being processed, and the second is a character vector containing the subject ID. This is best explained with an example:

selected_flowers <- ftply(f, function(d, by) {
  if (by == "setosa")
    return(NULL)
  else
    return(d)
})
#> Warning in ftply(f, function(d, by) {: Block setosa returned an empty
#> data.table.

# Let's have a look at the first few lines of the output; note that it starts directly with 'versicolor', because all the 'setosa' flowers have been omitted
head(selected_flowers, 4)
#>       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1: versicolor          7.0         3.2          4.7         1.4
#> 2: versicolor          6.4         3.2          4.5         1.5
#> 3: versicolor          6.9         3.1          4.9         1.5
#> 4: versicolor          5.5         2.3          4.0         1.3

Here, we are skipping the ‘setosa’ species. The result will be equal to the input, except that the rows corresponding to the setosa flowers will be omitted. Notice also that fplyr warns us that one block didn’t return any output. In general, the behavior of ftply() is equivalent to flply() followed by rbind on the resulting list.
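
The "list first, then bind by rows" equivalence can be sketched in memory with base R. Unlike in ftply(), the blocks in this sketch still contain the Species column, and the helper name `keep_block` is ours.

```r
blocks <- split(iris[, c(5, 1:4)], iris$Species)

# Same logic as the FUN above: drop the 'setosa' block, keep the others
keep_block <- function(d, by) if (by == "setosa") NULL else d

# List output (the flply()-like step): one element per subject
as_list <- Map(keep_block, blocks, names(blocks))

# Binding the list by rows (the ftply()-like result); NULL elements are ignored
combined <- do.call(rbind, unname(as_list))
nrow(combined)
#> [1] 100
```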

Importantly, the d argument to FUN contains a “data.table” of the current block being processed, but without the first field. This is just for efficiency concerns; the first field will be added back to the output of FUN. In fact, the following example will show that inside FUN the d data set has only four columns, whereas normally it would have five.

count_cols <- function(d, by) {
  ncol(d)
}
ftply(f, count_cols)
#>       Species V1
#> 1:     setosa  4
#> 2: versicolor  4
#> 3:  virginica  4

The `nblocks` option

ftply() can also be used to quickly glance at the data, much like one would use the head() function. Indeed, we can specify the nblocks option to select only the first block; thus, we can see what the data look like without loading the whole file into memory. By default, in ftply() FUN returns the data without modifying them, so in this case we can avoid specifying FUN. Incidentally, all the other functions support the nblocks option as well; it is intended to be the analogue of nrows in read.table() and fread().

flowers_head <- ftply(f, nblocks = 1)

# Now `flowers_head` has 50 observations, while the original data set had 150. Let's have a look at the first ones.
head(flowers_head, 4)
#>    Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1:  setosa          5.1         3.5          1.4         0.2
#> 2:  setosa          4.9         3.0          1.4         0.2
#> 3:  setosa          4.7         3.2          1.3         0.2
#> 4:  setosa          4.6         3.1          1.5         0.2

Parallelization

Another useful option is parallel, which specifies the number of threads that fplyr can use. Like nblocks, the parallel option is supported by all the functions. It is not necessary to initialize any cluster, but this option has an effect only on Unix-like systems, not on Windows. In the following example we will select, for each block, a random sample of 10 observations.

result <- ftply(f, parallel = 3, FUN = function(d, by) {
  d[sample(seq_len(nrow(d)), 10), ]
})

# Let's check that the output has 30 rows (10 for each species)
nrow(result)
#> [1] 30

ffply

This package was born to deal with files that are too big to fit into the available RAM. With fplyr, it is easy to process such files, but what if even the output of the processing is too big for the memory? One solution is to write the output to a file as soon as it is generated, without ever returning it. This solution is implemented in the ffply() function, but it works only if FUN returns a “data.table” or “data.frame”. It is equivalent to calling ftply() followed by write.table() or fwrite(). This function supports one additional argument with respect to the previously described functions: the path to the output file. In the example, we will replace the original observations with their principal components, block by block.

out <- tempfile() # Create temporary output file
ffply(f, out, function(d, by) {
  # Here, `d` does not contain the subject IDs; they will be automatically added back later
  x <- prcomp(d)$x
  as.data.table(x)
})

# Let's check the result. Note in particular that the subject IDs are present
fread(out, nrows = 4)
#>    Species        PC1         PC2         PC3          PC4
#> 1:  setosa -0.1068424 -0.02489398  0.08216974 -0.034541755
#> 2:  setosa  0.3940472  0.16586593  0.13148092 -0.017551195
#> 3:  setosa  0.3906877 -0.12685112  0.07181182  0.009744303
#> 4:  setosa  0.5117016 -0.02656106 -0.11121361 -0.032673214

For ffply(), FUN must take two arguments, like in ftply(). The return value of ffply() is the number of processed blocks.
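
The idea of writing each result as soon as it is produced can be sketched in memory with base R. This is only an illustration of the principle under our own assumptions (tab-delimited output, a header written once); it is not fplyr's implementation, and the names `out_mem` and `n_blocks` are ours.

```r
out_mem <- tempfile()
blocks <- split(iris[, c(5, 1:4)], iris$Species)
n_blocks <- 0

for (id in names(blocks)) {
  # Process one block and put the subject ID back in front of the result
  pcs <- prcomp(blocks[[id]][, -1])$x
  res <- data.frame(Species = id, pcs)
  # Write the header only once, then append the following blocks
  write.table(res, out_mem, sep = "\t", quote = FALSE, row.names = FALSE,
              col.names = (n_blocks == 0), append = (n_blocks > 0))
  n_blocks <- n_blocks + 1
}

n_blocks
#> [1] 3
```

At no point does the loop hold more than one processed block, which is what keeps the memory footprint of the output small.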

Other options

Besides the options we have already discussed, such as nblocks and parallel, all the functions in the package support a set of core options that modify how the file is read. These options are as follows.

key.sep: The character that delimits the first field from the rest [default: “\t”].
sep: The field delimiter (often equal to key.sep) [default: “\t”].
skip: Number of lines to skip at the beginning of the file [default: 0].
header: Whether the file has a header [default: TRUE].
nblocks: The number of blocks to read [default: Inf].
stringsAsFactors: Whether to convert strings into factors [default: FALSE].
colClasses: Vector or list specifying the class of each field [default: NULL].
select: The columns (names or numbers) to be read [default: NULL].
drop: The columns (names or numbers) not to be read [default: NULL].
col.names: Names of the columns [default: NULL].

With the exception of key.sep, all these options are comprehensively documented in the help page of data.table’s fread() function (?fread). For key.sep, see the help page of iotools’ read.chunk() (?read.chunk).

fmply

For the last function, suppose that the analysis of each block produces several output files; for instance, we may want to compute the principal components as well as a nonlinear transformation of the variables, for each block, and save them to two separate output files. In this case, we can use fmply(). Like ffply(), it supports the output option, but this time it can be a vector of many paths. Accordingly, FUN should now return a list of “data.table”s, one for each of the output files.

out <- c(pca = tempfile(), transf = tempfile())
# Note that the vector need not be named; we use these names just for convenience

analyze_block <- function(d) {
  # Here, `d` does contain the subject IDs, so we have to remove them...
  x <- prcomp(d[, -1])$x
  # ...and add them back manually
  x <- cbind(d[, 1], x)
  # Transform each number 'z' into e^(-z)
  y <- cbind(d[, 1], exp(-d[, -1]))
  # Return a list of two "data.table"s
  list(x, y)
}

fmply(f, out, analyze_block)

Notice that, contrary to ffply(), FUN takes only one argument, and it is the full block, including the first field. Therefore, we had to remove this field when we computed the principal components, and add it back at the end. (In ffply() and ftply() this is done automatically.) Moreover, FUN should now return two values, the first of which is printed to the first output file, and the second of which is printed to the second output file. There is no limit to the number of output files, but the order of the output files and of the values returned by FUN must match (named vectors and lists are not taken into account at the moment).

Sometimes it is also necessary to return objects that are not printable as “data.table”s. For instance, suppose that, besides printing the principal components to the output file, we also wanted to return the "prcomp" object. In these cases, fmply() is still helpful, because it allows FUN to return one more element, which in turn will be returned by fmply(). For example, consider the following modification of analyze_block():

analyze_block2 <- function(d) {
  pca <- prcomp(d[, -1])
  x <- cbind(d[, 1], pca$x)
  y <- cbind(d[, 1], exp(-d[, -1]))
  # 'x' and 'y' are the same as before, but now we add the 'pca' object
  list(x, y, pca)
}

iris_pca <- fmply(f, out, analyze_block2)

# Let's have a look at the screeplot of the 'versicolor' PCA
screeplot(iris_pca$versicolor)

Here, FUN returns three values, but there are only two output files. The third value returned by FUN, then, is returned at the end by fmply(). In particular, the variable iris_pca will be a list of three "prcomp" objects, one for each species.
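
The routing just described (the first values go to the output files in order, any extra value is collected and returned) can be sketched in memory with base R. This is a hedged illustration of the behavior, not fplyr's code: the names `analyze_block_mem`, `out_mem` and `extras` are ours, and since the blocks here are plain data.frames we use drop = FALSE to keep the ID column as a column.

```r
analyze_block_mem <- function(d) {
  pca <- prcomp(d[, -1])
  x <- cbind(d[, 1, drop = FALSE], pca$x)
  y <- cbind(d[, 1, drop = FALSE], exp(-d[, -1]))
  list(x, y, pca)
}

out_mem <- c(tempfile(), tempfile())
blocks <- split(iris[, c(5, 1:4)], iris$Species)
extras <- list()

for (id in names(blocks)) {
  res <- analyze_block_mem(blocks[[id]])
  first <- id == names(blocks)[1]
  # The first length(out_mem) values go to the output files, in order
  for (k in seq_along(out_mem)) {
    write.table(res[[k]], out_mem[k], sep = "\t", quote = FALSE,
                row.names = FALSE, col.names = first, append = !first)
  }
  # A value beyond the output files is kept, named after the subject
  if (length(res) > length(out_mem)) {
    extras[[id]] <- res[[length(out_mem) + 1]]
  }
}

class(extras$versicolor)
#> [1] "prcomp"
```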

¹ Actually, it is a bit more complicated than that: the iotools package takes care of the reading, so the file is read chunk by chunk, not block by block, and then the chunk is split into its constituent blocks; you can read more about how iotools reads files in the help page of the chunk.reader() function.
² Yes, I know this clustering is pointless, but the example is just meant to illustrate the kind of things that one can do, provided that he or she has access to more appropriate data sets.