Fastest data operations with least memory in tidy syntax

Why tidyft?

Before tidyft, I’ve designed a package named tidyfst. Backed by data.table, it is fast and convenient. By then, I was not so interested in modification by reference, which always causes trouble in my workflow. Therefore, I use a lot of functions to make copies so as to suppress the in place replacement. However, when it comes to big data, simply making a new copy of the original data set could be time consuming and memory inefficient. So I tried to write some functions using the feature of modification by reference. This ends up in inconsistency of many functions in the tidyfst package. In the end, I removed all the in place replacement functions in tidyfst and build a new package instead. This is how tidyft comes into being.

The philosophy of tidyft

You cannot step into the same river twice, for other waters are continually flowing on.

—— Heraclitus

If you try to do data operations on any data.table(s), never use it again for futher analysis, because it is not the data you knew before. And you might never figure out what have happened and what have been changed in that process. If you really want to use it again, try make a copy first by using copy(), which might take extra time and space (that’s why tidyft avoid doing this all the time).

Another rule is, tidyft only deals with data.table(s), the raw data.frame and other formats such as tibble could not work. If you already have lots of data.frames in the environment, try these codes.

library(tidyft)
#> 
#> Life's short, use R.
#> 
#> Attaching package: 'tidyft'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag

# make copies
copy(iris) -> a
copy(mtcars) -> b

# before
class(a)
#> [1] "data.frame"
class(b)
#> [1] "data.frame"

# convert codes
lapply(ls(),get) %>%
  lapply(setDT) %>% 
  invisible()

# after
class(a)
#> [1] "data.table" "data.frame"
class(b)
#> [1] "data.table" "data.frame"

One last thing, while modifications are carried out in place, doesn’t mean that the results could not be showed after operation. The data.table package would return it invisibly, but in tidyft, the final results are always printed if possible. This brings no reduction to the computation performance.

Working with fst

tidyft would not be so powerful without fst. I first introduce this workflow into tidyfst. In such workflow, you do not have to read all data into memory, only import the needed data when necessary. tidyft is not so convenient for in-memory operations, but it works very well (if not best) with the fst workflow. Here we’ll make some examples.

rm(list = ls())

library(tidyft)
# make a large data.frame
iris[rep(1:nrow(iris),1e4),] -> dt
# size: 1500000 rows, 5 columns
dim(dt)
#> [1] 1500000       5
# save as fst table
as_fst(dt) -> ft
# remove the data.frame from RAM
rm(dt)

# inspect the fst table of large iris
ft
#> <fst file>
#> 1500000 rows, 5 columns (dta1f44f7113fb.fst)
#> 
#>         Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#>             <double>    <double>     <double>    <double>  <factor>
#> 1                5.1         3.5          1.4         0.2    setosa
#> 2                4.9         3.0          1.4         0.2    setosa
#> 3                4.7         3.2          1.3         0.2    setosa
#> 4                4.6         3.1          1.5         0.2    setosa
#> 5                5.0         3.6          1.4         0.2    setosa
#> --                --          --           --          --        --
#> 1499996          6.7         3.0          5.2         2.3 virginica
#> 1499997          6.3         2.5          5.0         1.9 virginica
#> 1499998          6.5         3.0          5.2         2.0 virginica
#> 1499999          6.2         3.4          5.4         2.3 virginica
#> 1500000          5.9         3.0          5.1         1.8 virginica
summary_fst(ft)
#> <fst file>
#> 1500000 rows, 5 columns (dta1f44f7113fb.fst)
#> 
#> * 'Sepal.Length': double
#> * 'Sepal.Width' : double
#> * 'Petal.Length': double
#> * 'Petal.Width' : double
#> * 'Species'     : factor

# list the variables in the environment
ls() # only the ft exists
#> [1] "ft"

The as_fst could save any data.frame as “.fst” file in temporary file and parse it back as fst table. Fst table is small in RAM, but if you want to get any part of the data.frame, you can get it in almost no time:

ft %>% 
  slice_fst(5555:6666)  # get 5555 to 6666 row
#>       Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#>              <num>       <num>        <num>       <num>     <fctr>
#>    1:          5.0         3.6          1.4         0.2     setosa
#>    2:          5.4         3.9          1.7         0.4     setosa
#>    3:          4.6         3.4          1.4         0.3     setosa
#>    4:          5.0         3.4          1.5         0.2     setosa
#>    5:          4.4         2.9          1.4         0.2     setosa
#>   ---                                                             
#> 1108:          5.9         3.0          4.2         1.5 versicolor
#> 1109:          6.0         2.2          4.0         1.0 versicolor
#> 1110:          6.1         2.9          4.7         1.4 versicolor
#> 1111:          5.6         2.9          3.6         1.3 versicolor
#> 1112:          6.7         3.1          4.4         1.4 versicolor

Except for slice_fst, there are also other functions for subsetting the data, such as select_fst,filter_fst. Good practice is: Make subsets of the data and use the least needy data to do operations. For very large data sets, you may try to do tests on a sample of the data (using slice or select to get several rows or columns) first before you implement a huge operation. Now let’s do a slightly complex manipulation. We’ll use sys_time_print to measure the running time.


sys_time_print({
  res =  ft %>% 
   select_fst(Species,Sepal.Length,Sepal.Width) %>% 
   rename(group = Species,sl = Sepal.Length,sw = Sepal.Width) %>% 
   arrange(group,sl) %>% 
   filter(sl > 5) %>% 
   distinct(sl,.keep_all = TRUE) %>% 
   summarise(sw = max(sw),by = group)
})
#> [1] "Finished in 0.390s elapsed (0.470s cpu)"

res
#>         group    sw
#>        <fctr> <num>
#> 1:     setosa   4.4
#> 2: versicolor   3.3
#> 3:  virginica   3.8

This should be pretty fast. Becasue when we use the data in fst table, we never get them until using the “_fst” suffix functions, so the tidyft functions never modify the data in the fst file or fst table. That is to say, we do not have to worry about the modification by reference any more. No copies made, fastest ever.

Performance

The fst workflow could also be working with other tools, though less efficient. Now let’s compare the performance of tidyft, data.table, dtplyr and dplyr.


rm(list = ls())

library(profvis)
library(data.table)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#> 
#>     between, first, last
#> The following objects are masked from 'package:tidyft':
#> 
#>     add_count, anti_join, arrange, count, cummean, distinct, filter,
#>     full_join, group_by, groups, inner_join, lag, lead, left_join,
#>     mutate, nth, pull, relocate, rename, right_join, select,
#>     select_vars, semi_join, slice, slice_head, slice_max, slice_min,
#>     slice_sample, slice_tail, summarise, transmute, ungroup
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dtplyr)
library(tidyft)


# make a large data.frame
iris[rep(1:nrow(iris),1e4),] -> dt
# size: 1500000 rows, 5 columns
dim(dt)
#> [1] 1500000       5
# save as fst table
as_fst(dt) -> ft
# remove the data.frame from RAM
rm(dt)
  

profvis({
  
  res1 = ft %>% 
    select_fst(Species,Sepal.Length,Sepal.Width,Petal.Length) %>% 
    dplyr::select(-Petal.Length) %>% 
    dplyr::rename(group = Species,sl = Sepal.Length,sw = Sepal.Width) %>% 
    dplyr::arrange(group,sl) %>% 
    dplyr::filter(sl > 5) %>% 
    dplyr::distinct(sl,.keep_all = TRUE) %>% 
    dplyr::group_by(group) %>% 
    dplyr::summarise(sw = max(sw))
  
  res2 = ft %>% 
    select_fst(Species,Sepal.Length,Sepal.Width,Petal.Length) %>% 
    lazy_dt() %>% 
    dplyr::select(-Petal.Length) %>% 
    dplyr::rename(group = Species,sl = Sepal.Length,sw = Sepal.Width) %>% 
    dplyr::arrange(group,sl) %>% 
    dplyr::filter(sl > 5) %>% 
    dplyr::distinct(sl,.keep_all = TRUE) %>% 
    dplyr::group_by(group) %>% 
    dplyr::summarise(sw = max(sw)) %>% 
    as.data.table()
  
  res3 = ft[,c("Species","Sepal.Length","Sepal.Width","Petal.Length")] %>%  
    setDT() %>%
    .[,.SD,.SDcols = -"Petal.Length"] %>% 
    setnames(old =c("Species","Sepal.Length","Sepal.Width"),
             new = c("group","sl","sw")) %>% 
    setorder(group,sl) %>% 
    .[sl>5] %>% unique(by = "sl") %>% 
    .[,.(sw = max(sw)),by = group]
  
  
  res4 =  ft %>% 
    tidyft::select_fst(Species,Sepal.Length,Sepal.Width,Petal.Length) %>% 
    tidyft::select(-Petal.Length) %>% 
    tidyft::rename(group = Species,sl = Sepal.Length,sw = Sepal.Width) %>% 
    tidyft::arrange(group,sl) %>% 
    tidyft::filter(sl > 5) %>% 
    tidyft::distinct(sl,.keep_all = TRUE) %>% 
    tidyft::summarise(sw = max(sw),by = group)
  
  
})
setequal(res1,res2) #> [1] FALSE setequal(res2,res3) #> [1] TRUE setequal(res3,res4) #> [1] TRUE

Because tidyft is based on data.table, therefore, if you always use data.table correctly, then tidyft should not perform better than data.table (I do use some tricks, by never do column selection but delete the unselected ones instead, which is faster and more memory efficient than using .SDcols in data.table). However, tidyft has a very different syntax, which might be more readable. And lots of complex operations of data.table has been wrapped in it. This could save your day to write the correct codes sometimes. I hope all my time devoted to this work could possibly save some of your valuable time on data operations of big datasets.

Session Information

sessionInfo()
#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19044)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                               
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C                               
#> [5] LC_TIME=Chinese (Simplified)_China.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dtplyr_1.2.1      dplyr_1.0.9       data.table_1.14.2 profvis_0.3.7    
#> [5] fstcore_0.9.12    tidyft_0.5.7     
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.9        bslib_0.4.0       compiler_4.2.1    pillar_1.8.0     
#>  [5] jquerylib_0.1.4   tools_4.2.1       digest_0.6.29     tibble_3.1.8     
#>  [9] jsonlite_1.8.0    evaluate_0.16     lifecycle_1.0.1   pkgconfig_2.0.3  
#> [13] rlang_1.0.4       DBI_1.1.3         cli_3.3.0         rstudioapi_0.13  
#> [17] yaml_2.3.5        parallel_4.2.1    xfun_0.32         fastmap_1.1.0    
#> [21] stringr_1.4.0     knitr_1.39        generics_0.1.3    sass_0.4.2       
#> [25] htmlwidgets_1.5.4 vctrs_0.4.1       tidyselect_1.1.2  glue_1.6.2       
#> [29] R6_2.5.1          fansi_1.0.3       rmarkdown_2.14    purrr_0.3.4      
#> [33] magrittr_2.0.3    ellipsis_0.3.2    htmltools_0.5.3   assertthat_0.2.1 
#> [37] fst_0.9.8         utf8_1.2.2        stringi_1.7.8     cachem_1.0.6     
#> [41] crayon_1.5.1