Fastest data operations with least memory in tidy syntax

The philosophy of tidyft

You cannot step into the same river twice, for other waters are continually flowing on.

—— Heraclitus

If you try to do data operations on any data.table(s), never use it again for futher analysis, because it is not the data you know before. And you might never figure out what have happened and what has been changed in that process. If you really want to use it again, try make a copy first using copy(), which might take extra time and space (that’s why tidyft avoid doing this all the time).

Another rule is, tidyft only deals with data.table(s), the raw data.frame and other formats such as tibble could not work. If you already have lots of data.frames in the environment, try these codes.

library(tidyft)
#> 
#> Life's short, use R.
#> 
#> Attaching package: 'tidyft'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag

# make copies
copy(iris) -> a
copy(mtcars) -> b

# before
class(a)
#> [1] "data.frame"
class(b)
#> [1] "data.frame"

# convert codes
lapply(ls(),get) %>%
  lapply(setDT) %>%
  invisible()

# after
class(a)
#> [1] "data.table" "data.frame"
class(b)
#> [1] "data.table" "data.frame"

One last thing, while modifications are carried out in place, doesn’t mean that the results could not be showed after operation. The data.table package would return it invisibly, but in tidyft, the final results are always printed if possible. This brings no reduction to the computation performance.

Working with fst

tidyft would not be so powerful without fst. I first introduce this workflow into tidyfst. In such workflow, you do not have to read all data into memory, only import the needed data when necessary. tidyft is not so convenient for in-memory operations, but it works very well (if not best) with the fst workflow. Here we’ll make some examples.

rm(list = ls())

library(tidyft)
# make a large data.frame
iris[rep(1:nrow(iris),1e4),] -> dt
# size: 1500000 rows, 5 columns
dim(dt)
#> [1] 1500000       5
# save as fst table
as_fst(dt) -> ft
# remove the data.frame from RAM
rm(dt)

# inspect the fst table of large iris
ft
#> <fst file>
#> 1500000 rows, 5 columns (dt6ff4325f5974.fst)
#> 
#>         Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#>             <double>    <double>     <double>    <double>  <factor>
#> 1                5.1         3.5          1.4         0.2    setosa
#> 2                4.9         3.0          1.4         0.2    setosa
#> 3                4.7         3.2          1.3         0.2    setosa
#> 4                4.6         3.1          1.5         0.2    setosa
#> 5                5.0         3.6          1.4         0.2    setosa
#> --                --          --           --          --        --
#> 1499996          6.7         3.0          5.2         2.3 virginica
#> 1499997          6.3         2.5          5.0         1.9 virginica
#> 1499998          6.5         3.0          5.2         2.0 virginica
#> 1499999          6.2         3.4          5.4         2.3 virginica
#> 1500000          5.9         3.0          5.1         1.8 virginica
summary_fst(ft)
#> <fst file>
#> 1500000 rows, 5 columns (dt6ff4325f5974.fst)
#> 
#> * 'Sepal.Length': double
#> * 'Sepal.Width' : double
#> * 'Petal.Length': double
#> * 'Petal.Width' : double
#> * 'Species'     : factor

# list the variables in the environment
ls() # only the ft exists
#> [1] "ft"

The as_fst could save any data.frame as “.fst” file in temporary file and parse it back as fst table. Fst table is small in RAM, but if you want to get any part of the data.frame, you can get it in almost no time:

ft %>%
  slice_fst(5555:6666)  # get 5555 to 6666 row
#>       Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#>              <num>       <num>        <num>       <num>     <fctr>
#>    1:          5.0         3.6          1.4         0.2     setosa
#>    2:          5.4         3.9          1.7         0.4     setosa
#>    3:          4.6         3.4          1.4         0.3     setosa
#>    4:          5.0         3.4          1.5         0.2     setosa
#>    5:          4.4         2.9          1.4         0.2     setosa
#>   ---                                                             
#> 1108:          5.9         3.0          4.2         1.5 versicolor
#> 1109:          6.0         2.2          4.0         1.0 versicolor
#> 1110:          6.1         2.9          4.7         1.4 versicolor
#> 1111:          5.6         2.9          3.6         1.3 versicolor
#> 1112:          6.7         3.1          4.4         1.4 versicolor

Except for slice_fst, there are also other functions for subsetting the data, such as select_fst,filter_fst. Good practice is: Make subsets of the data and use the least needy data to do operations. For very large data sets, you may try to do tests on a sample of the data (using slice or select to get several rows or columns) first before you implement a huge operation. Now let’s do a slightly complex manipulation. We’ll use sys_time_print to measure the running time.


sys_time_print({
  res =  ft %>%
   select_fst(Species,Sepal.Length,Sepal.Width) %>%
   rename(group = Species,sl = Sepal.Length,sw = Sepal.Width) %>%
   arrange(group,sl) %>%
   filter(sl > 5) %>%
   distinct(sl,.keep_all = TRUE) %>%
   summarise(sw = max(sw),by = group)
})
#> [1] "Finished in 0.320s elapsed (0.140s cpu)"

res
#>         group    sw
#>        <fctr> <num>
#> 1:     setosa   4.4
#> 2: versicolor   3.3
#> 3:  virginica   3.8

This should be pretty fast. Becasue when we use the data in fst table, we never get them until using the “_fst” suffix functions, so the tidyft functions never modify the data in the fst file or fst table. That is to say, we do not have to worry about the modification by reference any more. No copies made, fastest ever.

Performance

The fst workflow could also be working with other tools, though less efficient. Now let’s compare the performance of tidyft, data.table, dtplyr and dplyr.


rm(list = ls())

library(data.table)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#> 
#>     between, first, last
#> The following objects are masked from 'package:tidyft':
#> 
#>     add_count, anti_join, arrange, count, cummean, distinct, filter,
#>     full_join, group_by, groups, inner_join, lag, lead, left_join,
#>     mutate, nth, pull, relocate, rename, right_join, select,
#>     select_vars, semi_join, slice, slice_head, slice_max, slice_min,
#>     slice_sample, slice_tail, summarise, transmute, ungroup
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dtplyr)
library(tidyft)


# make a large data.frame
iris[rep(1:nrow(iris),1e4),] -> dt
# size: 1500000 rows, 5 columns
dim(dt)
#> [1] 1500000       5
# save as fst table
as_fst(dt) -> ft
# remove the data.frame from RAM
rm(dt)


bench::mark(

  dplyr = ft %>%
    select_fst(Species,Sepal.Length,Sepal.Width,Petal.Length) %>%
    dplyr::select(-Petal.Length) %>%
    dplyr::rename(group = Species,sl = Sepal.Length,sw = Sepal.Width) %>%
    dplyr::arrange(group,sl) %>%
    dplyr::filter(sl > 5) %>%
    dplyr::distinct(sl,.keep_all = TRUE) %>%
    dplyr::group_by(group) %>%
    dplyr::summarise(sw = max(sw)),

  dtplyr = ft %>%
    select_fst(Species,Sepal.Length,Sepal.Width,Petal.Length) %>%
    lazy_dt() %>%
    dplyr::select(-Petal.Length) %>%
    dplyr::rename(group = Species,sl = Sepal.Length,sw = Sepal.Width) %>%
    dplyr::arrange(group,sl) %>%
    dplyr::filter(sl > 5) %>%
    dplyr::distinct(sl,.keep_all = TRUE) %>%
    dplyr::group_by(group) %>%
    dplyr::summarise(sw = max(sw)) %>%
    as.data.table(),

  data.table = ft[,c("Species","Sepal.Length","Sepal.Width","Petal.Length")] %>%
    setDT() %>%
    .[,.SD,.SDcols = -"Petal.Length"] %>%
    setnames(old =c("Species","Sepal.Length","Sepal.Width"),
             new = c("group","sl","sw")) %>%
    setorder(group,sl) %>%
    .[sl>5] %>% unique(by = "sl") %>%
    .[,.(sw = max(sw)),by = group],


  tidyft =  ft %>%
    tidyft::select_fst(Species,Sepal.Length,Sepal.Width,Petal.Length) %>%
    tidyft::select(-Petal.Length) %>%
    tidyft::rename(group = Species,sl = Sepal.Length,sw = Sepal.Width) %>%
    tidyft::arrange(group,sl) %>%
    tidyft::filter(sl > 5) %>%
    tidyft::distinct(sl,.keep_all = TRUE) %>%
    tidyft::summarise(sw = max(sw),by = group),

  check = setequal
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr       151.3ms    159ms      6.30     180MB     22.0
#> 2 dtplyr      181.6ms    188ms      4.63     201MB     20.1
#> 3 data.table  102.8ms    114ms      8.07     130MB     14.5
#> 4 tidyft       91.7ms    102ms      8.72     102MB     14.0

Because tidyft is based on data.table, therefore, if you always use data.table correctly, then tidyft should not perform better than data.table (I do use some tricks, by never do column selection but delete the unselected ones instead, which is faster and more memory efficient than using .SDcols in data.table). However, tidyft has a very different syntax, which might be more readable. And lots of complex operations of data.table has been wrapped in it. This could save your day to write the correct codes sometimes. I hope all my time devoted to this work could possibly save some of your valuable time on data operations of big datasets.

Session Information

sessionInfo()
#> R version 4.4.1 (2024-06-14 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 22631)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=C                               
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C                               
#> [5] LC_TIME=Chinese (Simplified)_China.utf8    
#> 
#> time zone: Asia/Shanghai
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dtplyr_1.3.1      dplyr_1.1.4       data.table_1.16.0 fstcore_0.9.18   
#> [5] tidyft_0.9.20    
#> 
#> loaded via a namespace (and not attached):
#>  [1] jsonlite_1.8.8    compiler_4.4.1    crayon_1.5.3      tidyselect_1.2.1 
#>  [5] Rcpp_1.0.13       stringr_1.5.1     parallel_4.4.1    jquerylib_0.1.4  
#>  [9] yaml_2.3.9        fastmap_1.2.0     R6_2.5.1          generics_0.1.3   
#> [13] knitr_1.48        tibble_3.2.1      bslib_0.7.0       pillar_1.9.0     
#> [17] rlang_1.1.4       utf8_1.2.4        cachem_1.1.0      stringi_1.8.4    
#> [21] xfun_0.46         sass_0.4.9        cli_3.6.3         withr_3.0.1      
#> [25] magrittr_2.0.3    digest_0.6.36     rstudioapi_0.16.0 fst_0.9.8        
#> [29] lifecycle_1.0.4   vctrs_0.6.5       bench_1.1.3       evaluate_0.24.0  
#> [33] glue_1.7.0        profmem_0.6.0     fansi_1.0.6       rmarkdown_2.27   
#> [37] tools_4.4.1       pkgconfig_2.0.3   htmltools_0.5.8.1

Fastest data operations with least memory in tidy syntax

Why tidyft?

The philosophy of tidyft

Working with fst

Performance

Session Information