Benchmarks

tidylog adds a small overhead to each function call. For instance, tidylog::filter has to work out how many rows were dropped, so it is a bit slower than calling dplyr::filter directly. The overhead is usually not noticeable, but it can be on larger datasets, especially with joins. The benchmarks below give an impression of how large it is.

library("dplyr")
library("tidylog", warn.conflicts = FALSE)
library("bench")
library("knitr")
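
To make the source of the overhead concrete, here is a minimal sketch of a logging wrapper around dplyr::filter. The function name logged_filter and the message format are hypothetical and chosen for illustration; this is not tidylog's actual implementation, only one way the extra row counting could look.

```r
library(dplyr)

# Hypothetical helper (not part of tidylog): wraps dplyr::filter and
# reports how many rows were dropped.
logged_filter <- function(.data, ...) {
    n_before <- nrow(.data)               # row count before filtering
    result <- dplyr::filter(.data, ...)   # the actual work, delegated to dplyr
    n_dropped <- n_before - nrow(result)  # extra bookkeeping done on every call
    message(sprintf(
        "filter: removed %d rows (%.0f%%), %d rows remaining",
        n_dropped, 100 * n_dropped / n_before, nrow(result)
    ))
    result
}

small <- logged_filter(mtcars, cyl == 4)
```

The wrapper returns the same data frame as dplyr::filter; the row counting and message formatting around it are the kind of work that shows up as overhead in the timings.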

filter

On a small dataset:

bench::mark(
    dplyr::filter(mtcars, cyl == 4),
    tidylog::filter(mtcars, cyl == 4),
    iterations = 100
) %>%
    dplyr::select(expression, min, median, n_itr) %>%
    kable()
| expression                        |    min | median | n_itr |
|:----------------------------------|-------:|-------:|------:|
| dplyr::filter(mtcars, cyl == 4)   | 1.81ms | 3.72ms |    98 |
| tidylog::filter(mtcars, cyl == 4) | 3.85ms | 6.49ms |    97 |

On a larger dataset:

df <- tibble(x = rnorm(100000))

bench::mark(
    dplyr::filter(df, x > 0),
    tidylog::filter(df, x > 0),
    iterations = 100
) %>%
    dplyr::select(expression, min, median, n_itr) %>%
    kable()
| expression                 |    min | median | n_itr |
|:---------------------------|-------:|-------:|------:|
| dplyr::filter(df, x > 0)   | 7.47ms | 13.4ms |    95 |
| tidylog::filter(df, x > 0) |  8.8ms | 12.9ms |    96 |

mutate

On a small dataset:

bench::mark(
    dplyr::mutate(mtcars, cyl = as.factor(cyl)),
    tidylog::mutate(mtcars, cyl = as.factor(cyl)),
    iterations = 100
) %>%
    dplyr::select(expression, min, median, n_itr) %>%
    kable()
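| expression                                    |    min | median | n_itr |
|:----------------------------------------------|-------:|-------:|------:|
| dplyr::mutate(mtcars, cyl = as.factor(cyl))   | 3.11ms | 5.63ms |    97 |
| tidylog::mutate(mtcars, cyl = as.factor(cyl)) | 5.31ms |  8.2ms |    94 |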

On a larger dataset:

df <- tibble(x = round(runif(10000) * 10))

bench::mark(
    dplyr::mutate(df, x = as.factor(x)),
    tidylog::mutate(df, x = as.factor(x)),
    iterations = 100
) %>%
    dplyr::select(expression, min, median, n_itr) %>%
    kable()
| expression                            |    min | median | n_itr |
|:--------------------------------------|-------:|-------:|------:|
| dplyr::mutate(df, x = as.factor(x))   | 15.4ms | 21.1ms |    95 |
| tidylog::mutate(df, x = as.factor(x)) | 19.7ms | 26.7ms |    93 |

joins

Joins are the most expensive operation, as tidylog has to do two additional joins behind the scenes.
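
As an illustration of why extra joins are costly, the sketch below pairs an inner join with two anti joins that exist only to count unmatched rows. The function name logged_inner_join and the choice of anti_join are assumptions made for this example; tidylog's own bookkeeping may be implemented differently.

```r
library(dplyr)

# Hypothetical helper (not tidylog's implementation): an inner join plus
# two extra joins that are used only for reporting.
logged_inner_join <- function(x, y, by) {
    result <- dplyr::inner_join(x, y, by = by)          # the join the caller asked for
    only_in_x <- nrow(dplyr::anti_join(x, y, by = by))  # extra join 1: rows of x without a match
    only_in_y <- nrow(dplyr::anti_join(y, x, by = by))  # extra join 2: rows of y without a match
    message(sprintf(
        "inner_join: %d rows only in x, %d rows only in y, %d rows in result",
        only_in_x, only_in_y, nrow(result)
    ))
    result
}

joined <- logged_inner_join(band_members, band_instruments, by = "name")
```

Each call performs three joins instead of one, so a logged join built this way would be expected to cost a multiple of the plain dplyr join rather than a small constant on top of it.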

On a small dataset:

bench::mark(
    dplyr::inner_join(band_members, band_instruments, by = "name"),
    tidylog::inner_join(band_members, band_instruments, by = "name"),
    iterations = 100
) %>%
    dplyr::select(expression, min, median, n_itr) %>%
    kable()
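| expression                                                       |     min | median | n_itr |
|:-----------------------------------------------------------------|--------:|-------:|------:|
| dplyr::inner_join(band_members, band_instruments, by = "name")   |  6.12ms |  8.9ms |    95 |
| tidylog::inner_join(band_members, band_instruments, by = "name") | 23.86ms | 34.9ms |    82 |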

On a larger dataset (with many duplicated keys, so each row matches many rows in the other table):

N <- 1000
df1 <- tibble(x1 = rnorm(N), key = round(runif(N) * 10))
df2 <- tibble(x2 = rnorm(N), key = round(runif(N) * 10))

bench::mark(
    dplyr::inner_join(df1, df2, by = "key"),
    tidylog::inner_join(df1, df2, by = "key"),
    iterations = 100
) %>%
    dplyr::select(expression, min, median, n_itr) %>%
    kable()
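| expression                                |    min | median | n_itr |
|:------------------------------------------|-------:|-------:|------:|
| dplyr::inner_join(df1, df2, by = "key")   | 11.8ms | 16.1ms |    91 |
| tidylog::inner_join(df1, df2, by = "key") |   31ms | 38.1ms |    83 |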