Joining verbs for data.table

Almost all joins between 2 data.tables use a notation where one of them is used as i in a frame applied to the other, and the joining columns are specified with the on parameter. However, in addition to the “basic” joins, data.table allows for special cases like rolling joins, summarizing while joining, non-equi joins, etc. This vignette will describe the notation to apply these joins with verbs defined in table.express, which, like the single-table verbs, build data.table expressions.

Basic joins

We’ll consider most of the dplyr joining verbs in this section:

inner_join
left_join
right_join
anti_join
semi_join
full_join

A <- data.table::data.table(x = rep(c("b", "a", "c"), each = 3),
                            y = c(1, 3, 6),
                            v = 1:9)

B <- data.table::data.table(x = c("c", "b"),
                            v2 = 8:7,
                            foo = c(4, 2))

A

#>    x y v
#> 1: b 1 1
#> 2: b 3 2
#> 3: b 6 3
#> 4: a 1 4
#> 5: a 3 5
#> 6: a 6 6
#> 7: c 1 7
#> 8: c 3 8
#> 9: c 6 9

#>    x v2 foo
#> 1: c  8   4
#> 2: b  7   2

The methods defined in table.express accept the on part of the expression in their ellipsis:

A %>%
    inner_join(B, x)

#>    x y v v2 foo
#> 1: c 1 7  8   4
#> 2: c 3 8  8   4
#> 3: c 6 9  8   4
#> 4: b 1 1  7   2
#> 5: b 3 2  7   2
#> 6: b 6 3  7   2

A %>%
    inner_join(B, x, v = v2)

#>    x y v foo
#> 1: c 3 8   4

An important thing to note in the second example above is the order in which the columns are given, i.e. that v is written before v2, since the order is relevant for data.table. We can remember the correct order simply by looking at which data.table appears first in the expression, and knowing that said data.table’s columns must appear first in the on expressions. In this case, A appears before B, so writing v2 = v would not work.

In order to maintain consistency in the ordering just described, left_join automatically swaps expression elements internally in order to build the expression:

A %>%
    start_expr %>%
    left_join(B, x, v = v2) %T>%
    print %>%
    end_expr

#> .DT_[.DT_0_, on = list(x, v2 = v), nomatch = , mult = , roll = , 
#>     rollends = ]

#>    x v2 foo y
#> 1: b  1  NA 1
#> 2: b  2  NA 3
#> 3: b  3  NA 6
#> 4: a  4  NA 1
#> 5: a  5  NA 3
#> 6: a  6  NA 6
#> 7: c  7  NA 1
#> 8: c  8   4 3
#> 9: c  9  NA 6

We can also see an extra .DT_0_ pronoun in the expression. These special .DT_*_ pronouns hold the different data.tables that have entered the expression so far, and are automatically assigned to the evaluation’s environment. In this case, .DT_ holds B and .DT_0_ holds A.

No additional considerations are required to use right_join or anti_join:

A %>%
    right_join(B, x, v = v2)

#>    x  y v foo
#> 1: c  3 8   4
#> 2: b NA 7   2

A %>%
    anti_join(B, x, v = v2)

#>    x y v
#> 1: b 1 1
#> 2: b 3 2
#> 3: b 6 3
#> 4: a 1 4
#> 5: a 3 5
#> 6: a 6 6
#> 7: c 1 7
#> 8: c 6 9

A semi_join is essentially a right_join with nomatch = NULL, and where j is set to unique(.SD):

setnames(B, "v2", "v")

A %>%
    semi_join(B, x)

#>    x y v
#> 1: c 1 7
#> 2: c 3 8
#> 3: c 6 9
#> 4: b 1 1
#> 5: b 3 2
#> 6: b 6 3

Finally, full_join is basically a wrapper for merge specifying all = TRUE:

A %>%
    full_join(B, x)

#>    x y v.x v.y foo
#> 1: a 1   4  NA  NA
#> 2: a 3   5  NA  NA
#> 3: a 6   6  NA  NA
#> 4: b 1   1   7   2
#> 5: b 3   2   7   2
#> 6: b 6   3   7   2
#> 7: c 1   7   8   4
#> 8: c 3   8   8   4
#> 9: c 6   9   8   4

Expression chaining

Something to keep in mind is that most joins specify i and on inside the frame, so any subsequent verbs that specify j, by, or keyby would still be possible. In order to enable this, many joining verbs have an .expr parameter that indicates that the expression should be kept, delaying evaluation until a verb that sets j is used. This can be useful if only a subset of the joined columns is desired:

A %>%
    left_join(B, x, .expr = TRUE) %>%
    select(x, y, foo)

#>    x y foo
#> 1: b 1   2
#> 2: b 3   2
#> 3: b 6   2
#> 4: a 1  NA
#> 5: a 3  NA
#> 6: a 6  NA
#> 7: c 1   4
#> 8: c 3   4
#> 9: c 6   4

But, when working lazily, this would require explicit chaining for expressions that should be applied to the joined table:

A %>%
    start_expr %>%
    left_join(B, x) %>%
    chain %>%
    group_by(x) %>%
    mutate(y = cumsum(y)) %>%
    end_expr %>% {
        invisible(print(.))
    }

#>    x  v foo  y i.v
#> 1: b  7   2  1   1
#> 2: b  7   2  4   2
#> 3: b  7   2 10   3
#> 4: a NA  NA  1   4
#> 5: a NA  NA  4   5
#> 6: a NA  NA 10   6
#> 7: c  8   4  1   7
#> 8: c  8   4  4   8
#> 9: c  8   4 10   9

This is particularly important if the selection expressions call any function, e.g. tidyselect helpers or even : with non-numerics, because data.table does not support that kind of syntax for j in the same frame as a join.

Mutating join

A special data.table idiom is described here as an update join. In order to highlight the fact that it modifies the left-hand side table by reference, the mutate_join verb is defined in table.express. Said verb accepts the columns to be added in its .SDcols parameter, possibly with new names:

A %>%
    mutate_join(B, x, .SDcols = c("foo", v_from_B = "v"))

print(A)

#>    x y v foo v_from_B
#> 1: b 1 1   2        7
#> 2: b 3 2   2        7
#> 3: b 6 3   2        7
#> 4: a 1 4  NA       NA
#> 5: a 3 5  NA       NA
#> 6: a 6 6  NA       NA
#> 7: c 1 7   4        8
#> 8: c 3 8   4        8
#> 9: c 6 9   4        8

A particularity of this idiom is that the number of rows from the resulting join must match the left-hand side exactly or not at all, so this won’t work:

B %>%
    mutate_join(A, x, .SDcols = "y")

In these cases, we must either use mult if appropriate, or specify a summarizing expression in .SDcols:

B %>%
    mutate_join(A, x, .SDcols = "y", mult = "first")

print(B)

#>    x v foo y
#> 1: c 8   4 1
#> 2: b 7   2 1

B %>%
    mutate_join(A, x, .SDcols = .(y = mean(y)))

print(B)

#>    x v foo        y
#> 1: c 8   4 3.333333
#> 2: b 7   2 3.333333

The last example specifies by = .EACHI in the joining expression.

Rolling joins

A nice blog post describing rolling joins can be found at R-bloggers, so almost the same website and paypal tables will be used for the examples below. Another short description with animated depictions can also be found here.

print(website)

#>        name  session_start_time session_id
#>  1:   Erica 2016-01-04 19:12:00          1
#>  2:   Erica 2016-01-04 21:05:00          2
#>  3: Francis 2016-01-02 13:09:00          3
#>  4: Francis 2016-01-03 19:22:00          4
#>  5: Francis 2016-01-08 08:44:00          5
#>  6: Francis 2016-01-08 20:22:00          6
#>  7: Francis 2016-01-10 17:36:00          7
#>  8: Francis 2016-01-15 16:56:00          8
#>  9:  Isabel 2016-01-01 11:01:00          9
#> 10:  Isabel 2016-01-02 08:59:00         10
#> 11:  Isabel 2016-01-05 18:18:00         11
#> 12:  Isabel 2016-01-07 19:03:00         12
#> 13:  Isabel 2016-01-08 19:01:00         13
#> 14:   Sally 2016-01-03 10:00:00         14
#> 15:  Vivian 2016-01-01 09:10:00         15
#> 16:  Vivian 2016-01-09 02:15:00         16

print(paypal)

#>       name       purchase_time payment_id
#> 1:   Erica 2016-01-03 08:02:00          1
#> 2: Francis 2016-01-03 19:28:00          2
#> 3: Francis 2016-01-08 20:33:00          3
#> 4: Francis 2016-01-10 17:46:00          4
#> 5:  Isabel 2016-01-08 19:10:00          5
#> 6:     Mom 2015-12-02 17:58:00          6
#> 7:   Sally 2016-01-03 10:06:00          7
#> 8:   Sally 2016-01-03 10:15:00          8

In contrast to the blog post, no join_time is added to the tables. This is done on purpose in order to show what happens with the columns that are rolled.

Let’s use a left rolling join to obtain the session_id that immediately preceded a purchase, if any:

paypal %>%
    left_join(website, name, purchase_time = session_start_time, roll = Inf)

#>       name  session_start_time session_id payment_id
#> 1:   Erica 2016-01-03 08:02:00         NA          1
#> 2: Francis 2016-01-03 19:28:00          4          2
#> 3: Francis 2016-01-08 20:33:00          6          3
#> 4: Francis 2016-01-10 17:46:00          7          4
#> 5:  Isabel 2016-01-08 19:10:00         13          5
#> 6:     Mom 2015-12-02 17:58:00         NA          6
#> 7:   Sally 2016-01-03 10:06:00         14          7
#> 8:   Sally 2016-01-03 10:15:00         14          8

We can see that the rows returned are from the left-hand side (paypal), and since neither Mom nor Erica visited the website before their purchases, their session_id ended as NA.

The order of the columns in the on expressions is the same as above. The tricky part is that the rolled column ended up with the name from the right-hand side, but keeping the values from the left-hand side. If we “invert” the join, the result is the same, but the rolled column’s name is now from the expression’s left-hand side.

website %>%
    right_join(paypal, name, session_start_time = purchase_time, roll = Inf)

#>       name  session_start_time session_id payment_id
#> 1:   Erica 2016-01-03 08:02:00         NA          1
#> 2: Francis 2016-01-03 19:28:00          4          2
#> 3: Francis 2016-01-08 20:33:00          6          3
#> 4: Francis 2016-01-10 17:46:00          7          4
#> 5:  Isabel 2016-01-08 19:10:00         13          5
#> 6:     Mom 2015-12-02 17:58:00         NA          6
#> 7:   Sally 2016-01-03 10:06:00         14          7
#> 8:   Sally 2016-01-03 10:15:00         14          8

Note, however, that roll stayed equal to Inf. This is because even though the column order in the expressions changed, we could understand the rolling expressions as follows:

For left_join, the rolling column on the left is purchase_time, so with roll = Inf, the values from session_start_time are rolled forward onto purchase_time to find a match while joining.
For right_join, the rolling column on the right is purchase_time, so roll must stay as Inf to keep the same semantics.

Now let’s say we want to keep all the rows from website and find the closest payment_id that occurred after the visit. This could be expressed as:

website %>%
    left_join(paypal, name, session_start_time = purchase_time, roll = -Inf)

#>        name       purchase_time payment_id session_id
#>  1:   Erica 2016-01-04 19:12:00         NA          1
#>  2:   Erica 2016-01-04 21:05:00         NA          2
#>  3: Francis 2016-01-02 13:09:00          2          3
#>  4: Francis 2016-01-03 19:22:00          2          4
#>  5: Francis 2016-01-08 08:44:00          3          5
#>  6: Francis 2016-01-08 20:22:00          3          6
#>  7: Francis 2016-01-10 17:36:00          4          7
#>  8: Francis 2016-01-15 16:56:00         NA          8
#>  9:  Isabel 2016-01-01 11:01:00          5          9
#> 10:  Isabel 2016-01-02 08:59:00          5         10
#> 11:  Isabel 2016-01-05 18:18:00          5         11
#> 12:  Isabel 2016-01-07 19:03:00          5         12
#> 13:  Isabel 2016-01-08 19:01:00          5         13
#> 14:   Sally 2016-01-03 10:00:00          7         14
#> 15:  Vivian 2016-01-01 09:10:00         NA         15
#> 16:  Vivian 2016-01-09 02:15:00         NA         16

In order to simplify the meaning of rollends a bit, we could think of it as missing or being a single TRUE/FALSE. If it’s missing, rolling works according to the value of roll, otherwise:

When rollends = TRUE, the value of roll is inverted only for those rows that would have no match otherwise.
When rollends = FALSE, a matching roll will only occur if the column’s value falls in a gap with values both before and after.

website %>%
    left_join(paypal, name, session_start_time = purchase_time, roll = -Inf, rollends = TRUE)

#>        name       purchase_time payment_id session_id
#>  1:   Erica 2016-01-04 19:12:00          1          1
#>  2:   Erica 2016-01-04 21:05:00          1          2
#>  3: Francis 2016-01-02 13:09:00          2          3
#>  4: Francis 2016-01-03 19:22:00          2          4
#>  5: Francis 2016-01-08 08:44:00          3          5
#>  6: Francis 2016-01-08 20:22:00          3          6
#>  7: Francis 2016-01-10 17:36:00          4          7
#>  8: Francis 2016-01-15 16:56:00          4          8
#>  9:  Isabel 2016-01-01 11:01:00          5          9
#> 10:  Isabel 2016-01-02 08:59:00          5         10
#> 11:  Isabel 2016-01-05 18:18:00          5         11
#> 12:  Isabel 2016-01-07 19:03:00          5         12
#> 13:  Isabel 2016-01-08 19:01:00          5         13
#> 14:   Sally 2016-01-03 10:00:00          7         14
#> 15:  Vivian 2016-01-01 09:10:00         NA         15
#> 16:  Vivian 2016-01-09 02:15:00         NA         16

website %>%
    left_join(paypal, name, session_start_time = purchase_time, roll = -Inf, rollends = FALSE)

#>        name       purchase_time payment_id session_id
#>  1:   Erica 2016-01-04 19:12:00         NA          1
#>  2:   Erica 2016-01-04 21:05:00         NA          2
#>  3: Francis 2016-01-02 13:09:00         NA          3
#>  4: Francis 2016-01-03 19:22:00         NA          4
#>  5: Francis 2016-01-08 08:44:00          3          5
#>  6: Francis 2016-01-08 20:22:00          3          6
#>  7: Francis 2016-01-10 17:36:00          4          7
#>  8: Francis 2016-01-15 16:56:00         NA          8
#>  9:  Isabel 2016-01-01 11:01:00         NA          9
#> 10:  Isabel 2016-01-02 08:59:00         NA         10
#> 11:  Isabel 2016-01-05 18:18:00         NA         11
#> 12:  Isabel 2016-01-07 19:03:00         NA         12
#> 13:  Isabel 2016-01-08 19:01:00         NA         13
#> 14:   Sally 2016-01-03 10:00:00         NA         14
#> 15:  Vivian 2016-01-01 09:10:00         NA         15
#> 16:  Vivian 2016-01-09 02:15:00         NA         16

Vivian’s payment_ids are always NA because she has never purchased anything. On the other hand, no one except Francis has visited the website both before and after a purchase.

Non-equi joins

Non-equi joins are similar to rolling joins, but instead of rolling a single row’s value, they can return several values per row.

Using the same data as before, we could find all the session_ids that preceded a payment_id, giving “priority” to paypal’s rows:

paypal %>%
    left_join(website, name, purchase_time >= session_start_time)

#>        name  session_start_time session_id payment_id
#>  1:   Erica 2016-01-03 08:02:00         NA          1
#>  2: Francis 2016-01-03 19:28:00          3          2
#>  3: Francis 2016-01-03 19:28:00          4          2
#>  4: Francis 2016-01-08 20:33:00          3          3
#>  5: Francis 2016-01-08 20:33:00          4          3
#>  6: Francis 2016-01-08 20:33:00          5          3
#>  7: Francis 2016-01-08 20:33:00          6          3
#>  8: Francis 2016-01-10 17:46:00          3          4
#>  9: Francis 2016-01-10 17:46:00          4          4
#> 10: Francis 2016-01-10 17:46:00          5          4
#> 11: Francis 2016-01-10 17:46:00          6          4
#> 12: Francis 2016-01-10 17:46:00          7          4
#> 13:  Isabel 2016-01-08 19:10:00          9          5
#> 14:  Isabel 2016-01-08 19:10:00         10          5
#> 15:  Isabel 2016-01-08 19:10:00         11          5
#> 16:  Isabel 2016-01-08 19:10:00         12          5
#> 17:  Isabel 2016-01-08 19:10:00         13          5
#> 18:     Mom 2015-12-02 17:58:00         NA          6
#> 19:   Sally 2016-01-03 10:06:00         14          7
#> 20:   Sally 2016-01-03 10:15:00         14          8

Priority above simply means that all rows from paypal are returned, even if they don’t have a match in website. Even though a column session_start_time appears in the result, the values contained therein are from paypal’s purchase_time.

A corresponding right non-equi join would yield the same result, expecting only a different order in the columns that are part of the comparisons:

website %>%
    right_join(paypal, name, session_start_time <= purchase_time)

#>        name  session_start_time session_id payment_id
#>  1:   Erica 2016-01-03 08:02:00         NA          1
#>  2: Francis 2016-01-03 19:28:00          3          2
#>  3: Francis 2016-01-03 19:28:00          4          2
#>  4: Francis 2016-01-08 20:33:00          3          3
#>  5: Francis 2016-01-08 20:33:00          4          3
#>  6: Francis 2016-01-08 20:33:00          5          3
#>  7: Francis 2016-01-08 20:33:00          6          3
#>  8: Francis 2016-01-10 17:46:00          3          4
#>  9: Francis 2016-01-10 17:46:00          4          4
#> 10: Francis 2016-01-10 17:46:00          5          4
#> 11: Francis 2016-01-10 17:46:00          6          4
#> 12: Francis 2016-01-10 17:46:00          7          4
#> 13:  Isabel 2016-01-08 19:10:00          9          5
#> 14:  Isabel 2016-01-08 19:10:00         10          5
#> 15:  Isabel 2016-01-08 19:10:00         11          5
#> 16:  Isabel 2016-01-08 19:10:00         12          5
#> 17:  Isabel 2016-01-08 19:10:00         13          5
#> 18:     Mom 2015-12-02 17:58:00         NA          6
#> 19:   Sally 2016-01-03 10:06:00         14          7
#> 20:   Sally 2016-01-03 10:15:00         14          8

Self joins

In case a self join were necessary, perhaps while using a rolling or non-equi join, the way magrittr’s pipe handles the . outside of nested calls wouldn’t allow calling a joining verb with . both in x and y. To work around this, the following verbs default to an eager self join when y is missing:

full_join
left_join
mutate_join

As a somewhat contrived example, we could add a rolling count of weekly visits per user to the website data introduced above:

website %>%
    mutate(window_start = session_start_time - as.difftime(7, units = "days")) %>%
    mutate_join(, name, window_start <= session_start_time, session_start_time >= session_start_time,
                .SDcols = .(weekly_visits = .N),
                .by_each = TRUE) %>%
    mutate(window_start = NULL)

print(website)

#>        name  session_start_time session_id weekly_visits
#>  1:   Erica 2016-01-04 19:12:00          1             1
#>  2:   Erica 2016-01-04 21:05:00          2             2
#>  3: Francis 2016-01-02 13:09:00          3             1
#>  4: Francis 2016-01-03 19:22:00          4             2
#>  5: Francis 2016-01-08 08:44:00          5             3
#>  6: Francis 2016-01-08 20:22:00          6             4
#>  7: Francis 2016-01-10 17:36:00          7             4
#>  8: Francis 2016-01-15 16:56:00          8             3
#>  9:  Isabel 2016-01-01 11:01:00          9             1
#> 10:  Isabel 2016-01-02 08:59:00         10             2
#> 11:  Isabel 2016-01-05 18:18:00         11             3
#> 12:  Isabel 2016-01-07 19:03:00         12             4
#> 13:  Isabel 2016-01-08 19:01:00         13             4
#> 14:   Sally 2016-01-03 10:00:00         14             1
#> 15:  Vivian 2016-01-01 09:10:00         15             1
#> 16:  Vivian 2016-01-09 02:15:00         16             1