Pattern Matching

Flexible manipulation (extraction, summarization, diagnostics) of specific nodes from mcmc.list objects is one of the primary features of ‘postpack’. This flexibility is provided through the use of regular expressions. Regular expressions are a method of matching character strings using patterns. As an example, suppose you wished to extract all of the b0 parameters from the cjs object, which include b0[1], b0[2], b0[3], b0[4], and b0[5]. Without regular expressions, you would have to somehow state explictly that you want to extract each of these (perhaps using paste0("b0[", 1:5, "]")).

However, with regular expressions, you simply need to find a common feature of the strings you wish to match, and create a pattern that will exclude all strings that do not have that common feature. This capability greatly increases the ease and flexibility of performing tasks on specific nodes. You will need to learn some very basic regular expression syntax to get it to work properly.

Here is a very basic example. Begin by loading ‘postpack’ and the example mcmc.list:

library(postpack)
data(cjs)

Next, obtain a basic posterior summary for all nodes that contain the letter "B":

post_summ(cjs, params = "B", digits = 3)
##          B0 sig_B0    B1 sig_B1
## mean  1.590  0.844 0.415  0.243
## sd    0.491  0.619 0.213  0.262
## 50%   1.569  0.682 0.397  0.178
## 2.5%  0.551  0.277 0.062  0.006
## 97.5% 2.588  2.439 0.864  0.821

Because all of the node names contain a "B", they are matched and returned. So if you only wanted sig_B0 and sig_B1, you could pass params = "sig" as an argument to the ‘postpack’ function in question, to perform the action for only nodes that contain "sig":

diag_plots(cjs, params = "sig")

match_params()

Pattern matching is driven by the match_params() function (which is essentially a wrapper for stringr::str_detect()). Every function in ‘postpack’ that uses a param or params argument passes that argument to match_params() early on their calculations, so understanding the way it works is important.

Suppose you want to summarize the random slope and intercept for year 2 (named "b0[2]" and "b1[2]" in the output). Use match_params() to help you get the right regular expression code by checking what will be matched when you try to subset your output:

match_params(cjs, "2")
## [1] "b0[2]"    "b1[2]"    "SIG[2,1]" "SIG[1,2]" "SIG[2,2]" "p[2]"

All nodes with a "2" would be returned. You could exclude the "SIG" elements using:

match_params(cjs, "[2]")
## [1] "b0[2]" "b1[2]" "p[2]"

Automatic Escape of Brackets

Users already familiar with regular expressions will see the square brackets ("[" and "]") as special characters and may be confused why this works. When using regular expressions, normally "[" and "]" have a special meaning and users would need to escape them (i.e., "\\[" and "\\]") to allow them to be matched as plain text. However, because "[" and "]" are frequently part of node names and users will wish to use them in pattern matching more frequently than as special characters, ‘postpack’ automatically escapes "[" and "]" if they are not already escaped.

If you do not wish to automatically escape "[" and "]" to allow a regular expression where brackets serve as special characters like "[:alnum:]", you must include the auto_escape = FALSE argument:

match_params(cjs, "[:alnum:]", auto_escape = FALSE)
##  [1] "B0"       "sig_B0"   "B1"       "sig_B1"   "b0[1]"    "b0[2]"   
##  [7] "b0[3]"    "b0[4]"    "b0[5]"    "b1[1]"    "b1[2]"    "b1[3]"   
## [13] "b1[4]"    "b1[5]"    "SIG[1,1]" "SIG[2,1]" "SIG[1,2]" "SIG[2,2]"
## [19] "p[2]"     "p[3]"     "p[4]"

If you then also wish to match a "[" as plain text in the same regular expression, you can explicitly escape it:

match_params(cjs, "[:alnum:]\\[", auto_escape = FALSE)
##  [1] "b0[1]"    "b0[2]"    "b0[3]"    "b0[4]"    "b0[5]"    "b1[1]"   
##  [7] "b1[2]"    "b1[3]"    "b1[4]"    "b1[5]"    "SIG[1,1]" "SIG[2,1]"
## [13] "SIG[1,2]" "SIG[2,2]" "p[2]"     "p[3]"     "p[4]"

All functions that accept either param or params argument also accept the auto_escape argument, and it is always TRUE by default. For what it is worth, I have only rarely used this feature, but chose to include it to allow complete regular expression capabilities.

Wild Cards

With that out of the way, now further refine the regular expression to exclude the p[2] node from the output:

match_params(cjs, "b.[2]")
## [1] "b0[2]" "b1[2]"

The "." serves as a wild card, which says match any one character. If you would have supplied simply "b[2]", you would get an error:

## Error: 
##   Supplied value(s) of params ("b[2]") did not have any matches in the nodes stored in post.
##   All elements of params must have at least one match.
##   The base names of all monitored nodes are:
##     "B0", "sig_B0", "B1", "sig_B1", 
##     "b0", "b1", "SIG", and "p"

Which is what will happen when any element supplied to params does not match one of the node names stored in the output. The wild card is necessary to specify that some character (here a "0" or a "1") falls between the "b" and the "[" characters in your desired match. You can now pass "b.[2]" to post_summ() with confidence knowing that only the two coefficients you want will be returned:

post_summ(cjs, "b.[2]", digits = 3)
##       b0[2] b1[2]
## mean  2.319 0.423
## sd    0.356 0.218
## 50%   2.278 0.403
## 2.5%  1.784 0.046
## 97.5% 3.163 0.916

NOTE: Because "." is a special character (a wild card), if you wish to match it as plain text, you must escape it. For example if you have both nodes "a.b" and "a_b" in your mcmc.list object x, running match_params(x, "a.b") will match both "a.b" and "a_b". If you want only "a.b", you must run match_params(x, "a\\.b")

Anchors

As another example, instead of accessing all nodes that contain "B" in their name, suppose you wish to only extract those that contain "B" as the first character. You can force your match to start with certain patterns using the "^" symbol:

match_params(cjs, "^B")
## [1] "B0" "B1"

Likewise, you could extract only nodes that end with "0" using the "$" symbol:

match_params(cjs, "0$")
## [1] "B0"     "sig_B0"

Note that no elements of "b0" are returned, because these nodes end with "]" (e.g., "b0[1]").

Repetition

You can repeat a wild card or some other character one or more times using the "+" symbol:

match_params(cjs, "^s.+0$")
## [1] "sig_B0"

Which forces the match to start with an "s", end with a "0", and have anything in between.

More resources

Because match_params() is a wrapper for stringr::str_detect(), most everything that works there will also with with ‘postpack’. There is a comprehensive description of all regular expression syntax accepted by ‘stringr’ here (also accessible in R using vignette("regular-expressions", package = "stringr")).