Validation

Evgeni Chasnovski

2023-03-28

This vignette describes the actual validation step (called ‘exposure’) of the ruler workflow and shows some examples of what one can do with validation results. The packs defined in the vignette about rule packs are used throughout.

Exposure

Overview

Exposing data to rules means applying rule packs to data, collecting the results in a common format and attaching them to the data as an exposure attribute. This way, exposure can be done in multiple steps and can also be part of a general data preparation pipeline.

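After attaching exposure to a data frame, one can extract information from it with get_exposure() (the whole exposure), get_packs_info() (information about the applied packs) and get_report() (the tidy data validation report).

To expose data to rules, use expose(). A simple example: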

mtcars %>%
  expose(my_group_packs) %>%
  get_exposure()
#>   Exposure
#> 
#> Packs info:
#> # A tibble: 1 × 4
#>   name          type       fun        remove_obeyers
#>   <chr>         <chr>      <list>     <lgl>         
#> 1 group_pack__1 group_pack <grop_pck> TRUE          
#> 
#> Tidy data validation report:
#> # A tibble: 2 × 5
#>   pack          rule      var      id value
#>   <chr>         <chr>     <chr> <int> <lgl>
#> 1 group_pack__1 any_cyl_6 0.0       0 FALSE
#> 2 group_pack__1 any_cyl_6 1.1       0 FALSE
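
The individual parts of an exposure can also be pulled out separately. The following is a minimal, self-contained sketch. The pack size_check and its rule nrow_low are hypothetical names made up for illustration; they are not among the packs from the rule packs vignette.

```r
library(dplyr)
library(ruler)

# Hypothetical data pack with a single rule: does the data have more than 50 rows?
size_packs <- data_packs(
  size_check = . %>% summarise(nrow_low = nrow(.) > 50)
)

exposed <- mtcars %>% expose(size_packs)

# Information about the applied packs (one row per pack)
packs_info <- get_packs_info(exposed)

# Tidy data validation report (breakers only, since obeyers are removed by default)
report <- get_report(exposed)

print(packs_info)
print(report)
```

Since mtcars has 32 rows, the single rule is broken, so the report should contain one entry for it.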

Don’t remove obeyers

By default, exposing removes obeyers (entries that follow a rule) from the report. To keep them, set .remove_obeyers to FALSE.

mtcars %>%
  expose(my_group_packs, .remove_obeyers = FALSE) %>%
  get_exposure()
#>   Exposure
#> 
#> Packs info:
#> # A tibble: 1 × 4
#>   name          type       fun        remove_obeyers
#>   <chr>         <chr>      <list>     <lgl>         
#> 1 group_pack__1 group_pack <grop_pck> FALSE         
#> 
#> Tidy data validation report:
#> # A tibble: 4 × 5
#>   pack          rule      var      id value
#>   <chr>         <chr>     <chr> <int> <lgl>
#> 1 group_pack__1 any_cyl_6 0.0       0 FALSE
#> 2 group_pack__1 any_cyl_6 0.1       0 TRUE 
#> 3 group_pack__1 any_cyl_6 1.0       0 TRUE 
#> 4 group_pack__1 any_cyl_6 1.1       0 FALSE

Set pack name

Notice the imputed group pack name group_pack__1. To change it, either set a name when creating the pack with group_packs() or supply a name in the call to expose():

mtcars %>%
  expose(new_group_pack = my_group_packs[[1]]) %>%
  get_report()
#> Tidy data validation report:
#> # A tibble: 2 × 5
#>   pack           rule      var      id value
#>   <chr>          <chr>     <chr> <int> <lgl>
#> 1 new_group_pack any_cyl_6 0.0       0 FALSE
#> 2 new_group_pack any_cyl_6 1.1       0 FALSE

Expose step by step

One can expose data to several packs at once or do it step by step:

mtcars_one_step <- mtcars %>%
  expose(my_data_packs, my_col_packs)

mtcars_two_step <- mtcars %>%
  expose(my_data_packs) %>%
  expose(my_col_packs)

identical(mtcars_one_step, mtcars_two_step)
#> [1] TRUE

Guessing

By default, expose() guesses which type of pack a function represents (if the type is not set manually). This is useful for interactive experiments. The guess is based on features of the pack’s output structure (see ?expose for more details).

mtcars %>%
  expose(some_data_pack = . %>% summarise(nrow = nrow(.) == 10)) %>%
  get_exposure()
#>   Exposure
#> 
#> Packs info:
#> # A tibble: 1 × 4
#>   name           type      fun        remove_obeyers
#>   <chr>          <chr>     <list>     <lgl>         
#> 1 some_data_pack data_pack <data_pck> TRUE          
#> 
#> Tidy data validation report:
#> # A tibble: 1 × 5
#>   pack           rule  var      id value
#>   <chr>          <chr> <chr> <int> <lgl>
#> 1 some_data_pack nrow  .all      0 FALSE

However, guessing has some edge cases (especially for group packs). To write strict and robust code, use the .guess = FALSE option. A pack whose type is not set explicitly then raises an error:

mtcars %>%
  expose(some_data_pack = . %>% summarise(nrow = nrow(.) == 10),
    .guess = FALSE)
#> Error in expose_single.default(X[[i]], ...): There is unsupported class of rule pack.
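
With .guess = FALSE, the pack type has to be declared explicitly. As a sketch under that assumption, wrapping the same function in data_packs() (ruler's creation function for data packs) should let the strict call succeed:

```r
library(dplyr)
library(ruler)

strict_report <- mtcars %>%
  expose(
    # Declaring the type with data_packs() means no guessing is needed
    data_packs(some_data_pack = . %>% summarise(nrow = nrow(.) == 10)),
    .guess = FALSE
  ) %>%
  get_report()

print(strict_report)
```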

Using different rule separator

If a non-default rule separator was used in rules(), account for it with the .rule_sep argument. It takes a regular expression describing the separator. Note that by default it is the string ‘._.’ surrounded by any number of non-alphanumeric characters (built with inside_punct()). This accounts for dplyr’s default separator _.

regular_col_packs <- col_packs(
  . %>% summarise_all(rules(mean(.) > 1))
)

irregular_col_packs <- col_packs(
  . %>% summarise_all(rules(mean(.) > 1, .prefix = "a_a_"))
)

regular_report <- mtcars %>%
  expose(regular_col_packs) %>%
  get_report()

irregular_report <- mtcars %>%
  expose(irregular_col_packs, .rule_sep = inside_punct("a_a_")) %>%
  get_report()

identical(regular_report, irregular_report)
#> [1] TRUE

# Note suffix '_' after column names
mtcars %>%
  expose(irregular_col_packs, .rule_sep = "a_a_") %>%
  get_report()
#> Tidy data validation report:
#> # A tibble: 2 × 5
#>   pack        rule    var      id value
#>   <chr>       <chr>   <chr> <int> <lgl>
#> 1 col_pack__1 rule__1 vs_       0 FALSE
#> 2 col_pack__1 rule__1 am_       0 FALSE

Acting after exposure

General actions

With exposure attached to data one can perform different kinds of actions: exploration, assertion, imputation and so on.

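The recommended way to perform general actions is act_after_exposure(). It takes two arguments: .trigger, a function that takes the data with attached exposure and notifies (by returning TRUE) that an action should be taken, and .actor, a function that takes the same data and performs the action.

If the trigger didn’t notify, the input data is returned untouched. Otherwise the output of .actor() is returned. Note that act_after_exposure() is often used for creating side effects (printing, throwing an error, etc.), and in that case the actor should invisibly return its input (so that it can be used with the pipe %>%).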

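# Trigger: notify when more than one pack was applied
trigger_one_pack <- function(.tbl) {
  packs_number <- .tbl %>%
    get_packs_info() %>%
    nrow()

  packs_number > 1
}

# Actor: print a message and invisibly return the input to keep the pipe going
actor_one_pack <- function(.tbl) {
  cat("More than one pack was applied.\n")

  invisible(.tbl)
}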

mtcars %>%
  expose(my_col_packs, my_row_packs) %>%
  act_after_exposure(
    .trigger = trigger_one_pack,
    .actor = actor_one_pack
  ) %>%
  invisible()
#> More than one pack was applied.
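
As another sketch of the same pattern, a trigger can inspect the report itself and an actor can emit a message instead of printing. The helper names has_breaker() and warn_about_breakers(), as well as the pack size_check, are hypothetical and used only for illustration. The sketch relies on the value column of the report being FALSE for breakers, as in the outputs above.

```r
library(dplyr)
library(ruler)

# Hypothetical trigger: TRUE if the report contains at least one breaker
has_breaker <- function(.tbl) {
  any(!get_report(.tbl)$value)
}

# Hypothetical actor: report the number of breakers, then keep the pipe going
warn_about_breakers <- function(.tbl) {
  n_breakers <- sum(!get_report(.tbl)$value)
  message("Found ", n_breakers, " rule breaker(s).")

  invisible(.tbl)
}

# Hypothetical data pack whose only rule is broken by mtcars (32 rows)
exposed <- mtcars %>%
  expose(data_packs(size_check = . %>% summarise(nrow_low = nrow(.) > 50)))

checked <- exposed %>%
  act_after_exposure(.trigger = has_breaker, .actor = warn_about_breakers)
```

Because the actor returns its input invisibly, checked can be passed on to further pipeline steps.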

Assert presence of rule breaker

ruler has the function assert_any_breaker(), which notifies about the presence of any breaker in the exposure. In the example below it prints the breakers report and throws an error.

mtcars %>%
  expose(my_col_packs, my_row_packs) %>%
  assert_any_breaker()
#>   Breakers report
#> Tidy data validation report:
#> # A tibble: 4 × 5
#>   pack          rule               var      id value
#>   <chr>         <chr>              <chr> <int> <lgl>
#> 1 my_col_pack_1 mean_low           vs        0 FALSE
#> 2 my_col_pack_1 mean_low           am        0 FALSE
#> 3 col_pack__2   rule__1            vs        0 FALSE
#> 4 my_row_pack_1 is_common_row_mean .all     15 FALSE
#> Error: assert_any_breaker: Some breakers found in exposure.