hablar

David Sjoberg

The mission of hablar is for you to get non-astonishing results! That means that functions return what you expect. R has some unintuitive quirks that both beginners and experienced programmers fail to identify. hablar addresses some of the first ones you are likely to meet, including missing values, type conversion and finding problems in your data.

hablar follows the syntax API of tidyverse and works seamlessly with dplyr and tidyselect.

Missing values that astonish you

A common issue in R is how R treats missing values (i.e. NA). Sometimes NA in your data frame means that there are missing values in the sense that you need to estimate or replace them with values. But often it is not a problem! Often NA means that there is no value, and that there should not be one. hablar provides useful functions that handle NA intuitively. Let’s take a simple example:
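
The code that creates the example data is not shown here. A minimal sketch that reproduces the values printed below, assuming a base R data.frame is an acceptable stand-in (the printed header suggests the original object is a tibble):

```r
# Assumption: rebuilds the values shown in the printed output.
# The original object is a tibble; a base data.frame is used here.
df <- data.frame(
  name = c("Fredrik", "Maria", "Astrid"),
  graduation_date = as.Date(c("2016-06-15", NA, "2014-06-15")),
  age = c(21L, 16L, 23L),
  stringsAsFactors = FALSE
)
df
```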

#> # A tibble: 3 × 3
#>   name    graduation_date   age
#>   <chr>   <date>          <int>
#> 1 Fredrik 2016-06-15         21
#> 2 Maria   NA                 16
#> 3 Astrid  2014-06-15         23

Change min() to min_()

The graduation_date is missing for Maria. In this case it is not because we do not know it. It is because she has not graduated yet: she is younger than Fredrik and Astrid. If we would like to know the first graduation date of the three observations, a naive min() in R returns NA. But with min_() from hablar we get the minimum value that is not missing. See:

df %>% 
  mutate(min_baseR = min(graduation_date),
         min_hablar = min_(graduation_date))
#> # A tibble: 3 × 5
#>   name    graduation_date   age min_baseR min_hablar
#>   <chr>   <date>          <int> <date>    <date>    
#> 1 Fredrik 2016-06-15         21 NA        2014-06-15
#> 2 Maria   NA                 16 NA        2014-06-15
#> 3 Astrid  2014-06-15         23 NA        2014-06-15
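
For this particular example, the value returned by min_() matches what base R gives when missing values are dropped explicitly. The following is a base R sketch of that comparison only, using the dates printed above. It is not a description of how min_() is implemented.

```r
graduation_date <- as.Date(c("2016-06-15", NA, "2014-06-15"))

# A naive min() propagates the missing value
naive <- min(graduation_date)

# Dropping NA explicitly gives the earliest known date
earliest <- min(graduation_date, na.rm = TRUE)

is.na(naive)
format(earliest)
```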

The hablar package provides the same functionality for other summary functions, such as max_(), mean_() and sum_(), and more. For more documentation type help(min_) or vignette("s") for an in-depth description.

Change type in a snap - safely

In hablar the function convert() provides a robust, readable and dynamic way to change the type of a column.

mtcars %>% 
  convert(int(cyl, am),
          num(disp:drat))
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

The above chunk converts the columns cyl and am to integers, and the columns disp through drat to numeric. If a column is a factor, convert() always turns it into character before any further conversion.
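
The factor-to-character step matters because base R converts a factor to numbers through its internal level codes, not through its labels. A small base R sketch of that pitfall, with values made up purely for illustration:

```r
# Hypothetical values, chosen only to show the difference
f <- factor(c("10", "20", "10"))

# Direct conversion returns the level codes, not the labels
codes <- as.numeric(f)

# Going through character first recovers the intended numbers
values <- as.numeric(as.character(f))

codes
values
```

Here codes holds 1, 2, 1 while values holds 10, 20, 10, which is the result most people expect.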

Fix all your types in the same function

With convert() and tidyselect you can easily change the type of a wide range of columns.

mtcars %>% 
  convert(
    chr(last_col()),       # Last column to character
    int(1:2),              # First two columns to integer
    fct(hp, wt),           # hp and wt to factors
    dte(vs),               # vs to date (if you really want)
    num(contains("car"))   # car as in carb to numeric
  )           
#>                     mpg cyl  disp  hp drat    wt  qsec         vs am gear carb
#> Mazda RX4            21   6 160.0 110 3.90  2.62 16.46 1970-01-01  1    4    4
#> Mazda RX4 Wag        21   6 160.0 110 3.90 2.875 17.02 1970-01-01  1    4    4
#> Datsun 710           22   4 108.0  93 3.85  2.32 18.61 1970-01-02  1    4    1
#> Hornet 4 Drive       21   6 258.0 110 3.08 3.215 19.44 1970-01-02  0    3    1
#> Hornet Sportabout    18   8 360.0 175 3.15  3.44 17.02 1970-01-01  0    3    2
#> Valiant              18   6 225.0 105 2.76  3.46 20.22 1970-01-02  0    3    1
#> Duster 360           14   8 360.0 245 3.21  3.57 15.84 1970-01-01  0    3    4
#> Merc 240D            24   4 146.7  62 3.69  3.19 20.00 1970-01-02  0    4    2
#> Merc 230             22   4 140.8  95 3.92  3.15 22.90 1970-01-02  0    4    2
#> Merc 280             19   6 167.6 123 3.92  3.44 18.30 1970-01-02  0    4    4
#> Merc 280C            17   6 167.6 123 3.92  3.44 18.90 1970-01-02  0    4    4
#> Merc 450SE           16   8 275.8 180 3.07  4.07 17.40 1970-01-01  0    3    3
#> Merc 450SL           17   8 275.8 180 3.07  3.73 17.60 1970-01-01  0    3    3
#> Merc 450SLC          15   8 275.8 180 3.07  3.78 18.00 1970-01-01  0    3    3
#> Cadillac Fleetwood   10   8 472.0 205 2.93  5.25 17.98 1970-01-01  0    3    4
#> Lincoln Continental  10   8 460.0 215 3.00 5.424 17.82 1970-01-01  0    3    4
#> Chrysler Imperial    14   8 440.0 230 3.23 5.345 17.42 1970-01-01  0    3    4
#> Fiat 128             32   4  78.7  66 4.08   2.2 19.47 1970-01-02  1    4    1
#> Honda Civic          30   4  75.7  52 4.93 1.615 18.52 1970-01-02  1    4    2
#> Toyota Corolla       33   4  71.1  65 4.22 1.835 19.90 1970-01-02  1    4    1
#> Toyota Corona        21   4 120.1  97 3.70 2.465 20.01 1970-01-02  0    3    1
#> Dodge Challenger     15   8 318.0 150 2.76  3.52 16.87 1970-01-01  0    3    2
#> AMC Javelin          15   8 304.0 150 3.15 3.435 17.30 1970-01-01  0    3    2
#> Camaro Z28           13   8 350.0 245 3.73  3.84 15.41 1970-01-01  0    3    4
#> Pontiac Firebird     19   8 400.0 175 3.08 3.845 17.05 1970-01-01  0    3    2
#> Fiat X1-9            27   4  79.0  66 4.08 1.935 18.90 1970-01-02  1    4    1
#> Porsche 914-2        26   4 120.3  91 4.43  2.14 16.70 1970-01-01  1    5    2
#> Lotus Europa         30   4  95.1 113 3.77 1.513 16.90 1970-01-02  1    5    2
#> Ford Pantera L       15   8 351.0 264 4.22  3.17 14.50 1970-01-01  1    5    4
#> Ferrari Dino         19   6 145.0 175 3.62  2.77 15.50 1970-01-01  1    5    6
#> Maserati Bora        15   8 301.0 335 3.54  3.57 14.60 1970-01-01  1    5    8
#> Volvo 142E           21   4 121.0 109 4.11  2.78 18.60 1970-01-02  1    4    2

For more information, see help(hablar) or vignette("convert").

Find the problem

When cleaning data you spend a lot of time understanding your data. Sometimes you get more rows than you expected when doing a left_join(). Or you did not know that a certain column contained missing values NA or irrational values like Inf or NaN.

In hablar the find_* functions speed up your search for the problem. To find duplicated rows you simply run df %>% find_duplicates(). You can also find duplicates in specific columns, which can be useful before joins.

# Create df with duplicates
df <- mtcars %>% 
  bind_rows(mtcars %>% slice(1, 5, 9))

# Return rows with duplicates in cyl and am
df %>% 
  find_duplicates(cyl, am)
#> # A tibble: 35 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> # … with 31 more rows
#> # ℹ Use `print(n = ...)` to see more rows
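
For comparison, the same kind of check can be sketched in base R. This is an illustration of the idea under the assumption that "duplicate" means any row whose cyl and am combination occurs more than once. It is not hablar's implementation.

```r
# Same data as above, built with base R instead of dplyr
df <- rbind(mtcars, mtcars[c(1, 5, 9), ])

# Flag every row whose (cyl, am) combination occurs more than once,
# scanning from both ends so first occurrences are included as well
key <- df[c("cyl", "am")]
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

dups <- df[is_dup, ]
nrow(dups)
```

Every combination of cyl and am occurs at least twice in this data, so all 35 rows are returned, in line with the output above.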

There are also find functions for other cases. For example, find_na() returns the rows with missing values in the selected columns.

starwars %>% 
  find_na(height)
#> # A tibble: 6 × 14
#>   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#> 1 Arvel Crynyd     NA    NA brown   fair    brown        NA male  mascu… <NA>   
#> 2 Finn             NA    NA black   dark    dark         NA male  mascu… <NA>   
#> 3 Rey              NA    NA brown   light   hazel        NA fema… femin… <NA>   
#> 4 Poe Dameron      NA    NA brown   light   brown        NA male  mascu… <NA>   
#> # … with 2 more rows, 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>, and abbreviated variable names
#> #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

If you would rather have a Boolean value, the check_* functions provide one. For example, check_duplicates() returns TRUE if the data frame contains duplicates and FALSE otherwise.
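
The idea of a TRUE/FALSE check can be sketched with base R's anyDuplicated(). This only illustrates the concept and makes no claim about how check_duplicates() works internally.

```r
# mtcars with three of its rows appended again
df <- rbind(mtcars, mtcars[c(1, 5, 9), ])

# anyDuplicated() returns the index of the first duplicated row, or 0
has_dups <- anyDuplicated(df) > 0
mtcars_has_dups <- anyDuplicated(mtcars) > 0

has_dups
mtcars_has_dups
```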

…apply the solution

Let’s say that we have found that a problem is caused by missing values in the column height, and that we want to replace all missing values with the integer 100. hablar comes with additional ways of doing if-or-else.

starwars %>% 
  find_na(height) %>% 
  mutate(height = if_na(height, 100L))
#> # A tibble: 6 × 14
#>   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#> 1 Arvel Crynyd    100    NA brown   fair    brown        NA male  mascu… <NA>   
#> 2 Finn            100    NA black   dark    dark         NA male  mascu… <NA>   
#> 3 Rey             100    NA brown   light   hazel        NA fema… femin… <NA>   
#> 4 Poe Dameron     100    NA brown   light   brown        NA male  mascu… <NA>   
#> # … with 2 more rows, 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>, and abbreviated variable names
#> #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

In the chunk above we successfully replaced all missing heights with the integer 100. hablar also contains the self-explanatory if_zero(), if_inf() and if_nan(), which work in the same way as the example above.
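
The replacement itself can be sketched in base R with replace(). The height vector below is hypothetical and only illustrates the operation. It is not taken from starwars.

```r
# Hypothetical heights with two missing values
height <- c(172L, NA, 96L, NA)

# Replace every NA with the integer 100, keeping the integer type
filled <- replace(height, is.na(height), 100L)

filled
typeof(filled)
```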

Introducing a third way to if or else

The generic function if_else_() provides the same rigidity as if_else() in dplyr but adds some flexibility. In dplyr you need to specify which type NA should have. In if_else_() you can write:

starwars %>% 
  mutate(skin_color = if_else_(hair_color == "brown", NA, hair_color))
#> # A tibble: 87 × 14
#>   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#> 1 Luke Skywal…    172    77 blond   blond   blue       19   male  mascu… Tatooi…
#> 2 C-3PO           167    75 <NA>    <NA>    yellow    112   none  mascu… Tatooi…
#> 3 R2-D2            96    32 <NA>    <NA>    red        33   none  mascu… Naboo  
#> 4 Darth Vader     202   136 none    none    yellow     41.9 male  mascu… Tatooi…
#> # … with 83 more rows, 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>, and abbreviated variable names
#> #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

In if_else() from dplyr you would have had to specify NA_character_.
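
The reason a type has to be named at all is that a bare NA is a logical value in R, while each vector type has its own typed missing constant. A short base R sketch, using a made-up hair_color vector and base ifelse() in place of the dplyr and hablar functions:

```r
# A bare NA is logical; NA_character_ is the character version
bare_type <- typeof(NA)
chr_type <- typeof(NA_character_)

# Hypothetical values, for illustration only
hair_color <- c("blond", "brown", NA, "none")

# Using the typed NA keeps the result a character vector
result <- ifelse(hair_color == "brown", NA_character_, hair_color)

bare_type
chr_type
result
```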