The s function is a simple function that helps you get intuitive results when summarizing data. It is made to be used in conjunction with summary functions, for example min, sum and mean. s takes a vector and modifies it in the following ways:
1. It replaces all non-rational numbers in numeric vectors with NA. Non-rational numbers are Inf, -Inf and NaN.
2. It removes NA from the vector by default.
3. If the vector has length zero or consists only of NA, it returns a single NA.
s(..., ignore_na = TRUE)
where … is one or more vectors. If missing values should not be omitted, use ignore_na = FALSE.
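The three steps listed above can be sketched in base R. This is only an illustration under stated assumptions: s_sketch is a hypothetical name, and the body is not hablar's actual source, just one way the described behaviour could be written.

```r
# Hypothetical base-R sketch of the three steps; not hablar's actual source.
s_sketch <- function(..., ignore_na = TRUE) {
  x <- c(...)
  # Step 1: in numeric vectors, replace Inf, -Inf and NaN with NA.
  if (is.numeric(x)) {
    x[!is.finite(x)] <- NA
  }
  # Step 2: drop missing values unless the caller asks to keep them.
  if (ignore_na) {
    x <- x[!is.na(x)]
  }
  # Step 3: an empty or all-NA vector collapses to a single NA.
  if (length(x) == 0 || all(is.na(x))) {
    return(NA)
  }
  x
}

s_sketch(c(NaN, 1, Inf))                  # 1
s_sketch(c(NA, 1, 2), ignore_na = FALSE)  # NA 1 2
s_sketch(numeric(0))                      # NA
```

The order of the steps matters: non-rational numbers are turned into NA first, so that the NA removal in step 2 also removes them.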
Removing NA:
x <- c(NA, 1, 2)
s(x)
#> [1] 1 2
Replacing non-rational numbers with NA and then removing NA:
x <- c(NaN, 1, Inf)
s(x)
#> [1] 1
Empty vectors return a single NA:
x <- c()
s(x)
#> NULL
In conjunction with a summary function:
x <- c(NaN, Inf, 3, 4)
median(s(x))
#> [1] 3.5
All programming languages have special cases where you get results that you did not expect. This is also true for R. The s function provides intuitive outcomes for some of the most basic commands in R. The next parts of the vignette explain some of the problems it solves in greater detail.
When learning R, users might be surprised when using simple summary functions. A summary function is a function that takes a vector and returns a single value, for example min(x), sum(x) and mean(x). A simple example:
x <- c(1, 2, 3, 4, 5)
sum(x)
#> [1] 15
In this example the output of sum() was 15, which is expected since all entries in x sum to 15. However, with messier data, the output is often less intuitive. New R users might be confused that the next example results in NA (a missing value):
x <- c(1, 2, 3, NA, 4)
mean(x)
#> [1] NA
Since the vector above has a missing value, R does not know how to find its mean. The missing value could be anything, so R returns NA. However, since missing values are common when working with real data, it is also common practice to ignore them. Usually the user tells R to ignore the missing values and return the mean of the values that can be averaged. The NA in the previous example can be avoided by adding na.rm = TRUE, which drops all missing values before calculating the mean:
x <- c(1, 2, 3, NA, 4)
mean(x, na.rm = TRUE)
#> [1] 2.5
Generally, R is strict about missing values so that you do not overlook them, which is often helpful rather than harsh! However, the programmer often wants R to return a ‘real’ value from the data, if there is one, even if that means ignoring missing values.
The s function helps you with this. Since it removes missing values by default, you can simply enter:
x <- c(1, 2, 3, NA, 4)
mean(s(x))
#> [1] 2.5
Adding an argument to remove all missing values is common practice when summarizing data. However, it is not uncommon that some vectors contain only missing values. Imagine an example where Amanda, David and Viktor sold sodas by the beach for three days. If someone did not show up, they get a missing value for that day.
#> # A tibble: 9 × 3
#> day name sold_sodas
#> <dbl> <chr> <dbl>
#> 1 1 Amanda 3
#> 2 2 Amanda NA
#> 3 3 Amanda 8
#> 4 1 David NA
#> # … with 5 more rows
#> # ℹ Use `print(n = ...)` to see more rows
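The printout above shows only the first four rows. To follow along, a data frame of the same shape could be built as in the sketch below. Note the assumptions: David's values for days 2 and 3 and all of Viktor's values are not shown in the source; they are hypothetical values chosen only so that the per-person maxima agree with the outputs reported in this vignette. Base R's data.frame and tapply are used so the sketch runs without extra packages.

```r
# Hypothetical reconstruction of the soda data; only rows 1-4 come from the
# printout above, the remaining values are assumptions for illustration.
df <- data.frame(
  day        = rep(1:3, times = 3),
  name       = rep(c("Amanda", "David", "Viktor"), each = 3),
  sold_sodas = c(3, NA, 8,    # Amanda: absent on day 2
                 NA, NA, NA,  # David: absent on all days (assumed for days 2-3)
                 4, 2, 1)     # Viktor: assumed values with a best day of 4
)

# Best day per person with missing values removed; max() warns for David
# because nothing is left to take the maximum of.
best_day <- suppressWarnings(
  tapply(df$sold_sodas, df$name, max, na.rm = TRUE)
)
best_day
# Amanda: 8, David: -Inf, Viktor: 4
```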
Now we want to see the maximum number of sodas each person sold on a single day. The above data frame is saved as df.
df %>%
  group_by(name) %>%
  summarize(n_sodas_best_day = max(sold_sodas, na.rm = TRUE))
#> # A tibble: 3 × 2
#> name n_sodas_best_day
#> <chr> <dbl>
#> 1 Amanda 8
#> 2 David -Inf
#> 3 Viktor 4
Amanda sold the most sodas in a single day. However, David, who was absent on all days, got the output -Inf. This means that negative infinity was the number of sodas he sold during his most productive day. That is astonishing! One would perhaps think that the more intuitive output would be NA.
The reason for this result is that we told R to remove all missing values before calculating the maximum. For David, nothing is left after the removal, so it is equivalent to:
x <- c()
max(x)
#> [1] -Inf
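The -Inf is not arbitrary, even if it is rarely what you want in a summary table. A short base R sketch of the reasoning: -Inf is the one value that leaves any maximum unchanged, so returning it for an empty vector keeps max() consistent when a vector is split into parts. The variable names below are chosen only for this illustration.

```r
# Base R only. max() of an empty vector is -Inf (R also emits a warning).
empty <- numeric(0)
max_of_empty <- suppressWarnings(max(empty))
max_of_empty  # -Inf

# -Inf is the identity element of max(): combining it with any value leaves
# that value unchanged, so splitting a vector never changes its maximum.
x <- c(3, 8)
max(max(x), max_of_empty) == max(c(x, empty))  # TRUE
```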
We could try to remove the na.rm = TRUE argument from max().
df %>%
  group_by(name) %>%
  summarize(n_sodas_best_day = max(sold_sodas))
#> # A tibble: 3 × 2
#> name n_sodas_best_day
#> <chr> <dbl>
#> 1 Amanda NA
#> 2 David NA
#> 3 Viktor 4
Suddenly R tells us that Viktor had the best day, and Amanda, who was absent on the second day, got NA because R does not know how to find the maximum of a vector that contains a missing value. However, David also got NA this time, which makes sense.
Sometimes, calculating simple descriptive statistics can be a cumbersome task. The s function is there to support you! Since it returns NA if the vector is empty we get:
df %>%
  group_by(name) %>%
  summarize(n_sodas_best_day = max(s(sold_sodas)))
#> # A tibble: 3 × 2
#> name n_sodas_best_day
#> <chr> <dbl>
#> 1 Amanda 8
#> 2 David NA
#> 3 Viktor 4
Another astonishing result one might encounter occurs when R tries to return a value when there is none. Take this extract df from the starwars dataset from the R package dplyr.
df %>% head(10)
#> # A tibble: 10 × 4
#> name homeworld species height
#> <chr> <chr> <chr> <int>
#> 1 Luke Skywalker Tatooine Human 172
#> 2 C-3PO Tatooine Droid 167
#> 3 R2-D2 Naboo Droid 96
#> 4 Darth Vader Tatooine Human 202
#> # … with 6 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Say that we want to find the height of the tallest human from each homeworld. As a precaution, we drop all rows with missing values in the height column so that we do not get the same problem as before.
df %>%
  filter(!is.na(height)) %>%
  group_by(homeworld) %>%
  summarize(tallest_human = max(height[species == "Human"]))
#> # A tibble: 49 × 2
#> homeworld tallest_human
#> <chr> <dbl>
#> 1 Alderaan 191
#> 2 Aleen Minor -Inf
#> 3 Bespin 175
#> 4 Bestine IV 180
#> # … with 45 more rows
#> # ℹ Use `print(n = ...)` to see more rows
We got negative infinity -Inf again. How could this be?
This is because some homeworlds have no humans, e.g. Cerea. For those groups, height[species == "Human"] is empty and R tries to calculate the maximum value of nothing. The s function can help you out! Since it returns NA if the vector is empty we get:
df %>%
  filter(!is.na(height)) %>%
  group_by(homeworld) %>%
  summarize(tallest_human = max(s(height[species == "Human"])))
#> # A tibble: 49 × 2
#> homeworld tallest_human
#> <chr> <int>
#> 1 Alderaan 191
#> 2 Aleen Minor NA
#> 3 Bespin 175
#> 4 Bestine IV 180
#> # … with 45 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Now we get missing values for the homeworlds that do not have any humans. Makes sense.
Numerical vectors in R can include more than numbers and missing values (NA). They can also include the infinite values Inf and -Inf, as shown in the examples above. Furthermore, numerical vectors can include NaN, which means ‘not-a-number’. If the data frame you are using has NaN or Inf values, it may cause you problems when summarizing your data. Some examples:
x <- c(NaN, 1)
min(x)
#> [1] NaN
x <- c(Inf, 3, 4)
mean(x)
#> [1] Inf
x <- c(5, -Inf, 2)
sum(x)
#> [1] -Inf
Often when you summarize vectors that have NaN or Inf you want to treat them as missing values. Maybe they appeared by mistake when you accidentally divided a value by zero, since 1/0 = Inf in R. The s function solves this for you by replacing them with NA.
x <- c(NaN, 1)
min(s(x))
#> [1] 1
x <- c(Inf, 3, 4)
mean(s(x))
#> [1] 3.5
x <- c(5, -Inf, 2)
sum(s(x))
#> [1] 7
s and summary functions
If things get too messy with an extra function you might prefer the wrapper functions of s. All major summary functions have an s-wrapped alternative in hablar. These are accessed by adding an underscore to the name of the summary function, e.g. min_(x), which is equal to min(s(x)).
Repeating the previous exercises using wrappers for s would look like:
x <- c(NaN, 1)
min_(x)
#> [1] 1
x <- c(Inf, 3, 4)
mean_(x)
#> [1] 3.5
x <- c(5, -Inf, 2)
sum_(x)
#> [1] 7
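The wrapper idea, a summary function composed with s, can be illustrated with a small function factory in base R. Everything named here is hypothetical: s_standin is a simplified stand-in for s so the sketch runs without hablar installed, and with_s and the *_sketch functions are illustration names. This is not how hablar defines its underscore functions, only a sketch of the composition they are described as being equal to.

```r
# Hypothetical, simplified stand-in for s(): keep only finite values
# (drops NA, NaN, Inf and -Inf) and return a single NA if nothing is left.
s_standin <- function(x) {
  x <- x[is.finite(x)]
  if (length(x) == 0) NA else x
}

# Hypothetical factory: wrap a summary function so it cleans its input first.
with_s <- function(f) {
  function(x, ...) f(s_standin(x), ...)
}

min_sketch  <- with_s(min)
mean_sketch <- with_s(mean)
sum_sketch  <- with_s(sum)

min_sketch(c(NaN, 1))      # 1
mean_sketch(c(Inf, 3, 4))  # 3.5
sum_sketch(c(5, -Inf, 2))  # 7
min_sketch(c(NA, NaN))     # NA
```

A factory keeps each wrapper to one line, and any summary function that takes a vector as its first argument can be wrapped the same way.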
To summarize, s can help you get results when you summarize your data, if there is a sensible answer in the vector. If not, you will get NA.