Get started

Koen Derks

last modified: 11-08-2021

Welcome to the ‘Get started’ page of the digitTests package. This vignette provides detailed examples of how to use the functions in the package.

Function: extract_digits()

The workhorse of the package is the extract_digits() function. This function takes a vector of numbers and returns the requested digits, either including or excluding zeros.

Example:

x <- c(0.00, 0.20, 1.23, 40.00, 54.04)
extract_digits(x, check = 'first', include.zero = FALSE)
## [1] NA  2  1  4  5
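
To make the idea of a "first digit" concrete, the following base R sketch reproduces the result above without the package. The helper first_nonzero_digit() is a hypothetical name introduced here for illustration. It is not how extract_digits() is implemented; it only shows one way to obtain the leading non-zero digit.

```r
# Illustrative helper (hypothetical, not part of digitTests)
first_nonzero_digit <- function(x) {
  x <- abs(x)
  out <- rep(NA_integer_, length(x))
  # Zeros and non-finite values have no leading non-zero digit
  ok <- is.finite(x) & x > 0
  # In scientific notation the first character is the leading significant digit
  sci <- formatC(x[ok], format = "e", digits = 15)
  out[ok] <- as.integer(substr(sci, 1, 1))
  out
}

x <- c(0.00, 0.20, 1.23, 40.00, 54.04)
first_nonzero_digit(x)
## [1] NA  2  1  4  5
```

The value 0.00 maps to NA because it contains no non-zero digit, which matches the behaviour shown for include.zero = FALSE.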

Functions: distr.test() & distr.btest()

The functions distr.test() and distr.btest() take a vector of numeric values, extract the requested digits, and compare the frequencies of these digits to a reference distribution. The function distr.test() performs a frequentist hypothesis test of the null hypothesis that the digits are distributed according to the reference distribution and produces a p value. The function distr.btest() performs a Bayesian hypothesis test of the same null hypothesis against the alternative hypothesis that the digits are not distributed according to the reference distribution, using the prior parameters specified in alpha, and produces a Bayes factor (Kass & Raftery, 1995). The possible options for the check argument are the same as those of extract_digits().
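
The Bayesian comparison described above can be sketched in base R. The sketch assumes that the digit counts follow a multinomial distribution and that alpha holds the parameters of a Dirichlet prior under the alternative hypothesis; both are assumptions made for illustration, and the function name log_bf01_multinomial() is hypothetical. It is not presented as the code used by distr.btest().

```r
# Log Bayes factor in favour of a point null p0 over a Dirichlet(alpha)
# alternative for multinomial counts (illustrative sketch only)
log_bf01_multinomial <- function(counts, p0, alpha = rep(1, length(counts))) {
  stopifnot(length(counts) == length(p0), length(alpha) == length(counts))
  # Log multivariate beta function: sum(lgamma(a)) - lgamma(sum(a))
  lmbeta <- function(a) sum(lgamma(a)) - lgamma(sum(a))
  # Log marginal likelihoods; the multinomial coefficient cancels in the ratio
  log_m0 <- sum(counts * log(p0))
  log_m1 <- lmbeta(alpha + counts) - lmbeta(alpha)
  log_m0 - log_m1
}

# Toy check with two categories: analytically BF01 = 0.25 / (1/3) = 0.75
bf01 <- exp(log_bf01_multinomial(counts = c(2, 0), p0 = c(0.5, 0.5)))
```

Working on the log scale avoids overflow when the counts are large, which is why very large Bayes factors can still be computed reliably.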

Example:

Benford’s law (Benford, 1938) is a principle that describes a pattern in many naturally occurring numbers. According to Benford’s law, each possible leading digit d = 1, ..., 9 in a naturally occurring, or non-manipulated, set of numbers occurs with probability p(d) = log10(1 + 1/d). The distribution of leading digits in a data set of financial transaction values (e.g., the sinoForest data) can be extracted and tested against the expected frequencies under Benford’s law using the code below.

# Frequentist hypothesis test
distr.test(sinoForest$value, check = 'first', reference = 'benford')
## 
##  Digit distribution test
## 
## data:  sinoForest$value
## n = 772, X-squared = 7.6517, df = 8, p-value = 0.4682
## alternative hypothesis: leading digit(s) are not distributed according to the benford distribution.
# Bayesian hypothesis test using default prior
distr.btest(sinoForest$value, check = 'first', reference = 'benford', BF10 = FALSE)
## 
##  Digit distribution test
## 
## data:  sinoForest$value
## n = 772, BF01 = 6899678
## alternative hypothesis: leading digit(s) are not distributed according to the benford distribution.
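
As a rough illustration of what the frequentist comparison involves, the sketch below computes the Benford probabilities from the formula above and runs a Pearson chi-squared goodness-of-fit test with base R. The data are simulated (an assumption made so that the example is self-contained) and the variable names are illustrative; this is not the internal code of distr.test().

```r
# Benford probabilities for the leading digits 1 to 9
d <- 1:9
p_benford <- log10(1 + 1 / d)

# Simulated values (illustrative assumption): 10^U with U uniform on [0, 3)
# has leading digits that follow Benford's law
set.seed(123)
x <- 10^runif(1000, min = 0, max = 3)

# Leading digit via scientific notation, then counts for the digits 1 to 9
first <- as.integer(substr(formatC(x, format = "e", digits = 15), 1, 1))
observed <- tabulate(first, nbins = 9)

# Goodness-of-fit test of the observed counts against the Benford probabilities
fit <- chisq.test(observed, p = p_benford)
fit$parameter  # degrees of freedom: 9 categories - 1 = 8
```

The degrees of freedom equal the number of digit categories minus one, which agrees with the df = 8 reported in the output above.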

Function: rv.test()

The function rv.test() analyzes the frequency with which values are repeated within a set of numbers. Unlike Benford’s law and its generalizations, this approach examines the entire number at once, not only the first or last digit. For the technical details of this procedure, see Simonsohn (2019). The possible options for the check argument are the same as those of extract_digits().

Example:

In this example we analyze a data set from a (retracted) paper that describes three experiments run in Chinese factories, where workers were nudged to use more hand sanitizer. These data were shown to exhibit two classic markers of data tampering: impossibly similar means and an uneven distribution of last digits (Yu, Nelson, & Simonsohn, 2018). We can use the rv.test() function to test whether these data also contain more repeated values than would be expected if the data had not been tampered with.

rv.test(sanitizer$value, check = 'lasttwo', B = 2000)
## 
##  Repeated values test
## 
## data:  sanitizer$value
## n = 1600, AF = 1.5225, p-value = 0.0025
## alternative hypothesis: average frequency in data is greater than for random data.
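
The AF statistic reported above is an average frequency. A minimal sketch of the idea, assuming AF is the sum of squared frequencies of the distinct values divided by the sum of those frequencies, is given below. The comparison against shuffled data is likewise an assumption about the general approach (integer parts held fixed, last two digits permuted), and the simulated data and all names are hypothetical. It should not be read as the implementation of rv.test().

```r
# Average frequency: how often a value occurs, averaged over observations
average_frequency <- function(x) {
  f <- as.numeric(table(x))  # frequency of each distinct value
  sum(f^2) / sum(f)
}

# Simulated data in integer cents to avoid floating-point ties (illustrative)
set.seed(1)
whole <- sample(1:20, 200, replace = TRUE)  # integer part
cents <- sample(0:99, 200, replace = TRUE)  # last two digits

af_obs <- average_frequency(whole * 100L + cents)

# Reference distribution: shuffle the last two digits across observations
af_null <- replicate(2000, average_frequency(whole * 100L + sample(cents)))

# One-sided p value: share of shuffled data sets with AF at least as large
p_value <- mean(af_null >= af_obs)
```

An AF of 1 means that every value is unique; larger values indicate that observations tend to share their value with other observations.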