| Type: | Package |
| Title: | Provides Batch Functions and Visualisation for Basic Statistical Procedures |
| Version: | 3.0.0 |
| Description: | Designed to streamline data analysis and statistical testing, reducing the length of R scripts while generating well-formatted outputs in 'pdf', 'Microsoft Word', and 'Microsoft Excel' formats. In essence, the package contains functions which are sophisticated wrappers around existing R functions that are called by using the 'f_' (user f_riendly) prefix followed by the normal function name. This third version of the 'rfriend' package focuses primarily on data exploration, including tools for creating summary tables, f_summary(), summary figures, f_scan(), outlier detection and removal, f_outlier() and f_remove_outliers(), and performing data transformations, f_boxcox(), in part based on 'MASS/boxcox' and 'rcompanion', and f_bestNormalize(), which wraps and extends functionality from the 'bestNormalize' package. Furthermore, 'rfriend' can automatically (or on request) generate visualizations such as boxplots, f_boxplot(), QQ-plots, f_qqnorm(), histograms, f_hist(), and density plots, f_density(). Additionally, the package includes several statistical test functions: f_aov(), f_chisq_test(), f_corplot(), f_kruskal_test(), f_lmer(), f_glm(), f_t_test(), f_wilcox_test(), for sequential testing and visualisation of the similarly named 'stats' functions. These functions, except for f_chisq_test(), support testing multiple response variables and predictors, while also handling assumption checks, data transformations, and post hoc tests. Post hoc results are automatically summarized in a table using the compact letter display (cld) format for easy interpretation. The package also provides a function for model comparison, f_model_comparison(), and several utility functions to simplify common R tasks.
For example, f_clear() clears the workspace and restarts R with a single command; f_setwd() sets the working directory to match the directory of the current script; f_theme() quickly changes 'RStudio' themes; and f_factors() converts multiple columns of a data frame to factors, and much more. If you encounter any issues or have feature requests, please feel free to contact me via email. |
| Note: | When loading, both MuMIn and rstatix are imported. Since rstatix internally depends on broom, this may trigger a warning about S3 method overwrites, specifically for nobs.fitdistr and nobs.multinom. These warnings are harmless and do not affect functionality. |
| License: | GPL-3 |
| Encoding: | UTF-8 |
| Depends: | R (≥ 4.4.0) |
| Imports: | bestNormalize, crayon, DHARMa, dplyr, emmeans, ggplot2, grDevices, gridExtra, knitr, lme4, lmerTest, magick, magrittr, multcomp, multcompView, MuMIn, nortest, pander, png, rlang, rmarkdown, rstatix, rstudioapi, stringr, this.path, tidyr, writexl, xfun |
| Suggests: | MASS, nnet, pbkrtest, testthat (≥ 3.0.0), tibble |
| Config/testthat/edition: | 3 |
| RoxygenNote: | 7.3.3 |
| SystemRequirements: | Pandoc (>= 3.2) |
| NeedsCompilation: | no |
| Packaged: | 2026-04-28 20:14:31 UTC; shvan |
| Author: | Sander H. van Delden [aut, cre] |
| Maintainer: | Sander H. van Delden <plantmind@proton.me> |
| URL: | https://delde001.github.io/rfriend/ |
| Repository: | CRAN |
| Date/Publication: | 2026-04-28 21:10:02 UTC |
Convert a data frame to a contingency table
Description
Convert a data frame to a contingency table
Usage
df_to_table(df, label_col = NULL)
Arguments
df |
A data frame. Either (a) one column contains row labels and the rest are
numeric, (b) a fully numeric data frame with meaningful |
label_col |
Index or name of the column containing row labels. If NULL (default),
the function auto-detects the first character/factor column. If no such column is
found, the function falls back to using |
Value
A contingency table.
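As a rough base-R illustration of what such a conversion amounts to under the default settings (auto-detecting the first character/factor column and using it as row labels), consider the following sketch. This is an illustration only, not the package's implementation:

```r
# Data frame with one character column (row labels) and numeric columns.
df <- data.frame(Gender   = c("Male", "Female"),
                 Agree    = c(100, 120),
                 Neutral  = c(150, 90),
                 Disagree = c(50, 40))

# Auto-detect the first character/factor column and use it as row labels.
label_col <- which(sapply(df, function(col) is.character(col) || is.factor(col)))[1]
m <- as.matrix(df[, -label_col, drop = FALSE])
rownames(m) <- df[[label_col]]
tab <- as.table(m)
```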
Perform multiple aov() functions with optional data transformation, inspection and post hoc tests.
Description
Performs an Analysis of Variance (ANOVA) on a given dataset with options for (Box-Cox) transformations, normality tests, and post hoc analysis. Several response parameters can be analysed in sequence and the generated output can be in various formats ('Word', 'pdf', 'Excel').
Usage
f_aov(
formula,
data = NULL,
norm_plots = TRUE,
interaction_plots = TRUE,
ANCOVA = FALSE,
transformation = TRUE,
force_transformation = NULL,
force_aov = FALSE,
alpha = 0.05,
adjust = "sidak",
intro_text = TRUE,
close_generated_files = FALSE,
open_generated_files = interactive(),
output_type = "default",
save_as = NULL,
save_in_wdir = FALSE,
...
)
Arguments
formula |
A formula specifying the model to be fitted. More response variables can be added using |
data |
A data frame containing the variables in the model. |
norm_plots |
Logical. If |
interaction_plots |
Logical. If |
ANCOVA |
Logical. If |
transformation |
Logical or character string. If |
force_transformation |
Character string. A vector containing the names of response variables that should be transformed regardless of the normality test. Default is |
force_aov |
Logical. If |
alpha |
Numeric. Significance level for ANOVA, post hoc tests, and Shapiro-Wilk test. Default is |
adjust |
Character string specifying the method used to adjust p-values for multiple comparisons. Available methods include:
Default is |
intro_text |
Logical. If |
close_generated_files |
Logical. Closes open Excel or Word (NOT pdf) files before writing, depending on the output format. Works on Windows (taskkill), macOS (pkill) and Linux (pkill/soffice). Default |
open_generated_files |
Logical. Whether to open the generated output
files after creation. Defaults to |
output_type |
Character string specifying the output format. Default is
|
save_as |
Character string specifying the output file path (without extension).
If a full path is provided, output is saved to that location.
If only a filename is given, the file is saved in |
save_in_wdir |
Logical. If |
... |
Additional arguments forwarded to |
Details
The function performs the following steps:
Check if all specified variables are present in the data.
Ensure that the response variable is numeric.
Perform Analysis of Variance (ANOVA) using the specified formula and data.
If shapiro = TRUE, check for normality of residuals using the Shapiro-Wilk test.
If residuals are not normal and transformation = TRUE, apply a data transformation.
If significant differences are found in the ANOVA, proceed with post hoc tests using estimated marginal means from emmeans() and Sidak adjustment (or another option of adjust =).
More response variables can be added using - or + (e.g., response1 + response2 ~ predictor) to do a sequential aov() for each response parameter captured in one output file.
Outputs can be generated in multiple formats ("pdf", "word", "excel" and "rmd") as specified by output_type. The function also closes any open 'Word' files to avoid conflicts when generating 'Word' documents. If output_type = "rmd" is used, it is advised to use it in a chunk with {r, echo=FALSE, results='asis'}.
*Non-significant ANOVA results*: When the overall F-test is not significant, f_aov still reports the estimated marginal means table, but with all pairwise comparison letters replaced by *"ns"*. The numeric estimates (and their confidence intervals) are provided because they are often needed for manuscript tables, especially when the response was back-transformed from a Box-Cox or bestNormalize scale - the raw descriptive means and the emmeans values can differ, and it is the emmeans values that correspond to the actual model. The *"ns"* labels signal that pairwise differences should not be interpreted.
This function requires [Pandoc](https://github.com/jgm/pandoc/releases/tag) (version 1.12.3 or higher), a universal document converter.
- Windows: Install Pandoc and ensure the installation folder (e.g., "C:/Users/your_username/AppData/Local/Pandoc") is added to your system PATH.
- macOS: If using Homebrew, Pandoc is typically installed in "/usr/local/bin". Alternatively, download the .pkg installer and verify that the binary's location is in your PATH.
- Linux: Install Pandoc through your distribution's package manager (commonly installed in "/usr/bin" or "/usr/local/bin") or manually, and ensure the directory containing Pandoc is in your PATH.
If Pandoc is not found, this function may not work as intended.
Value
An object of class 'f_aov' containing results from aov(), normality tests, transformations, and post hoc tests. Using the option "output_type", it can also generate output in the form of: R Markdown code, 'Word', 'pdf', or 'Excel' files. Includes print and plot methods for 'f_aov' objects.
Multiple Testing Across Response Variables
When several response variables are analysed in a single call
(e.g. y1 + y2 + y3 ~ treatment), each ANOVA is an independent
null-hypothesis test at level alpha. The post hoc adjustments
(adjust = "sidak", "tukey", etc.) only control the
family-wise error rate within one ANOVA (across pairwise group
comparisons for that response). They do not protect against
the inflation of Type I error across the set of responses.
Practical implication: With k independent response
variables all tested at \alpha = 0.05, the probability of
obtaining at least one false positive is
1 - (1 - 0.05)^k, which reaches ~40% for k = 10.
When this matters: The risk is highest in exploratory studies where many responses are screened simultaneously without a clear a priori hypothesis for each one. It is less of a concern when each response is a pre-specified primary outcome with its own biological rationale.
Possible remedies:
- Bonferroni correction across responses: use alpha = 0.05 / k, where k is the number of response variables. Conservative but simple.
- False Discovery Rate (FDR): apply p.adjust(p_values, method = "fdr") to the vector of per-response ANOVA p-values after the fact.
- MANOVA: if the responses are correlated and you want a single omnibus test across all of them, use manova() before interpreting individual ANOVAs.
- Pre-registration: declare primary vs. exploratory responses before data collection to justify differential correction thresholds.
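The arithmetic behind these remedies is plain base R; a short sketch (the p-values below are hypothetical):

```r
# Family-wise error rate for k = 10 independent tests at alpha = 0.05.
k <- 10
fwer <- 1 - (1 - 0.05)^k          # about 0.40

# Bonferroni across responses: per-test alpha.
alpha_bonf <- 0.05 / k            # 0.005

# FDR adjustment of per-response ANOVA p-values (hypothetical values).
p_values <- c(0.003, 0.020, 0.041, 0.260, 0.700)
p_fdr <- p.adjust(p_values, method = "fdr")
```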
Author(s)
Sander H. van Delden plantmind@proton.me
Examples
# Make a factor of Species.
iris$Species <- factor(iris$Species)
# The left hand side contains two response variables,
# so two aov's will be conducted, i.e. "Sepal.Width"
# and "Sepal.Length" in response to the explanatory variable: "Species".
f_aov_out <- f_aov(Sepal.Width + Sepal.Length ~ Species,
data = iris,
# Save output in MS Word file (Default is console)
output_type = "word",
# Do bestNormalize transformation for non-normal residual (Default is boxcox)
transformation = "bestnormalize"
)
# Print output to the console.
print(f_aov_out)
# Plot residual plots.
plot(f_aov_out)
#To print rmd output set chunk option to results = 'asis' and use cat().
f_aov_rmd_out <- f_aov(Sepal.Width ~ Species, data = iris, output_type = "rmd")
cat(f_aov_rmd_out$rmd)
f_bestNormalize: Automated Data Normalization with bestNormalize
Description
Applies optimal normalization transformations using 'bestNormalize', provides diagnostic checks, and generates comprehensive reports.
Usage
f_bestNormalize(
data,
alpha = 0.05,
plots = FALSE,
data_name = NULL,
output_type = "default",
save_as = NULL,
save_in_wdir = FALSE,
close_generated_files = FALSE,
open_generated_files = interactive(),
...
)
Arguments
data |
Numeric vector or single-column data frame. |
alpha |
Numeric. Significance level for normality tests (default = |
plots |
Logical. If |
data_name |
A character string to manually set the name of the data for plot axis and reporting. Default extracts name from input object. |
output_type |
Character string specifying the output format. Default is
|
save_as |
Character string specifying the output file path (without extension).
If a full path is provided, output is saved to that location.
If only a filename is given, the file is saved in |
save_in_wdir |
Logical. If |
close_generated_files |
Logical. Closes open Excel or Word (NOT pdf) files before writing, depending on the output format. Works on Windows (taskkill), macOS (pkill) and Linux (pkill/soffice). Default |
open_generated_files |
Logical. Whether to open the generated output
files after creation. Defaults to |
... |
Additional arguments passed to bestNormalize. |
Details
This is a wrapper around the 'bestNormalize' package. It provides formatted output, and the settings of 'bestNormalize' are tuned based on the sample size n.
If n < 100: loo = TRUE, allow_orderNorm = FALSE, and r is irrelevant because loo = TRUE.
If 100 <= n < 200: loo = FALSE, allow_orderNorm = TRUE, and r = 50.
If n >= 200: loo = FALSE, allow_orderNorm = TRUE, and r = 10.
These settings can be overridden by user options.
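These defaults can be restated as a small helper; the function below is a hypothetical re-statement of the documented rules, not code from 'rfriend':

```r
# Hypothetical helper mirroring the documented sample-size tuning.
bn_defaults <- function(n) {
  if (n < 100) {
    # r is irrelevant here because leave-one-out CV is used.
    list(loo = TRUE, allow_orderNorm = FALSE)
  } else if (n < 200) {
    list(loo = FALSE, allow_orderNorm = TRUE, r = 50)
  } else {
    list(loo = FALSE, allow_orderNorm = TRUE, r = 10)
  }
}
```

Any of these can still be overridden by passing loo, allow_orderNorm, or r through ... to bestNormalize.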
This function requires [Pandoc](https://github.com/jgm/pandoc/releases/tag) (version 1.12.3 or higher), a universal document converter.
- Windows: Install Pandoc and ensure the installation folder (e.g., "C:/Users/your_username/AppData/Local/Pandoc") is added to your system PATH.
- macOS: If using Homebrew, Pandoc is typically installed in "/usr/local/bin". Alternatively, download the .pkg installer and verify that the binary's location is in your PATH.
- Linux: Install Pandoc through your distribution's package manager (commonly installed in "/usr/bin" or "/usr/local/bin") or manually, and ensure the directory containing Pandoc is in your PATH.
If Pandoc is not found, this function may not work as intended.
Value
Returns an object of class 'f_bestNormalize' containing:
- transformed_data: Normalized vector.
- bestNormalize: Full bestNormalize object from the original package.
- data_name: Name of the analyzed dataset.
- transformation_name: Name of the selected transformation.
- shapiro_original: Shapiro-Wilk test results for the original data.
- shapiro_transformed: Shapiro-Wilk test results for the transformed data.
- norm_stats: Data frame of normality statistics for all methods.
- rmd: Rmd code if output_type = "rmd".
Also generates reports in 'Word', or 'pdf' files. When using output to console and plots = TRUE, the function prints QQ-plots, Histograms and a summary data transformation report. Includes print and plot methods for objects of class 'f_bestNormalize'.
Author(s)
Sander H. van Delden plantmind@proton.me
References
Peterson, C. (2025). bestNormalize: Flexibly calculate the best normalizing transformation for a vector. Available at: https://cran.r-project.org/package=bestNormalize
Examples
# Use set.seed to keep the outcome of bestNormalize stable.
set.seed(123)
# Create some skewed data (e.g., using a log-normal distribution).
skewed_data <- rlnorm(100, meanlog = 0, sdlog = 1)
# Basic usage: transform and store the full result object.
result <- f_bestNormalize(skewed_data, data_name = "Skewed log-normal data")
# Print a summary of the transformation.
print(result)
# Inspect normality statistics for all candidate transformations.
result$norm_stats
# Plot histograms and QQ-plots for original vs. transformed data.
plot(result)
# Use plots = TRUE to auto-plot when output_type = "default" (default).
result2 <- f_bestNormalize(skewed_data, plots = TRUE)
# Extract only the transformed (data) vector directly.
transformed_data <- f_bestNormalize(skewed_data)$transformed_data
# data.frame input: column name is used as data_name automatically.
df <- data.frame(measurement = skewed_data)
result_df <- f_bestNormalize(df)
# Data with NAs: NAs are preserved at their original positions.
skewed_na <- skewed_data
skewed_na[c(5, 20)] <- NA
result_na <- f_bestNormalize(skewed_na)
# Access a specific alternative transformation (first check what is available).
names(result$bestNormalize$other_transforms)
# Then extract the one you want, e.g.:
# result$bestNormalize$other_transforms$yeojohnson$x.t
# Force output to console (prints report + plots automatically).
f_bestNormalize(skewed_data, output_type = "console")
# Generate a PDF report saved to a custom path.
f_bestNormalize(skewed_data,
output_type = "pdf",
save_as = "my_report"
)
# Generate R Markdown output for use inside a .Rmd chunk
# (set chunk option results = 'asis').
rmd_result <- f_bestNormalize(skewed_data, output_type = "rmd")
cat(rmd_result$rmd)
f_boxcox: A User-Friendly Box-Cox Transformation
Description
Performs a Box-Cox transformation on a dataset to stabilize variance and make the data more normally distributed. It also provides diagnostic plots and tests for normality. The transformation is based on code from MASS/R/boxcox.R. The function prints \lambda to the console and returns the transformed data set.
Usage
f_boxcox(
data = data,
digits = 3,
range = c(-2, 2),
plots = NULL,
transform.data = TRUE,
eps = 1/50,
xlab = expression(lambda),
ylab = "log-Likelihood",
alpha = 0.05,
open_generated_files = interactive(),
close_generated_files = FALSE,
output_type = "default",
save_as = NULL,
save_in_wdir = FALSE,
...
)
Arguments
data |
A numeric vector or a data frame with a single numeric column. The data to be transformed. |
digits |
Numeric. Determines the accuracy of the estimate for lambda. Higher values increase computation time. Defaults to |
range |
A numeric vector of length 2 defining the search interval for lambda. Defaults to |
plots |
Logical. If |
transform.data |
Logical. If |
eps |
A small positive value used to determine when to switch from the power transformation to the log transformation for numerical stability. Default is |
xlab |
Character string. Label for the x-axis in plots. Default is an expression object representing |
ylab |
Character string. Label for the y-axis in plots. Default is "log-Likelihood". |
alpha |
Numeric. Significance level for the Shapiro-Wilk test of normality. Default is |
open_generated_files |
Logical. Whether to open the generated output
files after creation. Defaults to |
close_generated_files |
Logical. Closes open Excel or Word (NOT pdf) files before writing, depending on the output format. Works on Windows (taskkill), macOS (pkill) and Linux (pkill/soffice). Default |
output_type |
Character string specifying the output format. Default is
|
save_as |
Character string specifying the output file path (without extension).
If a full path is provided, output is saved to that location.
If only a filename is given, the file is saved in |
save_in_wdir |
Logical. If |
... |
Additional arguments passed to plotting functions. |
Details
The function uses the following formula for transformation:
y(\lambda) =
\begin{cases}
\frac{y^\lambda - 1}{\lambda}, & \lambda \neq 0 \\ \log(y), & \lambda = 0
\end{cases}
where y is the data being transformed and \lambda is the transformation parameter, which is estimated from the data using maximum likelihood. The function computes the Box-Cox transformation for a range of \lambda values and identifies the \lambda that maximizes the log-likelihood function. The beauty of this transformation is that it checks the suitability of many common transformations in one run. The most common transformations and their \lambda values are given below:
| \lambda-Value | Transformation |
| ———————– | ———————– |
| -2 | \frac{1}{x^2} |
| -1 | \frac{1}{x} |
| -0.5 | \frac{1}{\sqrt{x}} |
| 0 | \log(x) |
| 0.5 | \sqrt{x} |
| 1 | x |
| 2 | x^2 |
If the estimated transformation parameter closely aligns with one of the values listed in the previous table, it is generally advisable to select the table value rather than the precise estimated value. This approach simplifies interpretation and practical application.
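The transformation itself is a one-liner in base R; the sketch below follows the formula above, including the eps switch to the log case (an illustration, not the package's implementation):

```r
# Box-Cox transform for a fixed lambda, per the formula above.
boxcox_transform <- function(y, lambda, eps = 1/50) {
  if (abs(lambda) < eps) log(y) else (y^lambda - 1) / lambda
}

y <- c(1, 2, 4, 8)
boxcox_transform(y, lambda = 0)    # log(y)
boxcox_transform(y, lambda = 0.5)  # 2 * (sqrt(y) - 1)
boxcox_transform(y, lambda = 1)    # y - 1
```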
The function provides diagnostic plots: a plot of log-likelihood against \lambda values and a Q-Q plot of the transformed data. It also performs a Shapiro-Wilk test for normality on the transformed data if the sample size is less than or equal to 5000.
Note: For sample sizes greater than 5000, Shapiro-Wilk test results are not provided due to limitations in its applicability.
This function requires [Pandoc](https://github.com/jgm/pandoc/releases/tag) (version 1.12.3 or higher), a universal document converter.
- Windows: Install Pandoc and ensure the installation folder (e.g., "C:/Users/your_username/AppData/Local/Pandoc") is added to your system PATH.
- macOS: If using Homebrew, Pandoc is typically installed in "/usr/local/bin". Alternatively, download the .pkg installer and verify that the binary's location is in your PATH.
- Linux: Install Pandoc through your distribution's package manager (commonly installed in "/usr/bin" or "/usr/local/bin") or manually, and ensure the directory containing Pandoc is in your PATH.
If Pandoc is not found, this function may not work as intended.
Value
An object of class 'f_boxcox' containing, among others, results from the boxcox transformation, lambda, the input data, transformed data, Shapiro-Wilk test on original and transformed data. Using the option "output_type", it can also generate output in the form of: R Markdown code, 'Word', or 'pdf' files. Includes print and plot methods for 'f_boxcox' objects.
Author(s)
Sander H. van Delden plantmind@proton.me
Salvatore Mangiafico, mangiafico@njaes.rutgers.edu
W. N. Venables and B. D. Ripley
References
The core of calculating \lambda and the plotting was taken from:
file MASS/R/boxcox.R copyright (C) 1994-2004 W. N. Venables and B. D. Ripley
Some code to present the result was taken and modified from file:
rcompanion/R/transformTukey.r. (Developed by Salvatore Mangiafico)
The explanation of the Box-Cox transformation provided here was adapted from r-coder.
Examples
# Create non-normal data in a data.frame or vector.
df <- data.frame(values = rlnorm(100, meanlog = 0, sdlog = 1))
# Store the transformation in object "bc".
bc <- f_boxcox(df$values)
# Print lambda and the Shapiro-Wilk results.
print(bc)
# Plot the QQ plots, Histograms and Lambda Log-Likelihood estimation.
plot(bc)
# Or directly use the transformed data from the f_boxcox object.
df$values_transformed <- f_boxcox(df$values)$transformed_data
print(df$values_transformed)
Generate a Boxplot Report of a data.frame
Description
Generates boxplots for all numeric variables in a given dataset, grouped by factor variables. The function automatically detects numeric and factor variables. It allows two output formats ('pdf', 'Word') and includes an option to add a general explanation about interpreting boxplots.
Usage
f_boxplot(x, ...)
## S3 method for class 'formula'
f_boxplot(formula, data, ...)
## S3 method for class 'data.frame'
f_boxplot(x, ...)
f_boxplot_worker(
formula = NULL,
data,
fancy_names = NULL,
output_type = "pdf",
outliers = TRUE,
coef = 1.5,
limit_columns = 7,
save_as = NULL,
save_in_wdir = FALSE,
close_generated_files = FALSE,
open_generated_files = interactive(),
boxplot_explanation = TRUE,
detect_factors = TRUE,
jitter = FALSE,
width = 8,
height = 7,
units = "in",
res = 300,
las = 2
)
Arguments
x |
A data.frame or formula (dispatches to the right method). |
... |
Further arguments forwarded to |
formula |
A formula specifying the factor to be plotted. More response variables can be added using |
data |
A |
fancy_names |
An optional named vector mapping column names in |
output_type |
Character string, specifying the output format: |
outliers |
Logical. If |
coef |
Numeric. The multiplier for the Interquartile Range (IQR) used for outlier detection. Default |
limit_columns |
Integer or |
save_as |
Character string specifying the output file path (without extension).
If a full path is provided, output is saved to that location.
If only a filename is given, the file is saved in |
save_in_wdir |
Logical. If |
close_generated_files |
Logical. Closes open Excel or Word (NOT pdf) files before writing, depending on the output format. Works on Windows (taskkill), macOS (pkill) and Linux (pkill/soffice). Default |
open_generated_files |
Logical. Whether to open the generated output
files after creation. Defaults to |
boxplot_explanation |
A logical value indicating whether to include an explanation of how to interpret boxplots in the report. Defaults to |
detect_factors |
A logical value indicating whether to automatically detect factor variables in the dataset. Defaults to |
jitter |
A logical value, if |
width |
Numeric, png figure width default |
height |
Numeric, png figure height default |
units |
Character string, png figure units default |
res |
Numeric, png figure resolution default 300 dpi |
las |
An integer ( |
Details
The function performs the following steps:
Detects numeric and factor variables in the dataset.
Generates boxplots for each numeric variable grouped by each factor variable.
Outputs the report in the specified format ('pdf', 'Word' or 'Rmd').
If output_type = "rmd" is used, it is advised to use it in a chunk with {r, echo=FALSE, results='asis'}.
If no factor variables are detected, the function stops with an error message since factors are required for creating boxplots.
This function will plot all numeric and factor candidates; use subset() to prepare a selection of columns before submitting to f_boxplot().
Note the optional jitter argument, which plots all individual data points over the boxplots.
This function requires [Pandoc](https://github.com/jgm/pandoc/releases/tag) (version 1.12.3 or higher), a universal document converter.
Windows: Install Pandoc and ensure the installation folder
(e.g., "C:/Users/your_username/AppData/Local/Pandoc") is added to your system PATH.
macOS: If using Homebrew, Pandoc is typically installed in "/usr/local/bin". Alternatively, download the .pkg installer and verify that the binary's location is in your PATH.
Linux: Install Pandoc through your distribution's package manager (commonly installed in "/usr/bin" or "/usr/local/bin") or manually, and ensure the directory containing Pandoc is in your PATH.
If Pandoc is not found, this function may not work as intended.
Value
The return value depends on output_type:
- "pdf" and "word": Writes a report file to save_as (or tempdir() by default) and returns NULL invisibly. The file can optionally be opened with open_generated_files = TRUE.
- "png": Writes one PNG file per response x factor combination into the directory given by save_as and returns NULL invisibly.
- "rmd": Returns the generated R Markdown content as a single character string (invisibly). No file is written and nothing is printed to the console. The caller can cat() the string, assign it to a variable, or embed it in a larger report (see Examples).
Author(s)
Sander H. van Delden plantmind@proton.me
Examples
# Example usage:
data(iris)
new_names = c(
"Sepal.Length" = "Sepal length (cm)" ,
"Sepal.Width" = "Sepal width (cm)",
"Petal.Length" = "Petal length (cm)",
"Petal.Width" = "Petal width (cm)",
"Species" = "Cultivar"
)
# Use the whole data.frame to generate an MS Word report and don't open it.
f_boxplot(iris,
fancy_names = new_names,
output_type = "word"
)
# Use a formula to plot several response parameters (response1 + response2, etc.)
# and generate a Word report without a boxplot explanation.
data(mtcars)
f_boxplot(hp + disp ~ gear*cyl,
data=mtcars,
boxplot_explanation = FALSE,
output_type = "word"
)
# Capture the R Markdown output as a string and render it inline.
# Use output_type = "rmd" to get the markdown back as a character value
# instead of writing a file. Useful for embedding in a larger knitr document.
rmd <- f_boxplot(iris,
output_type = "rmd",
boxplot_explanation = FALSE,
outliers = FALSE
)
# Display it in the console
cat(rmd)
# ...or splice it into a knitr child chunk with results = "asis":
# ```{r, echo=FALSE, results='asis'}
# cat(rmd)
# ```
Chi-squared Test with post hoc Analysis
Description
Performs a chi-squared test chisq.test, then automatically conducts post hoc analysis if the test is significant. The function provides adjusted p-values for each cell in the contingency table using a specified correction method.
Usage
f_chisq_test(
x,
y,
p = NULL,
method = "bonferroni",
digits = 3,
alpha = 0.05,
force_posthoc = FALSE,
...
)
Arguments
x |
A numeric vector (or factor), or a contingency table in matrix or table form. If a data frame is entered the function will try to convert it to a table using |
y |
A numeric vector; ignored if x is a matrix, table or data.frame. If x is a factor, y should be a factor of the same length. |
p |
A vector of probabilities of the same length as x. Default is |
method |
Character string specifying the adjustment method for p-values. Default is |
digits |
Integer specifying the number of decimal places for rounding. Default is |
alpha |
Numeric threshold for significance. Default is |
force_posthoc |
Logical indicating whether to perform post hoc tests even if the chi-squared test is not significant. Default is |
... |
Additional arguments passed to |
Details
The function first performs a chi-squared test using chisq.test. If the test is
significant (p < alpha) or if force_posthoc = TRUE, it conducts post hoc analysis by examining
the standardized residuals. The p-values for these residuals are adjusted using the specified method
to control for multiple comparisons.
If the input is a data frame, the function attempts to convert it to a table and displays the resulting table for verification.
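The residual-based post hoc step can be sketched in base R. This mirrors the Beasley & Schumacker approach the function cites (two-sided p-values from standardized residuals, adjusted across cells); it is an illustration, not the package's exact code:

```r
# 2x3 contingency table.
tab <- as.table(rbind(c(100, 150, 50),
                      c(120, 90, 40)))
ct <- chisq.test(tab)

# Two-sided p-value for each cell's standardized residual,
# then a Bonferroni adjustment across all cells.
p_raw <- 2 * pnorm(-abs(ct$stdres))
p_adj <- matrix(p.adjust(p_raw, method = "bonferroni"),
                nrow = nrow(tab), dimnames = dimnames(tab))
```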
Value
An object of class f_chisq_test containing:
- chisq_test_output: The output from chisq.test.
- adjusted_p_values: Matrix of adjusted p-values (for table/matrix input).
- adj_p_values: Vector of adjusted p-values (for vector input).
- posthoc_output_table: Data frame with observed values, expected values, standardized residuals, and adjusted p-values (for vector input).
- observed_vs_adj_p_value: Interleaved table of observed values and adjusted p-values (for table/matrix input).
- stdres_vs_adj_p_value: Interleaved table of standardized residuals and adjusted p-values (for table/matrix input).
Author(s)
Sander H. van Delden plantmind@proton.me
References
This function implements a post hoc analysis for chi-squared tests inspired by the methodology in:
Beasley, T. M., & Schumacker, R. E. (1995). Multiple Regression Approach to Analyzing Contingency Tables: Post Hoc and Planned Comparison Procedures. The Journal of Experimental Education, 64(1), 79-93.
The implementation draws inspiration from the 'chisq.posthoc.test' package by Daniel Ebbert.
Examples
# Chi.square on independence: Association between two variables.
# Create a contingency table.
my_table <- as.table(rbind(c(100, 150, 50), c(120, 90, 40)))
dimnames(my_table) <- list(Gender = c("Male", "Female"),
Response = c("Agree", "Neutral", "Disagree"))
# Perform chi-squared test with post hoc analysis.
f_chisq_test(my_table)
# Use a different adjustment method.
f_chisq_test(my_table, method = "holm")
# Other input forms also work, e.g. goodness-of-fit: match to a theoretical distribution.
# Observed frequencies of rolling a die (faces 1 to 6).
observed <- c(2, 2, 10, 20, 15, 11)
# Expected probabilities under a fair die.
expected_probs <- rep(1/6, 6)
# Chi-Square Goodness-of-Fit Test.
f_chisq_test(x = observed, p = expected_probs)
f_clear: Clear Various Aspects of the R Environment
Description
Provides a convenient way to clear different components of the R environment, including the console, memory, graphics, and more. It also offers the option to restart the R session. This can come in handy at the start of an R script.
Usage
f_clear(env = TRUE, gc = TRUE, console = TRUE, graph = TRUE, restart = FALSE)
Arguments
env |
Logical. If |
gc |
Logical. If |
console |
Logical. If |
graph |
Logical. If |
restart |
Logical. If |
Details
Console Clearing: Clears the console output.
Garbage Collection: Performs garbage collection to free memory from unreferenced objects.
Graph Clearing: Closes all open graphics devices.
Environment Clearing: Removes all objects from the global environment.
Session Restart: Restarts the R session (only available in 'RStudio').
Value
No return value, called for side effects, see details.
Note
The restart parameter requires 'RStudio' and its API package ('rstudioapi') to be installed and available.
Author(s)
Sander H. van Delden plantmind@proton.me
Examples
# Clear console, memory, and graphs, but NOT the environment.
f_clear(env = FALSE)
Conditional Rounding for Numeric Values
Description
Conditionally formats numeric values based on their magnitude. Values that are very small or very large are formatted using scientific notation, while other values are rounded to a specified number of decimal places. Integers are preserved without decimal places. When applied to a data frame, only numeric columns are processed. All output is returned as character strings.
Usage
f_conditional_round(
x,
threshold_small = 0.01,
threshold_large = 10000,
digits = 3,
replace_na = TRUE,
na_string = "-",
allow_integer_decimal_mix = FALSE
)
Arguments
x |
A numeric vector or data frame containing numeric columns to be formatted. |
threshold_small |
Numeric value. Values with absolute magnitude smaller than this threshold will be formatted using scientific notation. Default is 0.01. |
threshold_large |
Numeric value. Values with absolute magnitude larger than or equal to this threshold will be formatted using scientific notation. Default is 10000. |
digits |
Integer. Number of decimal digits to use in formatting. Default is 3. |
replace_na |
Logical. If TRUE, NA values are replaced with na_string. Default is TRUE. |
na_string |
Character string used to replace NA values when replace_na = TRUE. Default is "-". |
allow_integer_decimal_mix |
Logical. If TRUE, integers keep their integer formatting (no decimals) in columns that also contain decimal values. Default is FALSE. |
Details
The function applies the following formatting rules:
- Values smaller than threshold_small or larger than or equal to threshold_large are formatted in scientific notation with digits decimal digits.
- Integer values are formatted without decimal places.
- Non-integer values that don't require scientific notation are rounded to digits decimal places.
- NA values are replaced with na_string if replace_na = TRUE.
- Empty strings in the input are preserved.
For data frames, only numeric columns are processed; other columns remain unchanged.
Value
If input is a vector: A character vector of the same length as the input, with values formatted according to the specified rules.
If input is a data frame: A data frame with the same structure as the input, but with character columns formatted according to the specified rules.
Author(s)
Sander H. van Delden plantmind@proton.me
Examples
# Vector examples.
f_conditional_round(c(0.0001, 0.5, 3, 10000))
# Returns: "1.000e-04" "0.500" "3" "1.000e+04".
f_conditional_round(c(0.0001, 0.5, 3, 10000, NA), replace_na = TRUE)
# Returns: "1.000e-04" "0.500" "3" "1.000e+04" ""
# Data frame example.
df <- data.frame(
name = c("A", "B", "C"),
small_val = c(0.0001, 0.002, 0.5),
integer = c(1, 2, 3),
integer_mix = c(10, 20, 30.1),
large_val = c(10000, 5000, NA)
)
# Show only two digits.
f_conditional_round(df, digits = 2)
# To keep integers as integers (no decimals) in columns with mixed data
# (integers and decimals), set allow_integer_decimal_mix = TRUE.
f_conditional_round(df, allow_integer_decimal_mix = TRUE)
# Custom NA replacement string.
f_conditional_round(c(0.5, NA, 3), replace_na = TRUE, na_string = "-")
# Returns: "0.500" "-" "3"
f_conditional_round(c(0.5, NA, 3), replace_na = TRUE, na_string = "")
# Returns: "0.500" "" "3"
Correlation Plots with Factor Detection and Multiple Correlation Coefficients
Description
Creates correlation plots for numeric variables in a data frame. The upper
triangle displays Pearson r, Spearman \rho, and Kendall
\tau simultaneously for each pair. Factor variables are automatically detected and
used for grouping, i.e. point colouring and shaping. Ordinal variables are supported via
ordinal_vars: their diagonal labels are italicised and Pearson r
is greyed and bracketed for any pair that involves them. A separate legend file documents both
the grouping factors and the meaning of all three correlation symbols.
Usage
f_corplot(
data,
detect_factors = TRUE,
factor_table = FALSE,
factor_exclude = NULL,
factor_select = NULL,
unique_num_treshold = 8,
repeats_threshold = 2,
color_factor = "auto",
shape_factor = "auto",
print_legend = TRUE,
fancy_names = NULL,
ordinal_vars = NULL,
width = 15,
height = 15,
res = 600,
pointsize = 10,
close_generated_files = FALSE,
open_generated_files = interactive(),
output_type = "word",
save_as = NULL,
save_in_wdir = FALSE
)
Arguments
data |
A data.frame containing the variables to plot. |
detect_factors |
Logical. If TRUE, factor variables are automatically detected and used for grouping. Default is TRUE. |
factor_table |
Logical. If TRUE, prints a table with the properties of the detected factors in the console. Default is FALSE. |
factor_exclude |
A character vector specifying the names of the columns NOT to convert into factors. If NULL (default), no columns are excluded. |
factor_select |
A character vector specifying the names of the columns to convert into factors. If NULL (default), suitable columns are detected automatically. |
unique_num_treshold |
Numeric. A threshold of the amount of unique numbers a numeric column should have to keep it numeric, i.e. omit factor conversion. Default is 8. |
repeats_threshold |
Numeric. A threshold of the minimal number of repeats a numeric column should have to convert it to a factor. Default is 2. |
color_factor |
Character. Name of the factor variable used for point colours; "auto" (default) selects a suitable factor automatically. |
shape_factor |
Character. Name of the factor variable used for point shapes; "auto" (default) selects a suitable factor automatically. |
print_legend |
Logical. If TRUE, a separate legend file documenting the grouping factors and correlation symbols is generated. Default is TRUE. |
fancy_names |
Named character vector or NULL (default). Maps column names to more readable display names used in the plots. |
ordinal_vars |
Character vector or NULL (default). Names of columns to treat as ordinal variables (see Details). |
width |
Numeric. Plot width in centimetres. Default 15. |
height |
Numeric. Plot height in centimetres. Default 15. |
res |
Numeric. Resolution in DPI. Default 600. |
pointsize |
Numeric. Base font size. Default 10. |
close_generated_files |
Logical. Closes open Excel or Word (NOT pdf) files before writing, depending on the output format. Works on Windows (taskkill), macOS (pkill) and Linux (pkill/soffice). Default is FALSE. |
open_generated_files |
Logical. Whether to open the generated output files after creation. Defaults to interactive(). |
output_type |
Character. One of the supported output formats, e.g. "word" (default), "pdf", "excel" or "png". |
save_as |
Character or NULL (default). Output file path (without extension). |
save_in_wdir |
Logical. If TRUE, the output files are saved in the current working directory. Default is FALSE. |
Details
- Three correlations per panel: Every upper-triangle panel shows r (Pearson), \rho (Spearman), and \tau (Kendall) stacked vertically, so the reader can choose the most appropriate coefficient for each variable pair.
- Ordinal variables: Specify column names with ordinal_vars. Those variables appear in italic on the diagonal. For any pair where at least one variable is ordinal, Pearson r is shown greyed and in parentheses to signal it is technically inappropriate; Spearman and Kendall remain prominent.
- Factor detection: Only unordered factors are used for colour/shape aesthetics. Ordered factors (is.ordered()) are treated as ordinal data, not as grouping variables.
- Legend: The legend file documents the grouping factor levels (when present) and always includes an explanation of all three correlation symbols whenever a legend is generated.
- Constant columns: Zero-variance columns produce NA in all correlation panels.
This function requires [Pandoc](https://github.com/jgm/pandoc/releases/tag) (version 1.12.3 or higher), a universal document converter.
- Windows: Install Pandoc and ensure the installation folder (e.g., "C:/Users/your_username/AppData/Local/Pandoc") is added to your system PATH.
- macOS: If using Homebrew, Pandoc is typically installed in "/usr/local/bin". Alternatively, download the .pkg installer and verify that the binary's location is in your PATH.
- Linux: Install Pandoc through your distribution's package manager (commonly installed in "/usr/bin" or "/usr/local/bin") or manually, and ensure the directory containing Pandoc is in your PATH.
If Pandoc is not found, this function may not work as intended.
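A quick way to check this requirement from inside R is the 'rmarkdown' helper functions (a sketch, assuming the 'rmarkdown' package is installed):

```r
# Verify that a sufficiently recent Pandoc binary is on the PATH:
if (rmarkdown::pandoc_available("1.12.3")) {
  message("Pandoc ", rmarkdown::pandoc_version(), " found.")
} else {
  warning("Pandoc >= 1.12.3 not found; install it and add it to your PATH.")
}
```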
Value
No value is returned to the R environment. Output files are saved and opened automatically.
Author(s)
Sander H. van Delden plantmind@proton.me
Examples
data(mtcars)
mtcars_sub <- subset(mtcars, select = -c(am, qsec, vs))
f_corplot(mtcars_sub,
color_factor = "gear",
shape_factor = "cyl",
output_type = "png"
)
# With ordinal variables
data(iris)
fancy_names <- c(Sepal.Length = "Sepal Length (cm)",
Sepal.Width = "Sepal Width (cm)")
f_corplot(iris,
fancy_names = fancy_names,
ordinal_vars = "Petal.Width",
output_type = "png",
open_generated_files = FALSE)
Convert multiple columns to Factors in a data frame
Description
Converts multiple specified columns of a data frame into factors. If no columns are specified, it automatically detects and converts columns that are suitable to be factors. The function returns the entire data frame, including non-factor columns, and can report the properties of this new data frame in the console (properties = TRUE).
Usage
f_factors(
data,
select = NULL,
exclude = NULL,
properties = FALSE,
force_factors = FALSE,
unique_num_treshold = 8,
repeats_threshold = 2,
...
)
Arguments
data |
A data frame containing the columns to be converted. |
select |
A character vector specifying the names of the columns to convert into factors. If NULL (default), suitable columns are detected automatically. |
exclude |
A character vector specifying the names of the columns NOT to convert into factors. If NULL (default), no columns are excluded. |
properties |
Logical. If TRUE, prints a summary of the properties of the resulting data frame in the console. Default is FALSE. |
force_factors |
Logical. If TRUE, all columns (except those in exclude) are converted to factors. Default is FALSE. |
unique_num_treshold |
Numeric. A threshold of the amount of unique numbers a numeric column should have to keep it numeric, i.e. omit factor conversion. Default is 8. |
repeats_threshold |
Numeric. A threshold of the minimal number of repeats a numeric column should have to convert it to a factor. Default is 2. |
... |
Additional arguments passed to the factor() function. |
Details
If select is NULL, the function identifies columns with character data or numeric data with fewer than 8 unique values as candidates for conversion to factors. The function checks if all specified columns exist in the data frame and stops execution if any are missing.
Converts specified columns into factors, applying any additional arguments provided.
Outputs a summary data frame with details about each column, including its type, class, number of observations, missing values, factor levels, and labels.
Value
Returns the modified data frame with the specified (or all suitable) columns converted to factors. Can also force a print of a summary of the data frame's structure to the console (properties = TRUE).
Author(s)
Sander H. van Delden plantmind@proton.me
Examples
# Make a data.frame:
df <- data.frame(a = c("yes", "no", "yes", "yes", "no",
"yes", "yes", "no", "yes"),
b = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
c = c("apple", "kiwi", "banana", "apple", "kiwi",
"banana", "apple", "kiwi", "banana"),
d = c(1.1, 1.1, 3.4, 4.5, 5.4, 6.7, 7.8, 8.1, 9.8)
)
str(df)
# Convert specified columns to factors:
df1 <- f_factors(df, select = c("a", "c"))
str(df1)
# Convert all potential factor columns to factor but exclude column "b":
df2 <- f_factors(df, exclude = c("b"))
str(df2)
# Convert all columns to factor but exclude column "b":
df3 <- f_factors(df, exclude = c("b"), force_factors = TRUE)
str(df3)
# Or automatically detect and convert suitable columns to factors.
# Thus obtaining the same results as above automatically:
df4 <- f_factors(df)
str(df4)
# In example above col b was converted to a factor as the number of repeats = 2
# and the amount of unique numbers < 8. In order to keep b numeric we can also
# adjust the unique_num_treshold and/or repeats_threshold:
df5 <- f_factors(df, unique_num_treshold = 2)
str(df5)
# Use `properties = TRUE` to view the data frame's structure.
# This forces a printed output which gives more insight than the standard str() output.
df6 <- f_factors(df, properties = TRUE)
Perform multiple glm() functions with diagnostics, assumption checking, and post hoc analysis
Description
Performs Generalized Linear Model (GLM) analysis on a given dataset with options for diagnostics, assumption checking, and post hoc analysis. Several response parameters can be analyzed in sequence and the generated output can be in various formats ('Word', 'pdf', 'Excel').
Usage
f_glm(
formula,
family = gaussian(),
data = NULL,
diagnostic_plots = TRUE,
alpha = 0.05,
adjust = "sidak",
type = "response",
intro_text = TRUE,
dispersion_test = TRUE,
output_type = "default",
save_as = NULL,
save_in_wdir = FALSE,
close_generated_files = FALSE,
open_generated_files = interactive(),
influence_threshold = 2,
...
)
Arguments
formula |
A formula specifying the model to be fitted. More response variables can be added using + (e.g., response1 + response2 ~ predictor); a GLM is then fitted for each response in sequence. |
family |
The error distribution and link function to be used in the model (default: gaussian()). This can be a character string naming a family function, a family function or the result of a call to a family function. (See stats::family for details.) |
data |
A data frame containing the variables in the model. |
diagnostic_plots |
Logical. If TRUE, diagnostic plots of the fitted model are included in the output. Default is TRUE. |
alpha |
Numeric. Significance level for tests. Default is 0.05. |
adjust |
Character string specifying the method used to adjust p-values for multiple comparisons, passed to emmeans (e.g., "sidak", "tukey", "bonferroni", "holm", "none"). Default is "sidak". |
type |
Character string specifying the scale of emmeans post hoc results; "response" (default) reports results on the response scale. |
intro_text |
Logical. If TRUE, an explanatory introduction is included in the output. Default is TRUE. |
dispersion_test |
Logical. If TRUE, a dispersion test is performed on the fitted model. Default is TRUE. |
output_type |
Character string specifying the output format. Default is "default" (console output); other options are "pdf", "word", "excel" and "rmd". |
save_as |
Character string specifying the output file path (without extension). If a full path is provided, output is saved to that location. If only a filename is given, the file is saved in the R temporary directory, unless save_in_wdir = TRUE. |
save_in_wdir |
Logical. If TRUE, the output file is saved in the current working directory. Default is FALSE. |
close_generated_files |
Logical. Closes open Excel or Word (NOT pdf) files before writing, depending on the output format. Works on Windows (taskkill), macOS (pkill) and Linux (pkill/soffice). Default is FALSE. |
open_generated_files |
Logical. Whether to open the generated output files after creation. Defaults to interactive(). |
influence_threshold |
Numeric multiplier for the leverage threshold. Observations with hat values exceeding influence_threshold times the mean hat value are flagged as influential. Default is 2. |
... |
Additional arguments passed to glm(). |
Details
The function first checks if all specified variables are present in the data and ensures that the response variable is numeric.
It fits a Generalized Linear Model (GLM) using the specified formula, family, and data. Model diagnostics are performed with DHARMa (simulation-based residual checks including a KS test, dispersion test, and outlier test). High-leverage observations are flagged using hat values.
Significance of each predictor is assessed via Type II Analysis of Deviance (stats::drop1()). If significant effects are found, post hoc pairwise comparisons are performed using estimated marginal means from emmeans() with the chosen p-value adjustment method (default: Sidak). When complete separation is detected, the function falls back to likelihood ratio test (LRT) based pairwise comparisons, which are robust to separation.
More response variables can be added using + (e.g., response1 + response2 ~ predictor) to fit a sequential GLM for each response variable, captured in one output file.
Outputs can be generated in multiple formats ("pdf", "word", "excel" and "rmd") as specified by output_type. The function also closes any open 'Word' files to avoid conflicts when generating 'Word' documents. If output_type = "rmd" is used, it is advised to use it in a chunk with {r, echo=FALSE, results='asis'}.
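For example, when knitting an R Markdown report the returned markdown string can be emitted in place (a sketch following the Poisson example in the Examples section; the chunk must use results='asis'):

```r
# Inside an R Markdown chunk with options {r, echo=FALSE, results='asis'}:
glm_res <- f_glm(breaks ~ wool + tension,
                 data = warpbreaks,
                 family = poisson(link = "log"),
                 output_type = "rmd",
                 open_generated_files = FALSE)
cat(glm_res$rmd)  # inserts the generated markdown into the knitted document
```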
This function requires [Pandoc](https://github.com/jgm/pandoc/releases/tag) (version 1.12.3 or higher), a universal document converter.
- Windows: Install Pandoc and ensure the installation folder (e.g., "C:/Users/your_username/AppData/Local/Pandoc") is added to your system PATH.
- macOS: If using Homebrew, Pandoc is typically installed in "/usr/local/bin". Alternatively, download the .pkg installer and verify that the binary's location is in your PATH.
- Linux: Install Pandoc through your distribution's package manager (commonly installed in "/usr/bin" or "/usr/local/bin") or manually, and ensure the directory containing Pandoc is in your PATH.
If Pandoc is not found, this function may not work as intended.
Value
An object of class 'f_glm' (a named list, one entry per response variable) containing:
- model: The fitted glm object.
- summary: Output of summary(glm_fit).
- drop1: Type II Analysis of Deviance table from stats::drop1().
- diagnostics: DHARMa residual checks and hat-value based leverage diagnostics.
- posthoc: Estimated marginal means, pairwise comparisons, CLD letters, and summary table.
- sep_flag: Logical indicating whether complete separation was detected.
- lrt_pct_explained: McFadden's pseudo-R^2.
Using the option output_type, it can also generate output in the form of: R Markdown code, 'Word', 'pdf', or 'Excel' files. Includes print and plot methods for 'f_glm' objects.
Author(s)
Sander H. van Delden plantmind@proton.me
Examples
# GLM Binomial example with output to console
mtcars_mod <- mtcars
mtcars_mod$cyl <- as.factor(mtcars_mod$cyl)
glm_bin <- f_glm(vs ~ cyl,
family = binomial,
data = mtcars_mod,
output_type = "default")
print(glm_bin)
# GLM Binomial example with output to MS Word file
glm_bin_word <- f_glm(vs ~ cyl,
family = binomial,
data = mtcars_mod,
output_type = "word"
)
# GLM Poisson example with output to rmd text
data(warpbreaks)
glm_pos <- f_glm(breaks ~ wool + tension,
data = warpbreaks,
family = poisson(link = "log"),
intro_text = FALSE,
output_type = "rmd")
cat(glm_pos$rmd)
Plot a Histogram with an Overlaid Normal Curve
Description
This function creates a histogram of the provided data and overlays it with a normal distribution curve.
Usage
f_hist(
data,
main = NULL,
xlab = NULL,
probability = TRUE,
col = "white",
border = "black",
line_col = "red",
save_png = FALSE,
open_png = TRUE,
save_as = NULL,
save_in_wdir = FALSE,
width = 8,
height = 7,
units = "in",
res = 300,
...
)
Arguments
data |
A numeric vector of data values to be plotted. |
main |
A character string specifying the title of the histogram. Default is NULL. |
xlab |
A character string specifying the label for the x-axis. Default is the name of the data variable. |
probability |
A logical value indicating whether to plot a probability or frequency histogram. Default is TRUE (probability). |
col |
A character string specifying the fill color of the histogram bars. Default is "white". |
border |
A character string specifying the color of the histogram bar borders. Default is "black". |
line_col |
A character string specifying the color of the normal curve line. Default is "red". |
save_png |
A logical value, default FALSE. If TRUE, the plot is saved as a png file. |
open_png |
Logical. If TRUE, the generated png file is opened after creation. Default is TRUE. |
save_as |
Character string specifying the output file path (without extension). If a full path is provided, output is saved to that location. If only a filename is given, the file is saved in the R temporary directory, unless save_in_wdir = TRUE. |
save_in_wdir |
Logical. If TRUE, the png file is saved in the current working directory. Default is FALSE. |
width |
Numeric, png figure width. Default is 8. |
height |
Numeric, png figure height. Default is 7. |
units |
Character string, png figure units. Default is "in" (inches). |
res |
Numeric, png figure resolution. Default is 300. |
... |
Additional arguments to be passed to the hist() function. |
Details
The function first captures the name of the input variable for labeling purposes. It then calculates a sequence of x-values and corresponding y-values for a normal distribution based on the mean and standard deviation of the data. The histogram is plotted with specified aesthetics, and a normal curve is overlaid. To increase resolution you can use png(...,res = 600) or the 'RStudio' chunk setting, e.g. dpi=600.
Value
A histogram plot is created and the function returns this as a recordedplot.
Author(s)
Sander H. van Delden plantmind@proton.me
Examples
# Example usage:
set.seed(123)
sample_data <- rnorm(100)
f_hist(sample_data)
Perform multiple Kruskal-Wallis tests with a user-friendly output file, data inspection, and Dunn's test (from 'rstatix') as post hoc.
Description
Performs the Kruskal-Wallis rank sum test to assess whether there are statistically significant differences in the distributions (mean ranks) of three or more independent groups. It provides detailed outputs, including plots, assumption checks, and post hoc analyses using Dunn's test. Results can be saved in various formats ('pdf', 'Word', 'Excel', or console only) with customizable output options.
Usage
f_kruskal_test(
formula,
data = NULL,
plot = TRUE,
alpha = 0.05,
output_type = "default",
save_as = NULL,
save_in_wdir = FALSE,
intro_text = TRUE,
adjust = "bonferroni",
close_generated_files = FALSE,
open_generated_files = interactive(),
...
)
Arguments
formula |
A formula specifying the response and predictor variable (e.g., response ~ group). More response variables can be added using + (e.g., response1 + response2 ~ group). |
data |
A data.frame containing the variables referenced in the formula. |
plot |
Logical. If TRUE, density plots and boxplots of the data are generated. Default is TRUE. |
alpha |
Numeric. The significance level for the Kruskal-Wallis test and Dunn's test. Default is 0.05. |
output_type |
Character string specifying the output format. Default is "default" (console output); other options are "pdf", "word", "excel" and "rmd". |
save_as |
Character string specifying the output file path (without extension). If a full path is provided, output is saved to that location. If only a filename is given, the file is saved in the R temporary directory, unless save_in_wdir = TRUE. |
save_in_wdir |
Logical. If TRUE, the output file is saved in the current working directory. Default is FALSE. |
intro_text |
Logical. If TRUE, an explanatory introduction is included in the output. Default is TRUE. |
adjust |
Character string. Adjustment method for pairwise comparisons in Dunn's test. Options include "bonferroni", "holm", "hochberg", "hommel", "BH", "BY", "fdr" and "none". Default is "bonferroni". |
close_generated_files |
Logical. Closes open Excel or Word (NOT pdf) files before writing, depending on the output format. Works on Windows (taskkill), macOS (pkill) and Linux (pkill/soffice). Default is FALSE. |
open_generated_files |
Logical. Whether to open the generated output files after creation. Defaults to interactive(). |
... |
Additional arguments forwarded to kruskal.test(). |
Details
This function offers a comprehensive workflow for non-parametric analysis using the Kruskal-Wallis test:
Assumption Checks: Optionally includes a summary of assumptions in the output.
Visualization: Generates density plots and boxplots to visualize group distributions.
Post hoc Analysis: Conducts Dunn's test with specified correction methods if significant differences are found.
Output files are generated in the format specified by output_type and saved to the working directory; options are "pdf", "word" or "excel". If output_type = "rmd" is used, it is advised to use it in a chunk with {r, echo=FALSE, results='asis'}.
This function requires [Pandoc](https://github.com/jgm/pandoc/releases/tag) (version 1.12.3 or higher), a universal document converter.
- Windows: Install Pandoc and ensure the installation folder (e.g., "C:/Users/your_username/AppData/Local/Pandoc") is added to your system PATH.
- macOS: If using Homebrew, Pandoc is typically installed in "/usr/local/bin". Alternatively, download the .pkg installer and verify that the binary's location is in your PATH.
- Linux: Install Pandoc through your distribution's package manager (commonly installed in "/usr/bin" or "/usr/local/bin") or manually, and ensure the directory containing Pandoc is in your PATH.
If Pandoc is not found, this function may not work as intended.
Value
An object of class 'f_kruskal_test' (a named list, one entry per response-predictor combination) containing:
- kruskal.test: The htest object from kruskal.test().
- dunn_test: Data frame of pairwise Dunn's test results from rstatix::dunn_test().
- summary_table: Descriptive statistics with compact letter display (Letters column).
- alpha: The significance level used.
- DunnTest_adjust: The p-value adjustment method used.
- distributions: ggplot density plot (if plot = TRUE).
- Boxplot: ggplot boxplot with CLD letters (if plot = TRUE).
Using the option output_type, it can also generate output in the form of: R Markdown code, 'Word', 'pdf', or 'Excel' files. Includes print and plot methods for 'f_kruskal_test' objects.
Multiple Testing Across Response Variables
When several response variables are analysed in a single call
(e.g. y1 + y2 + y3 ~ treatment), each Kruskal-Wallis test is an
independent null-hypothesis test at level alpha. The post hoc
adjustment (e.g. adjust = "bonferroni") only controls the
family-wise error rate within one test (across pairwise Dunn
comparisons for that response). It does not protect against
the inflation of Type I error across the set of responses.
Practical implication: With k independent response
variables all tested at \alpha = 0.05, the probability of
obtaining at least one false positive is
1-(1-0.05)^k, which reaches ~40% for k = 10.
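The figure above can be reproduced directly in base R (a minimal sketch):

```r
# Probability of at least one false positive across k independent tests:
fwe <- function(k, alpha = 0.05) 1 - (1 - alpha)^k
round(fwe(10), 3)                 # ~0.401 for k = 10 at alpha = 0.05
# A Bonferroni-style per-test alpha keeps the family-wise rate near 0.05:
round(fwe(10, alpha = 0.05 / 10), 3)
```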
Author(s)
Sander H. van Delden plantmind@proton.me
Examples
# Example usage:
data(iris)
# Perform Kruskal-Wallis test on Sepal.Length and Sepal.Width by Species
# with "holm" correction for posthoc dunn_test, without showing the output.
output <- f_kruskal_test(
Sepal.Width + Sepal.Length ~ Species,
data = iris,
plot = FALSE,
output_type = "word",
adjust = "holm"
)
# Save Kruskal-Wallis test and posthoc to Excel sheets: Sepal.Width and Sepal.Length.
f_kruskal_out <- f_kruskal_test(
Sepal.Width + Sepal.Length ~ Species,
data = iris,
plot = FALSE,
output_type = "excel",
adjust = "holm"
)
Fit a linear mixed model with lme4::lmer() including assumption checks, diagnostics, R-squared and post hoc tests.
Description
Fits a linear mixed-effects model using lme4::lmer() (with p-values
from lmerTest) and produces a fully-formatted report containing
the fixed-effects table, random-effects variance components, model-fit
indices (AIC, BIC, logLik, marginal & conditional R^2), residual and
BLUP diagnostics, convergence / singular-fit warnings, and post hoc
comparisons (emmeans) on factor fixed effects. Results can be
returned to the console or written to 'pdf', 'Word' or 'Excel'.
Usage
f_lmer(
formula,
data = NULL,
REML = TRUE,
ddf = "Satterthwaite",
alpha = 0.05,
adjust = "sidak",
norm_plots = TRUE,
post_hoc = TRUE,
intro_text = TRUE,
output_type = "default",
save_as = NULL,
save_in_wdir = FALSE,
close_generated_files = FALSE,
open_generated_files = interactive(),
...
)
Arguments
formula |
A two-sided formula passed to lmer(), including at least one random-effects term, e.g. y ~ treatment + (1 | subject). More than one response variable can be supplied on the left-hand side using + (e.g., y1 + y2 ~ treatment + (1 | subject)); a model is then fitted for each response in sequence. |
data |
A data frame containing the variables in the model. |
REML |
Logical. If TRUE (default), the model is fitted by restricted maximum likelihood; if FALSE, by ordinary maximum likelihood. |
ddf |
Character. Method for computing denominator degrees of freedom for fixed-effects p-values: "Satterthwaite" (default) or "Kenward-Roger". |
alpha |
Numeric. Significance level for the fixed-effects table and the post hoc tests. Default is 0.05. |
adjust |
Character. Method used to adjust p-values for multiple pairwise comparisons in the post hoc step (passed to emmeans), e.g. "sidak" (default), "tukey", "bonferroni" or "none". |
norm_plots |
Logical. If TRUE, Q-Q plots of the residuals and BLUPs are included in the output. Default is TRUE. |
post_hoc |
Logical. If TRUE, post hoc comparisons are performed on factor fixed effects. Default is TRUE. |
intro_text |
Logical. If TRUE, an explanatory introduction is included in the output. Default is TRUE. |
output_type |
Character. Output format. One of: "default" (console output), "pdf", "word", "excel" or "rmd". Default is "default". |
save_as |
Character. Output file path (without extension). If a full path is provided, output is saved to that location; if only a filename is given, the file is saved in the R temporary directory, unless save_in_wdir = TRUE. |
save_in_wdir |
Logical. If TRUE, the output file is saved in the current working directory. Default is FALSE. |
close_generated_files |
Logical. Closes any open Word or Excel files before writing. Cross-platform (Windows taskkill, macOS / Linux pkill). Default is FALSE. |
open_generated_files |
Logical. Whether to open the generated output files after creation. Defaults to interactive(). |
... |
Additional arguments forwarded to lmer(). |
Details
What is a linear mixed model?
A linear mixed model (LMM) extends ordinary regression / ANOVA by
allowing two kinds of effects:
- Fixed effects - factors you actively manipulated or whose specific levels you care about (treatment, dose, time, genotype). Reported as estimates with confidence intervals.
- Random effects - grouping structure that creates non-independence in your data but whose levels are a random sample from a larger population (subjects measured repeatedly, plots within fields, observers, batches). Reported as variance components.
Use an LMM whenever observations share something that makes them more
alike than two random observations from the dataset. Ignoring such
grouping (running a plain aov or lm) is
pseudoreplication, i.e. treating non-independent observations
as if they were independent: standard errors shrink, p-values shrink,
false positives explode.
Vocabulary.
Before going further, a few terms used throughout the report:
- Subject - the experimental unit that is measured repeatedly (a person, animal, pot or plot, cell line); in lme4 syntax it is the grouping factor on the right of the |, e.g. (1 | subject).
- Within-subject factor - a predictor whose levels vary within the same subject (time in a longitudinal study, treatment in a cross-over study).
- Between-subject factor - a predictor whose levels vary across subjects but are constant within a subject (sex, genotype, treatment arm in a parallel-groups trial). Both within- and between-subject factors are fixed effects.
- BLUP - Best Linear Unbiased Predictor. The model's estimate of the random-effect value for each subject (e.g. how much a particular subject deviates from the population intercept). BLUPs are checked for normality just like residuals.
- ICC - intraclass correlation coefficient. The share of total variance attributable to between-group differences. ICC = 0 means the grouping factor is irrelevant; ICC = 1 means observations within a group are identical.
- REML - restricted maximum likelihood. The default fitting method for variance components; gives less biased estimates than ordinary maximum likelihood.
- Satterthwaite / Kenward-Roger - methods to approximate the denominator degrees of freedom for fixed-effect p-values, since there is no exact df in an LMM.
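As an illustration of the ICC defined above, it can be computed by hand from the variance components of a fitted model (a sketch using the 'sleepstudy' data shipped with 'lme4'; the model is a plain lme4 example, not a call to this package):

```r
library(lme4)
# Random-intercept model: reaction time over days, subjects measured repeatedly.
m <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
vc <- as.data.frame(VarCorr(m))                    # variance components
icc <- vc$vcov[vc$grp == "Subject"] / sum(vc$vcov)
icc  # share of total variance due to between-subject differences
```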
Reading the (1 | group) syntax.
Every random-effects term has the form ( <varying> | <group> ).
The bar reads as "varies by". The grouping factor on the right is what
creates the non-independence. The left side is what is allowed to differ
between groups. Common patterns:
- (1 | subject) - random intercept per subject (each subject has its own baseline). Repeated measures, longitudinal data.
- (1 | field) - randomised block design or multi-site trial; one intercept per block.
- (1 | field/plot) - plot nested in field; equivalent to (1 | field) + (1 | field:plot). Split-plot or hierarchical sampling.
- (1 + time | subject) - random intercept and random slope of time per subject. Subjects differ both in baseline and in how fast they change. Growth curves.
- (1 | subject) + (1 | observer) - crossed random effects: every observer can rate every subject. Inter-rater designs.
Rule of thumb: if you can answer "if I duplicated this
experiment, would I draw new levels of this factor?" with yes,
it belongs on the right of a |. If you would re-use the exact
same levels (e.g. control vs treated) it is a fixed effect.
When to use a linear mixed model.
The most common reason is a repeated-measures design, in
which the same experimental units are measured on more than one
occasion or under more than one treatment. Compared with a
between-groups design analysed by plain ANOVA this gives two real
advantages: fewer experimental units are needed (each subject acts
as its own control, removing between-subject variation from the
comparison) and individual differences cannot bias the treatment
groups (in a cross-over design every subject receives every
treatment). Two canonical examples:
- Longitudinal study - same subjects measured at several time points: y ~ time + (1 | subject). If subjects also differ in how fast they change, add a random slope: y ~ time + (1 + time | subject).
- Cross-over design - every subject receives every treatment in sequence: y ~ treatment + (1 | subject). If carry-over between periods is a concern, add period as a fixed effect.
LMMs also apply to non-repeated structures that still create non-independence: randomised block designs, split-plot trials, multi-site studies, inter-rater designs.
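The longitudinal pattern above can be sketched with lme4's built-in 'sleepstudy' data (a plain lme4 example, not a call to this package):

```r
library(lme4)
# Random intercept per subject: each subject has its own baseline.
m1 <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
# Random intercept AND random slope: subjects also differ in their rate of change.
m2 <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)
# Likelihood ratio test between the two structures (anova() refits with ML):
anova(m1, m2)
```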
Assumptions of a linear mixed model:
- Linearity in the parameters of the fixed-effects part.
- Independence of observations conditional on the random effects. If structure remains (e.g. temporal autocorrelation), more random effects or a correlation structure are needed.
- Normality of level-1 residuals (Q-Q plot of residuals(m)).
- Normality of the random-effect BLUPs (Q-Q plot of ranef(m)). This is the assumption most users forget.
- Homoscedasticity: residual variance roughly constant across fitted values and across grouping levels.
- At least ~5 levels of each grouping factor; with 3-4 levels it is usually better to treat the factor as fixed.
If Levene's test or the Shapiro-Wilk tests on residuals or BLUPs indicate a violation, the report adds a Recommendations for Heteroscedasticity and/or non-normal residuals section after the diagnostics with concrete next steps (generalised mixed model, transformation).
Convergence and singular fits.
f_lmer surfaces lme4 convergence warnings and the
"boundary (singular) fit" message prominently in the output. A singular
fit usually means the random-effects structure is too complex for the
data (often a random slope with too few levels) - simplify the model
before interpreting results.
This function requires Pandoc (>= 1.12.3) for pdf, word
and rmd output. See f_aov for installation notes.
Value
An object of class f_lmer: a named list containing the
fitted lmerModLmerTest model, the ANOVA-style fixed-effects
table, the variance components and ICC, the R^2 values, the
observed descriptives table (raw-data n, mean, sd, se, min, Q1,
median, Q3, max grouped by the categorical fixed-effect predictors),
post hoc results (if any), diagnostic plots, and convergence
diagnostics. When more than one response variable is supplied on the
left-hand side, these elements are nested one level deep under each
response name, e.g. out$y1$fixed_effects,
out$y2$fixed_effects. When output_type = "rmd" the
markdown string is stored in $rmd.
Multiple Testing Across Response Variables
When several response variables are analysed in a single call
(e.g. y1 + y2 + y3 ~ treatment + (1 | subject)), each linear
mixed model is an independent null-hypothesis test at level
alpha. The post hoc adjustments (adjust = "sidak",
"tukey", etc.) only control the family-wise error rate
within one model (across pairwise contrasts for that
response). They do not protect against the inflation of
Type I error across the set of responses.
Practical implication: With k independent response
variables all tested at \alpha = 0.05, the probability of
obtaining at least one false positive is
1 - (1 - 0.05)^k, which reaches ~40% for k = 10.
When this matters: The risk is highest in exploratory studies where many responses are screened simultaneously without a clear a priori hypothesis for each one. It is less of a concern when each response is a pre-specified primary outcome with its own biological rationale.
Possible remedies:
- Bonferroni correction across responses: use alpha = 0.05 / k, where k is the number of response variables. Conservative but simple.
- False Discovery Rate (FDR): apply p.adjust(p_values, method = "fdr") to the vector of per-response fixed-effect p-values after the fact.
- Multivariate model: if the responses are correlated and you want a single omnibus test, fit a joint multivariate mixed model (e.g. MCMCglmm, brms) before interpreting individual responses.
- Pre-registration: declare primary vs. exploratory responses before data collection to justify differential correction thresholds.
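The arithmetic behind the ~40% figure, and the FDR remedy, can be sketched in a few lines (the p-values below are hypothetical):

```r
# Family-wise error rate for k independent tests at alpha = 0.05.
k <- 10
fwer <- 1 - (1 - 0.05)^k
round(fwer, 2)  # approx. 0.40

# Hypothetical per-response fixed-effect p-values collected from separate fits:
p_values <- c(0.012, 0.034, 0.21, 0.047, 0.81)
p.adjust(p_values, method = "fdr")  # Benjamini-Hochberg adjusted p-values
```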
Author(s)
Sander H. van Delden plantmind@proton.me
Examples
# sleepstudy: reaction time vs days of sleep deprivation,
# repeated measures within Subject (ships with lme4).
data(sleepstudy, package = "lme4")
# 1) Random intercept per subject - the simplest mixed model.
# Each subject has its own baseline reaction time; the fixed
# effect of Days is the average slope across subjects.
# With output_type = "default" (the default), the result auto-
# prints if not assigned, so no print() call is needed.
f_lmer_out <- f_lmer(Reaction ~ Days + (1 | Subject),
data = sleepstudy)
# Re-print the stored result and show the diagnostic plots.
print(f_lmer_out)
plot(f_lmer_out)
# 2) Random intercept AND random slope of Days per subject,
# fitted with Kenward-Roger denominator df, saved to MS Word.
f_lmer(Reaction ~ Days + (1 + Days | Subject),
data = sleepstudy,
ddf = "Kenward-Roger",
output_type = "word"
)
# 3) A factor fixed effect triggers a post hoc test.
# Bin Days into three sleep-deprivation phases so that the
# fixed effect is categorical and emmeans pairwise comparisons
# with a compact letter display are produced automatically.
sleepstudy$Phase <- cut(sleepstudy$Days,
breaks = c(-Inf, 2, 6, Inf),
labels = c("early", "mid", "late"))
f_lmer(Reaction ~ Phase + (1 | Subject),
data = sleepstudy,
adjust = "tukey")
# 4) A minimal report: suppress the intro text and the diagnostic
# plots, and save it directly to MS Word. Useful when embedding
# many models in one document or when you only need the tables.
f_lmer(Reaction ~ Days + (1 | Subject),
data = sleepstudy,
intro_text = FALSE,
norm_plots = FALSE,
output_type = "word"
)
# 5) Get the raw markdown back for embedding in an R Markdown
# document. Use it inside a chunk with results = 'asis'.
f_lmer_rmd_out <- f_lmer(Reaction ~ Days + (1 | Subject),
data = sleepstudy,
output_type = "rmd")
cat(f_lmer_rmd_out$rmd)
# 6) Two response variables analysed in one call. A separate model
# is fit for each, sharing the same right-hand side. The results
# are nested under each response name.
sleepstudy$Reaction2 <- sleepstudy$Reaction + rnorm(nrow(sleepstudy), 0, 5)
multi_out <- f_lmer(Reaction + Reaction2 ~ Days + (1 | Subject),
data = sleepstudy,
intro_text = FALSE,
norm_plots = FALSE)
multi_out$Reaction$fixed_effects
multi_out$Reaction2$fixed_effects
Install and Load Multiple R Packages
Description
Checks if the specified packages are installed. If not, it installs them and then loads them into the global R session.
Usage
f_load_packages(...)
Arguments
... |
Unquoted or quoted names of packages to be installed and loaded. These should be valid package names available on CRAN. |
Details
The function takes a list or vector indicating package names, installs any that are missing, and loads all specified packages into the global environment of the R session. It uses requireNamespace() to check for installation and library() to load the packages.
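The install-then-load pattern the function wraps can be sketched as follows (illustrative only, not the package's actual source):

```r
# Minimal sketch: install any missing packages, then attach all of them.
load_packages <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)              # install only if missing
    }
    library(pkg, character.only = TRUE)  # attach to the search path
  }
}
# load_packages(c("ggplot2", "dplyr"))
```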
Value
None. The function is called for its side effects of installing and loading packages.
Author(s)
Sander H. van Delden plantmind@proton.me
Transform 'Wide' (Excel) data to 'Long' (R) format
Description
This function converts "wide" data (e.g. Excel tables) into a "long" (tidy) format. This is the essential first step to prepare your data for analysis and plotting in R.
Usage
f_long(
data,
measure_columns = NULL,
keep_cols = NULL,
category_name = "name",
value_name = "value",
category_labels = NULL,
...
)
Arguments
data |
The input data frame (e.g., from |
measure_columns |
(Optional) The columns containing your numeric measurements. These values are often the response variables, i.e. will end up on the Y-axis.
If NULL (default), the function will pivot ALL columns except those in |
keep_cols |
(Optional) The columns that identify your samples (IDs).
E.g., "SampleID", "PatientID", "Treatment", "Student number".
These are repeated for every measurement.
*If left empty, all non-measured columns are kept.*
Important: If |
category_name |
Name for the new column containing the headers. Default is "name". Choose something logical like "Timepoints", "Genes", or "Condition". |
value_name |
Name for the new column containing the numbers. Default is "value". Choose something logical like "Absorbance", "Ct_Value", or "Weight". |
category_labels |
(Optional) A character vector of new, readable names for your categories, i.e. the measure_columns that you entered.
Note: The order must match the order of |
... |
Additional arguments passed to |
Details
Research data in Excel or output from lab instruments often contains measurements side-by-side (in columns). Many R functions require measurements in a single column (rows). 'f_long' performs this translation for you.
It performs three actions in one go:
1. Selects your measurement columns ('measure_columns').
2. Keeps your important ID columns ('keep_cols') and removes the rest.
3. (Optional) Renames cryptic column headers into readable labels ('category_labels').
Value
A "Tidy" data frame (tibble) of class f_long.
Note
The custom class and attributes (f_long_value, f_long_category)
are used by the plot and summary methods. Be aware that most
dplyr or tidyr operations (e.g., filter, mutate)
will silently strip these attributes. If that happens, use f_scan or
f_summary directly with explicit column names instead.
Examples
# --- Example 1: Using the 'iris' dataset ---
# Scenario: The iris dataset looks clean, but it is actually "Wide".
# It has 4 columns of measurements side-by-side.
# To compare Sepal Length vs Width in a plot, we must stack them.
head(iris)
# Reshape: Combine Length and Width into one column and plot the data.
iris_long <- f_long(
data = iris,
measure_columns = c("Sepal.Length", "Sepal.Width"),
keep_cols = "Species",
category_name = "Sepal_Dimension", # Describes the grouping (What did we measure?)
value_name = "Size_cm", # Describes the value (What is the number?)
category_labels=c("Length", "Width") # New category labels
)
head(iris_long)
# Plot the data using f_scan
plot(iris_long)
# Make a f_summary table of iris_long
summary(iris_long)
# --- Example 2: Using the 'airquality' dataset ---
# Scenario: Pivot daily measurements of Wind and Temperature over time.
head(airquality)
weather_long <- f_long(
data = airquality,
measure_columns = c("Wind", "Temp"),
keep_cols = c("Month", "Day"),
category_name = "Climate_Parameter", # Descriptive name
value_name = "Reading_Value", # Generic name (since units differ: mph vs F)
values_drop_na = TRUE
)
head(weather_long)
Compare Two Statistical Models
Description
Compares two statistical models by calculating key metrics such as AIC, BIC, log-likelihood, R^2,
and others. Supports comparison of nested models using ANOVA tests.
Usage
f_model_compare(
model1,
model2,
nested = NULL,
model1_name = NULL,
model2_name = NULL,
digits = 3
)
Arguments
model1 |
The first model object. Supported classes include: |
model2 |
The second model object. Supported classes include: |
nested |
Logical. If |
model1_name |
Optional character string. A custom name for model1 in the output. If |
model2_name |
Optional character string. A custom name for model2 in the output. If |
digits |
Integer. The number of decimal places to round the output metrics. Defaults to |
Details
Calculate various metrics to assess model fit:
- AIC/BIC: Lower values indicate better fit.
- Log-Likelihood: Higher values (less negative) indicate better fit.
- R^2: Proportion of variance explained by the model.
- Adjusted R^2: R^2 penalized for the number of parameters (for linear models).
- Nagelkerke R^2: A pseudo-R^2 for generalized linear models (GLMs).
- Marginal/Conditional R^2: For mixed models, marginal R^2 reflects fixed effects, while conditional R^2 includes random effects.
- Sigma: Residual standard error.
- Deviance: Model deviance.
- SSE: Sum of squared errors.
- Parameters (df): Number of model parameters.
- Residual df: Residual degrees of freedom.
When nested models are detected or specified, model1 is always treated as the simpler model (fewer parameters). If the user passes the complex model first, the function automatically swaps them and issues a message.
If the models are nested, an ANOVA test is performed to compare them, and a p-value is provided to assess whether the more complex model significantly improves fit.
Value
A list of class "f_model_comparison" containing:
model1_name |
The name of the first model (always the simpler model when nested). |
model2_name |
The name of the second model (always the more complex model when nested). |
model1_class |
The class of the first model. |
model2_class |
The class of the second model. |
metrics_table |
A data frame summarizing metrics for both models, their differences, and (if applicable) the ANOVA p-value. |
formatted_metrics_table |
A formatted version of the metrics table for printing. |
anova_comparison |
The ANOVA comparison results if the models are nested and an ANOVA test was performed. |
nested |
Logical indicating whether the models were treated as nested. |
swapped |
Logical indicating whether the model order was swapped to ensure model1 is the simpler model. |
Supported Model Classes
The function supports the following model classes:
Linear models ("lm")
Generalized linear models ("glm")
Analysis of variance models ("aov")
Linear mixed models ("lmerMod")
Generalized linear mixed models ("glmerMod")
Nonlinear least squares models ("nls")
Note: Multi-stratum AOV models (fitted with Error()) are not supported
and will produce a warning.
Note
The function supports a variety of model types but may issue warnings if unsupported or partially supported classes are used.
For GLMs, Nagelkerke's R^2 is used as a pseudo-R^2 approximation, computed from the model's null deviance to avoid refitting a null model.
For mixed models, the function relies on the 'r.squaredGLMM' function from the 'MuMIn' package for R^2 calculation.
For NLS models, R^2 is provided for convenience but should be interpreted with caution, as it does not have the same statistical properties as in linear models.
The idea for this function (not the code) came from Dustin Fife's 'model.comparison' function in the super cool 'flexplot' package.
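The null-deviance shortcut for Nagelkerke's pseudo-R^2 can be sketched directly from a fitted GLM (illustrative; the package's internal computation may differ):

```r
# Fit a logistic regression and compute Nagelkerke's R^2 from its deviances,
# without refitting a separate null model.
m <- glm(am ~ wt, data = mtcars, family = binomial)
n <- nobs(m)
r2_cs  <- 1 - exp((m$deviance - m$null.deviance) / n)  # Cox & Snell pseudo-R^2
r2_nag <- r2_cs / (1 - exp(-m$null.deviance / n))      # Nagelkerke rescaling to [0, 1]
r2_nag
```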
Author(s)
Sander H. van Delden plantmind@proton.me
See Also
AIC, BIC, anova, logLik, r.squaredGLMM
Examples
# Example with linear models.
model1 <- lm(mpg ~ wt, data = mtcars)
model2 <- lm(mpg ~ wt + hp, data = mtcars)
comparison <- f_model_compare(model1, model2)
print(comparison)
# Example with GLMs.
model1 <- glm(am ~ wt, data = mtcars, family = binomial)
model2 <- glm(am ~ wt + hp, data = mtcars, family = binomial)
comparison <- f_model_compare(model1, model2)
print(comparison)
# Models can be passed in any order - the function auto-swaps if needed.
complex <- lm(mpg ~ wt + hp + qsec, data = mtcars)
simple <- lm(mpg ~ wt, data = mtcars)
comparison <- f_model_compare(complex, simple)
# model1 will be "simple", model2 will be "complex" in the output
# Example with custom model names (useful when calling from wrapper functions).
comparison <- f_model_compare(model1, model2,
model1_name = "Weight only",
model2_name = "Weight + Horsepower")
print(comparison)
Open a File with the Default Application
Description
Opens a specified file using the default application associated with its file type. It automatically detects the operating system (Windows, Linux, or macOS) and uses the appropriate command to open the file.
Usage
f_open_file(filepath)
Arguments
filepath |
A character string specifying the path to the file to be opened. The path can be absolute or relative. |
Details
- On Windows, the f_open_file() function uses shell.exec() to open the file.
- On Linux, it uses xdg-open via the system() function.
- On macOS, it uses open via the system() function.
If an unsupported operating system is detected, the function issues a message instead of attempting to open the file.
Value
Does not return a value; it is called for its side effect of opening a file.
Author(s)
Sander H. van Delden plantmind@proton.me
See Also
[shell.exec()], [system()]
Examples
# NOTE: The use of "if(interactive())" prevents this example from running
# during automated CRAN checks. This is necessary because the example
# opens a file, a behavior restricted by CRAN policies for automated
# testing. You don't need to use "if(interactive())" in your own scripts.
if(interactive()) {
# Open a PDF file.
f_open_file("example.pdf")
# Open an image file.
f_open_file("image.png")
# Open a text file.
f_open_file("document.txt")
}
Identify Outliers within Groups using Tukey's Fences
Description
'f_outliers()' scans numerical column(s) for outliers based on the Interquartile Range (IQR) method. It can detect outliers across the entire dataset or within specified subgroups.
It returns a dataframe containing only the outlier rows, preserving the original data structure
and adding a row_id column for traceability.
Usage
f_outliers(x, ...)
## S3 method for class 'numeric'
f_outliers(x, ...)
## S3 method for class 'integer'
f_outliers(x, ...)
## S3 method for class 'formula'
f_outliers(formula, data, ...)
## S3 method for class 'data.frame'
f_outliers(
x,
columns,
group_vars = NULL,
id_var = NULL,
coef = 1.5,
digits = NULL,
export_to_excel = FALSE,
close_generated_files = FALSE,
open_generated_files = interactive(),
save_as = NULL,
save_in_wdir = FALSE,
check_input = TRUE,
digits_excel = NULL,
allow_integer_decimal_mix = FALSE,
...
)
Arguments
x |
A data.frame or formula (dispatches to the right method). |
... |
Further arguments forwarded to |
formula |
A formula specifying the columns (left hand side) to be checked per subgroup(s) (right hand side).
More columns or groups can be added using |
data |
A |
columns |
The numerical columns to analyze if no formula is used. Can be entered
as a single character string (e.g., |
group_vars |
A character vector specifying the grouping variables in |
id_var |
(Optional) A character string naming a user-specific ID column (e.g., |
coef |
A number indicating the IQR multiplier. Default is
|
digits |
Integer. Number of decimal places for the R console output.
Default is |
export_to_excel |
Logical. If |
close_generated_files |
Logical. If |
open_generated_files |
Logical. Whether to open the generated output
files after creation. Defaults to |
save_as |
Character string. Custom path or filename for the Excel export.
|
save_in_wdir |
Logical. If |
check_input |
Logical. If |
digits_excel |
Integer. Number of decimal places for the Excel file cells. Default |
allow_integer_decimal_mix |
Logical. If |
Details
The Outlier Logic (Tukey's Method): An observation is flagged as an outlier if it falls outside the calculated fences:
- Lower Fence: Q1 - (coef \times IQR)
- Upper Fence: Q3 + (coef \times IQR)
Where Q1 is the 25th percentile, Q3 is the 75th percentile, and IQR = Q3 - Q1.
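The fence logic can be reproduced in base R for a single numeric vector (illustrative sketch, not the package's actual source):

```r
# Flag values outside Tukey's fences for one vector.
x <- c(2, 3, 3, 4, 5, 5, 6, 50)          # 50 is an obvious outlier
q    <- quantile(x, c(0.25, 0.75))       # Q1 and Q3
iqr  <- q[2] - q[1]
coef <- 1.5                              # standard multiplier
lower <- q[1] - coef * iqr
upper <- q[2] + coef * iqr
x[x < lower | x > upper]                 # flagged values: 50
```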
Output Structure:
The function returns a subset of the original data. It automatically adds a row_id
column, which corresponds to the row number in the original dataframe. This ensures you can
reliably map the outliers back to the source data.
Value
A data.frame containing the identified outlier rows. Returns NULL (with a message)
if no outliers are found.
See Also
f_remove_outliers to remove the rows identified by this function.
Examples
# --- Setup: Create Dummy Data ---
set.seed(42)
df <- data.frame(
Team = rep(c("A", "B"), each = 20),
Department = rep(c("Sales", "IT"), each = 10, times = 2),
Salary = rnorm(40, mean = 50000, sd = 2000),
Age = rnorm(40, mean = 35, sd = 3),
EmployeeID = paste0("E", sprintf("%03d", 1:40))
)
# Inject outliers
df[2, "Salary"] <- 57000 # Mild outlier (between 1.5 and 3.0 fence)
df[1, "Salary"] <- 100000 # Extreme high
df[35, "Salary"] <- 1000 # Extreme low
# --- Example 1: Basic detection (data.frame notation) ---
# Scan the entire dataset for Salary outliers (no grouping)
out <- f_outliers(df, columns = "Salary")
print(out)
# --- Example 2: Basic detection (formula notation) ---
# Equivalent to Example 1 using the formula interface
# LHS = column(s) to scan, RHS = grouping variable(s)
out <- f_outliers(Salary ~ 1, data = df)
print(out)
# --- Example 3: Grouped detection (both notations) ---
# Outliers are now evaluated *within* each Team separately,
# making detection sensitive to group-level distributions
# data.frame notation:
out <- f_outliers(df, columns = "Salary", group_vars = "Team")
# Formula notation (identical result):
out <- f_outliers(Salary ~ Team, data = df)
print(out)
# --- Example 4: Multi-column + multi-group (formula notation) ---
# Scan both Salary and Age for outliers, grouped by Team and Department.
# Returns a named list: one data.frame per column scanned.
out <- f_outliers(Salary + Age ~ Team + Department, data = df)
print(out) # prints both result tables
out$Salary # access Salary outliers directly
out$Age # access Age outliers directly
# --- Example 5: Strict detection with a custom ID column ---
# coef = 3.0 flags only extreme outliers (the "far out" Tukey fence).
# id_var places EmployeeID first in the output for easy identification.
# data.frame notation:
out <- f_outliers(df,
columns = "Salary",
group_vars = "Team",
id_var = "EmployeeID",
coef = 3.0)
# Formula notation (identical result):
out <- f_outliers(Salary ~ Team, data = df,
id_var = "EmployeeID",
coef = 3.0)
print(out)
# --- Example 6: Sensitivity comparison ---
# Compare how coef = 1.5 (standard) vs coef = 3.0 (extreme-only)
# affects the number of flagged rows.
out_standard <- f_outliers(Salary ~ Team, data = df, coef = 1.5)
out_extreme <- f_outliers(Salary ~ Team, data = df, coef = 3.0)
nrow(out_standard$output_df) # 3 -- catches mild + extreme outliers
nrow(out_extreme$output_df) # 2 -- catches extreme outliers only
# --- Example 7: Vector input ---
# Pass a column directly as a vector -- no data.frame needed.
# The column name is captured automatically from the call.
out <- f_outliers(df$Salary)
print(out)
# Works with coef and other parameters too
out <- f_outliers(df$Salary, coef = 3.0)
print(out)
# Inline vectors fall back to the column name "value"
out <- f_outliers(c(1, 2, 3, 4, 5, 100))
print(out)
Fancy Pander Table Output
Description
A wrapper around the pander function from the 'pander' package, designed to produce a fancy table output with specific formatting options.
Usage
f_pander(
table,
col_width = 10,
table_width = NULL,
limit_columns = NULL,
style = "multiline",
console = TRUE,
...
)
Arguments
table |
A data frame, matrix, or other table-like structure to be rendered. |
col_width |
Integer. Specifies the maximum number of characters allowed in table header columns before a line break is inserted. Defaults to |
table_width |
Integer or |
limit_columns |
Integer or |
style |
Character. Pander table style. Defaults to |
console |
Logical. Whether to process headers for console output. Defaults to |
... |
Additional arguments passed to the |
Details
This function sets several pander options to ensure that the table output is formatted in a visually appealing manner. The options set include:
- table.alignment.default: Aligns all columns to the left.
- table.alignment.rownames: Aligns row names to the left.
- keep.trailing.zeros: Keeps trailing zeros in numeric values.
- knitr.auto.asis: Ensures output is not automatically treated as 'asis'.
- table.caption.prefix: Removes the default "Table" prefix in captions.
- keep.line.breaks: Preserves line breaks in cell content.
- table.split.table: Controls table splitting (set to Inf if table_width is NULL or FALSE).
- table.split.cells: Inserts line breaks in headers every col_width characters.
This function requires [Pandoc](https://github.com/jgm/pandoc/releases/tag) (version 1.12.3 or higher), a universal document converter.
- Windows: Install Pandoc and ensure the installation folder (e.g., "C:/Users/your_username/AppData/Local/Pandoc") is added to your system PATH.
- macOS: If using Homebrew, Pandoc is typically installed in "/usr/local/bin". Alternatively, download the .pkg installer and verify that the binary's location is in your PATH.
- Linux: Install Pandoc through your distribution's package manager (commonly installed in "/usr/bin" or "/usr/local/bin") or manually, and ensure the directory containing Pandoc is in your PATH.
If Pandoc is not found, this function may not work as intended.
Value
None. The function is called for its side effects of setting 'pander' options and creates a pander formatted table in R Markdown.
Author(s)
Sander H. van Delden plantmind@proton.me
Examples
# Example usage of f_pander
df <- data.frame(
Name = c("Alice", "Bob", "Charlie"),
Age = c(25, 30, 35),
Score = c(88.5, 92.3, 85.0)
)
# Render the data frame as a fancy table
f_pander(df)
Normal Q-Q Plot with Confidence Bands
Description
This function creates a normal Q-Q plot for a given numeric vector and adds confidence bands to visualize the variability of the quantiles.
Usage
f_qqnorm(
x,
main = NULL,
ylab = NULL,
conf_level = 0.95,
col = NULL,
pch = NULL,
cex = NULL,
save_png = FALSE,
open_png = TRUE,
save_as = NULL,
save_in_wdir = FALSE,
width = 8,
height = 7,
units = "in",
res = 300,
...
)
Arguments
x |
A numeric vector of data values. |
main |
A character string specifying the title of the Q-Q plot. |
ylab |
A character string specifying the y-axis label. Default name is |
conf_level |
Numeric, between 0 and 1. Confidence level for the confidence bands. Default is 0.95 (95% confidence). |
col |
Optional colour for the points; default 'black'. |
pch |
Numeric. Optional plotting character (shape) for the points; default |
cex |
Numeric. Optional character expansion (point size); default |
save_png |
A logical value default |
open_png |
Logical. If |
save_as |
Character string specifying the output file path (without extension).
If a full path is provided, output is saved to that location.
If only a filename is given, the file is saved in |
save_in_wdir |
Logical. If |
width |
Numeric, png figure width default |
height |
Numeric, png figure height default |
units |
Character. Units for the png figure dimensions; default "in" (inches). |
res |
Numeric, png figure resolution default |
... |
Additional graphical parameters to be passed to the |
Details
The function calculates theoretical quantiles for a normal distribution and compares them with the sample quantiles of the input data.
It also computes confidence intervals for the order statistics using the Blom approximation and displays these intervals as shaded bands on the plot.
The reference line is fitted based on the first and third quartiles of both the sample data and theoretical quantiles.
To increase resolution you can use png(..., res = 600) or the 'RStudio' chunk setting, e.g. dpi = 600.
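The Blom plotting positions behind the theoretical quantiles can be sketched in base R (illustrative; f_qqnorm's internals may differ):

```r
# Hand-rolled Q-Q plot using Blom plotting positions.
set.seed(1)
x <- sort(rnorm(30))                   # ordered sample
n <- length(x)
p    <- (seq_len(n) - 3/8) / (n + 1/4) # Blom approximation for expected order statistics
theo <- qnorm(p)                       # theoretical normal quantiles
plot(theo, x,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
```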
Value
A Q-Q plot is created and the function returns this as a recordedplot.
Author(s)
Sander H. van Delden plantmind@proton.me
Examples
# Generate random normal data
set.seed(123)
data <- rnorm(100)
# Create a Q-Q plot with confidence bands
f_qqnorm(data)
# Customize the plot with additional graphical parameters
f_qqnorm(data, conf_level = 0.99, pch = 16, col = "blue")
Remove Outliers from Data
Description
'f_remove_outliers()' removes specific rows from a dataframe based on a list of identifiers.
It is designed to work seamlessly with the output of f_outliers, but can also
accept a custom vector of IDs.
Usage
f_remove_outliers(data, outliers, by = "row_id", verbose = TRUE)
Arguments
data |
A data.frame, tibble, or data.table containing the original data. |
outliers |
Either:
|
by |
A character string specifying the column to match on. Default is |
verbose |
Logical. If |
Details
Safe Deletion Logic:
This function performs an "anti-join" style filtering: it keeps rows where the identifier in
by is not found in the outliers list.
Handling Row IDs:
If you use the default by = "row_id" and your original data does not have a
column named "row_id", the function assumes you are referring to the intrinsic
row numbers of the data.frame, tibble, or data.table. It will temporarily generate IDs to
perform the deletion and then return the clean data with the original structure
(without adding a permanent row_id column to the result).
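The intrinsic-row-number case can be sketched as follows (illustrative, not the package's actual source):

```r
# Keep every row whose position is NOT in the outlier ID set.
remove_by_row_id <- function(data, outlier_ids) {
  data[!(seq_len(nrow(data)) %in% outlier_ids), , drop = FALSE]
}

clean <- remove_by_row_id(mtcars, c(1, 5))  # drop rows 1 and 5
nrow(clean)  # 30
```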
Value
An object of the same class as the input data (data.frame, tibble, or data.table)
with the specified outlier rows removed.
See Also
f_outliers to identify the rows to be removed.
Examples
# --- Setup: Create Dummy Data ---
set.seed(42)
df <- data.frame(
Team = rep(c("A", "B"), each = 20),
Department = rep(c("Sales", "IT"), each = 10, times = 2),
Salary = c(rnorm(19, 50000, 500), 100000,
rnorm(18, 50000, 500), 57000, 1000),
Age = c(rnorm(38, 35, 2), 90, 35),
EmployeeID = paste0("E", sprintf("%03d", 1:40)),
stringsAsFactors = FALSE
)
# row 20: extreme high Salary (Team A)
# row 39: mild Salary outlier at coef = 1.5 only
# row 40: extreme low Salary (Team B)
# row 39: extreme high Age
# --- Example 1: Basic two-step workflow (data.frame notation) ---
# The most common use case: find then remove in two lines.
bad_rows <- f_outliers(df, columns = "Salary")
clean_df <- f_remove_outliers(df, bad_rows)
nrow(df) # 40
nrow(clean_df) # 40 minus flagged rows
# --- Example 2: Basic two-step workflow (formula notation) ---
# Identical result to Example 1 using the formula interface.
bad_rows <- f_outliers(Salary ~ 1, data = df)
clean_df <- f_remove_outliers(df, bad_rows)
nrow(clean_df)
# --- Example 3: Grouped detection then removal (both notations) ---
# Outliers are identified *within* each Team separately before removal.
# data.frame notation:
bad_rows <- f_outliers(df, columns = "Salary", group_vars = "Team")
clean_df <- f_remove_outliers(df, bad_rows)
# Formula notation (identical result):
bad_rows <- f_outliers(Salary ~ Team, data = df)
clean_df <- f_remove_outliers(df, bad_rows)
nrow(clean_df)
# --- Example 4: Selective removal -- only act on a subset of outliers ---
# Find all flagged rows, but only remove the extreme high salaries.
# Step 1: Identify all Salary outliers grouped by Team
bad_rows <- f_outliers(Salary ~ Team, data = df)
all_flagged <- bad_rows$output_df
# Step 2: Filter to keep only the rows where Salary > 90000
really_bad <- all_flagged[all_flagged$Salary > 90000, ]
# Step 3: Remove only those rows -- low outlier (row 40) is preserved
clean_df <- f_remove_outliers(df, really_bad)
range(clean_df$Salary) # low outlier still present, high one is gone
# --- Example 5: Multi-column outlier removal ---
# f_outliers scans both Salary and Age; f_remove_outliers removes
# every row flagged by either column in one call.
# Formula notation:
bad_rows <- f_outliers(Salary + Age ~ Team, data = df)
clean_df <- f_remove_outliers(df, bad_rows)
# data.frame notation (identical result):
bad_rows <- f_outliers(df, columns = c("Salary", "Age"), group_vars = "Team")
clean_df <- f_remove_outliers(df, bad_rows)
nrow(clean_df) # rows flagged by Salary OR Age are removed
# --- Example 6: Strict detection + custom ID column ---
# coef = 3.0 flags only extreme outliers. EmployeeID is used
# as the matching key instead of the default row_id.
# Formula notation:
bad_rows <- f_outliers(Salary ~ Team, data = df,
id_var = "EmployeeID", coef = 3.0)
# data.frame notation (identical result):
bad_rows <- f_outliers(df, columns = "Salary", group_vars = "Team",
id_var = "EmployeeID", coef = 3.0)
# Remove by EmployeeID rather than row position
clean_df <- f_remove_outliers(df, bad_rows$output_df, by = "EmployeeID")
# Confirm the flagged employees are no longer in the clean data
bad_ids <- bad_rows$output_df$EmployeeID
any(clean_df$EmployeeID %in% bad_ids) # FALSE
Rename Specific Columns in a Data Frame
Description
Renames specific columns in a data frame based on a named vector (name_map). It ensures that only the specified columns are renamed, while others remain unchanged.
Usage
f_rename_columns(df, name_map)
Arguments
df |
A data frame whose columns are to be renamed. |
name_map |
A named vector where the names correspond to the current column names in |
Details
This function is particularly useful when you want to rename only a subset of columns in a data frame. It performs input validation to ensure that:
- name_map is a named vector.
- All names in name_map exist as column names in df.
If these conditions are not met, the function will throw an error with an appropriate message.
Value
A data frame with updated column names. Columns not specified in name_map remain unchanged.
Author(s)
Sander H. van Delden plantmind@proton.me
Examples
# Create a sample data frame.
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
# Define a named vector for renaming specific columns.
name_map <- c(a = "alpha", c = "gamma")
# Rename columns.
df <- f_rename_columns(df, name_map)
# View updated data frame.
print(df)
Rename Elements of a Vector Based on a Mapping
Description
Renames elements of a vector based on a named mapping vector. Elements that match the names in the mapping vector are replaced with their corresponding values, while elements not found in the mapping remain unchanged.
Usage
f_rename_vector(vector, name_map)
Arguments
vector |
A character vector containing the elements to be renamed. |
name_map |
A named vector where the names correspond to the elements in |
Details
This function iterates through each element of vector and checks if it exists in the names of name_map. If a match is found, the element is replaced with the corresponding value from name_map. If no match is found, the original element is retained. The result is returned as an unnamed character vector.
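The lookup can be sketched in a few lines of base R (illustrative equivalent, not the package's actual source):

```r
# Replace matched elements via a named lookup vector; keep the rest unchanged.
v <- c("Species", "Weight", "L")
name_map <- c(Species = "New_species_name", L = "Length_cm")

hits <- v %in% names(name_map)   # which elements have a mapping
v[hits] <- name_map[v[hits]]     # substitute the mapped values
unname(v)  # "New_species_name" "Weight" "Length_cm"
```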
Value
A character vector with updated element names. Elements not found in name_map remain unchanged.
Author(s)
Sander H. van Delden plantmind@proton.me
Examples
# Define a vector and a name map.
vector <- c("Species", "Weight", "L")
name_map <- c(Species = "New_species_name", L = "Length_cm")
# Rename elements of the vector.
updated_vector <- f_rename_vector(vector, name_map)
# View updated vector
print(updated_vector)
Perform a visual check on your data
Description
Creates a 3-panel diagnostic dashboard to check data distribution and assumptions. It can also output a data summary table and identify outliers.
Usage
f_scan(x, ...)
## S3 method for class 'formula'
f_scan(formula, data, ...)
## S3 method for class 'data.frame'
f_scan(
x,
columns,
group_vars = NULL,
summary = TRUE,
outliers = TRUE,
coef = 1.5,
limit_columns = 7,
fancy_names = NULL,
advice = FALSE,
close_generated_files = FALSE,
open_generated_files = interactive(),
output_type = "default",
save_as = NULL,
save_in_wdir = FALSE,
digits = NULL,
...
)
Arguments
x |
A data.frame or formula (dispatches to the right method). |
... |
Further arguments forwarded to |
formula |
A formula specifying the columns (right-hand side) to be summarized by at most 3 groups (left-hand side). More columns or groups can be added using |
data |
A 'data.frame', 'data.table', or 'tibble'. |
columns |
The numerical column(s) to summarize if no formula is used. Can be entered as a single character string (e.g., |
group_vars |
Character vector of up to 3 grouping variables (e.g., |
summary |
Logical. Show a summary table of the data. Default is |
outliers |
Logical. If |
coef |
Numeric. The multiplier for the Interquartile Range (IQR) used for outlier detection. Default |
limit_columns |
Integer or |
fancy_names |
Named character vector or |
advice |
Logical. If |
close_generated_files |
Logical. Closes open Excel or Word (NOT pdf) files before writing, depending on the output format. Works on Windows (taskkill), macOS (pkill) and Linux (pkill/soffice). Default |
open_generated_files |
Logical. Whether to open the generated output
files after creation. Defaults to |
output_type |
Character string specifying the output format. Default is
|
save_as |
Character string specifying the output file path (without extension).
If a full path is provided, output is saved to that location.
If only a filename is given, the file is saved in |
save_in_wdir |
Logical. If |
digits |
Integer. Decimal places for printed tables in 'pdf' and 'Word' output files. Default |
Details
f_scan automatically adapts the visualization based on the number of grouping variables provided:
- 0 groups: Univariate analysis (single density plot/boxplot).
- 1 group: Main grouping variable (X-axis and color).
- 2 groups: Adds Facet Wrapping.
- 3 groups: Adds Facet Grid (row vs. column).
This function requires Pandoc (version 1.12.3 or higher), a universal document converter (https://github.com/jgm/pandoc/releases/tag).
- Windows: Install Pandoc and ensure the installation folder (e.g., "C:/Users/your_username/AppData/Local/Pandoc") is added to your system PATH.
- macOS: If using Homebrew, Pandoc is typically installed in "/usr/local/bin". Alternatively, download the .pkg installer and verify that the binary's location is in your PATH.
- Linux: Install Pandoc through your distribution's package manager (commonly installed in "/usr/bin" or "/usr/local/bin") or manually, and ensure the directory containing Pandoc is in your PATH.
If Pandoc is not found, this function may not work as intended.
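As a quick sanity check, you can verify from R that Pandoc is discoverable before requesting 'Word' or 'pdf' output (a minimal sketch using base R and, if installed, the 'rmarkdown' package):

```r
# Returns the full path to the Pandoc binary, or "" if it is not on the PATH.
Sys.which("pandoc")

# If rmarkdown is installed, check the minimum version requirement directly.
if (requireNamespace("rmarkdown", quietly = TRUE)) {
  rmarkdown::pandoc_available("1.12.3")  # TRUE when a suitable Pandoc is found
}
```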
Value
A list of class f_scan containing plots, the summary table, and the outlier table. Using the option "output_type", it can also generate output in the form of: R Markdown code, 'Word', 'pdf', or 'Excel' files. Includes print, summary and plot methods for 'f_scan' objects.
Examples
# 1. Non-formula | No groups | Default output (default)
result <- f_scan(iris, columns = "Sepal.Length")
print(result)
# 2. Non-formula | 1 group | Console output
result <- f_scan(
mtcars,
columns = "mpg",
group_vars = "cyl",
output_type = "console"
)
# 3. Non-formula | 2 groups | Multiple columns | Excel output
result <- f_scan(
mtcars,
columns = c("mpg", "hp"),
group_vars = c("cyl", "am"),
outliers = TRUE,
coef = 1.5,
output_type = "excel",
save_as = "mtcars_scan"
)
# 4. Formula | 1 group | Strict outlier detection | Word output
result <- f_scan(
Sepal.Width ~ Species,
data = iris,
outliers = TRUE,
coef = 3.0,
output_type = "word",
save_as = "iris_scan"
)
# 5. Formula | 2 groups | Multiple columns | Fancy names
result <- f_scan(
mpg + hp + wt ~ vs + am,
data = mtcars,
fancy_names = c(mpg = "Fuel Efficiency", hp = "Horsepower",
wt = "Weight", vs = "Engine Type",
am = "Transmission"),
summary = TRUE
)
print(result)
# Create a small reproducible dataset with 3 grouping variables
set.seed(42)
plant_data <- data.frame(
weight = c(rnorm(60, 10, 2), rnorm(60, 14, 2)),
species = rep(c("A", "B"), each = 60),
treatment = rep(rep(c("control", "treated"), each = 30), 2),
batch = factor(rep(c("1", "2", "3"), 40))
)
# 6. Formula | 3 groups | Facet Grid
result <- f_scan(
weight ~ species + treatment + batch,
data = plant_data,
coef = 2.0,
digits = 2,
output_type = "word"
)
print(result)
# 7. With statistical advice
result <- f_scan(
Sepal.Length ~ Species,
data = iris,
advice = TRUE
)
print(result)
result[["Sepal.Length"]]$advice$y_type
Set Working Directory Based on Current File or Specified Path
Description
A wrapper around setwd() that sets the working directory to the location of the currently open file in 'RStudio' if no path is provided. If a path is specified, it sets the working directory to that path instead.
Usage
f_setwd(path = NULL)
Arguments
path |
A character string specifying the desired working directory. If |
Details
If path is not provided (NULL), this function uses the this.path package to determine the location of the currently open file and sets that as the working directory. The file must be saved for this to work properly.
If a valid path is provided, it directly sets the working directory to that path.
Value
None. The function is called for its side effects of changing the working directory.
Note
The function checks whether the currently open file is saved before setting its location as the working directory.
If the function is called from an unsaved script or directly from the console, an error will be thrown.
Author(s)
Sander H. van Delden plantmind@proton.me
Examples
# NOTE: The use of "if(interactive())" prevents this example from running
# during automated CRAN checks. This is necessary because the example
# requires to be run from an R script. You don't need to use
# "if(interactive())" in your own scripts.
if(interactive()) {
# Store the current working directory, so we can reset it after the example.
current_wd <- getwd()
print(current_wd)
# Run this command from a saved R script or R Notebook to set the working
# directory to the script's file location.
f_setwd()
# Restore your current working directory
f_setwd(current_wd)
}
Statistical Test Wizard
Description
Analyzes your data structure based on a formula and recommends an appropriate statistical test. It checks variable types, normality of residuals, and homogeneity of variance, and tests whether an f_boxcox transformation can fix non-normality. It recommends rfriend functions as primary code, with base R alternatives shown as a fallback.
Supports standard formulas including y ~ ., y ~ as.factor(x), and interaction
terms. Formulas with random effects (e.g. (1|ID)) are detected and handled separately.
Multivariate responses (e.g. cbind(y1, y2) ~ x) and transformed responses
(e.g. log(y) ~ x) are not supported.
Usage
f_stat_wizard(x, ...)
## S3 method for class 'formula'
f_stat_wizard(
formula,
data,
id_col = NULL,
run = FALSE,
plots = FALSE,
output_type = "word",
interactive = FALSE,
data_name = NULL,
...
)
## S3 method for class 'data.frame'
f_stat_wizard(
x,
formula,
id_col = NULL,
run = FALSE,
plots = FALSE,
output_type = "word",
interactive = FALSE,
data_name = NULL,
...
)
Arguments
x |
A formula (e.g., |
... |
Additional arguments (currently unused). |
formula |
A formula specifying the relationship (used with the data.frame method). |
data |
A data frame containing the variables referenced in the formula. |
id_col |
Character string. Name of the column identifying subjects/blocks
for paired or repeated-measures designs. When supplied, the wizard (a) verifies
the pairing structure (each subject should appear in every group exactly once),
(b) treats the design as paired/repeated measures, and (c) embeds the real column
name into the generated code. Omit for independent-samples designs. Default |
run |
Logical. If |
plots |
Logical. If |
output_type |
Character string specifying the output format of the
recommended rfriend function (when |
interactive |
Logical. If |
data_name |
Character string naming the data object used. Default |
Value
An object of class "f_stat_wizard": a list containing:
- formula: The formula used.
- formula_text: Character string of the formula.
- data_name: Name of the data object as passed by the user.
- n: Effective sample size (after NA removal).
- n_dropped: Number of rows removed due to missing values.
- paired: Logical. Whether a paired/repeated-measures design was detected (via id_col).
- id_col: Character. Name of the subject/block column supplied, or NULL.
- y_var: Name of the response variable.
- y_type: Detected type of the response: "binary", "count", "multinomial", "ratio_normal", "ratio_non_normal", "ratio_unknown", or "unsupported".
- x_vars: Character vector of explanatory variable names.
- x_types: Character vector of detected types ("nominal", "ordinal", "ratio").
- n_groups: Number of groups (for a single categorical X), or NULL.
- group_sizes: Table of per-group sample sizes, or NULL.
- is_ancova: Logical. TRUE if the model mixes nominal and ratio predictors.
- has_interaction: Logical. TRUE if interaction terms were detected.
- normality: A list with p_value (Shapiro-Wilk) and is_normal (logical or NA).
- variance: A list with test_used ("Levene" or "Bartlett"), p_value, and is_equal (logical).
- boxcox: A list with attempted (logical), can_fix (logical), and p_value_after (numeric or NA).
- overdispersion: A list with is_overdispersed (logical, from the DHARMa dispersion test) and p_value. Only meaningful for count data.
- recommended_call: A language object representing the rfriend function call, or NULL if no single function could be determined.
- run_result: The result of executing the recommended test (when run = TRUE), or NULL.
- histogram: A recordedplot from f_hist() (when plots = TRUE), or NULL.
- qqplot: A recordedplot from f_qqnorm() of model residuals (when plots = TRUE and Y is continuous), or NULL.
- report: Character vector of the human-readable report lines (used by print.f_stat_wizard).
Examples
# Formula interface (recommended)
f_stat_wizard(Sepal.Length ~ Species, data = iris)
# Data-first interface (backward compatible)
f_stat_wizard(iris, Sepal.Length ~ Species)
# Paired design -- supply the id_col that identifies matched subjects
f_stat_wizard(extra ~ group, data = sleep, id_col = "ID")
# With diagnostic plots
f_stat_wizard(Sepal.Length ~ Species, data = iris, plots = TRUE)
# Run the recommended test directly
result <- f_stat_wizard(Sepal.Length ~ Species, data = iris, run = TRUE)
result$run_result
# Inspect metadata
result <- f_stat_wizard(Sepal.Length ~ Species, data = iris)
result$y_type
result$normality
result$group_sizes
Summarize a Data Frame with Grouping Variables
Description
Computes summary statistics (n, mean, sd, etc.) for specified numerical column(s) in a data frame. The data can be analyzed as a whole or split by one or more grouping variables.
The function returns a formatted data frame and includes options to export the results directly to an 'Excel' file.
Usage
f_summary(x, ...)
## S3 method for class 'formula'
f_summary(x, data, ...)
## S3 method for class 'data.frame'
f_summary(
x,
columns,
group_vars = NULL,
show_name = TRUE,
show_n = TRUE,
show_mean = TRUE,
show_sd = TRUE,
show_se = TRUE,
show_min = TRUE,
show_max = TRUE,
show_median = TRUE,
show_Q1 = TRUE,
show_Q3 = TRUE,
show_skew = FALSE,
show_kurtosis = FALSE,
digits = NULL,
export_to_excel = FALSE,
close_generated_files = FALSE,
open_generated_files = interactive(),
save_as = NULL,
save_in_wdir = FALSE,
check_input = TRUE,
digits_excel = NULL,
allow_integer_decimal_mix = FALSE,
...
)
Arguments
x |
A data.frame or formula (dispatches to the right method). |
... |
Further arguments forwarded to |
data |
A 'data.frame', 'data.table', or 'tibble'. |
columns |
The numerical column(s) to summarize if no formula is used. Can be entered as a single character string (e.g., |
group_vars |
A character vector specifying the grouping variables in |
show_name |
Logical. Include variable name. Default |
show_n |
Logical. Include count ( |
show_mean |
Logical. Include mean. Default |
show_sd |
Logical. Include standard deviation. Default |
show_se |
Logical. Include standard error. Default |
show_min |
Logical. Include minimum value. Default |
show_max |
Logical. Include maximum value. Default |
show_median |
Logical. Include median. Default |
show_Q1 |
Logical. Include first quartile (25th percentile). Default |
show_Q3 |
Logical. Include third quartile (75th percentile). Default |
show_skew |
Logical. Include Skewness (measure of asymmetry). Default |
show_kurtosis |
Logical. Include Excess Kurtosis (measure of "tailedness"). Default |
digits |
Integer. Number of decimal places for the R console output.
Default is |
export_to_excel |
Logical. If |
close_generated_files |
Logical. If |
open_generated_files |
Logical. Whether to open the generated output
files after creation. Defaults to |
save_as |
Character string. Custom path or filename for the Excel export.
|
save_in_wdir |
Logical. If |
check_input |
Logical. If |
digits_excel |
Integer. Number of decimal places for the Excel file cells. Default |
allow_integer_decimal_mix |
Logical. If |
formula |
A formula specifying the columns (right hand side) to be summarized by groups (left hand side). More columns or groups can be added using |
Details
The function computes the following statistics:
- n: number of observations
- mean: arithmetic mean
- sd: standard deviation
- se: standard error (sd / \sqrt{n})
- min: minimum value
- max: maximum value
- median: median value
- Q1: 25th percentile
- Q3: 75th percentile
- skew: sample skewness (if requested)
- kurt: sample excess kurtosis (if requested)
skew stands for skewness, a measure of the asymmetry of a distribution around its mean. Values near 0 indicate approximate symmetry, while large positive or negative values indicate noticeable asymmetry.
- > 0: right-skewed (longer or heavier tail to the right).
- < 0: left-skewed (longer or heavier tail to the left).
kurt stands for excess kurtosis, which describes the tails and the peak of the distribution.
- 0: same tail heaviness as the normal distribution (mesokurtic).
- > 0: heavier tails than normal (leptokurtic), indicating frequent outliers.
- < 0: lighter tails than normal (platykurtic), indicating fewer (or less extreme) outliers than a normal distribution.
If group_vars are provided, the statistics are calculated for each group combination.
When export_to_excel = TRUE, the file is automatically generated.
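The two shape measures can be illustrated with a plain moment-based definition (a sketch only; the estimator used internally by f_summary may differ, for instance in bias correction):

```r
# Moment-based skewness and excess kurtosis (illustrative definitions).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / sd(x)^3
}
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / sd(x)^4 - 3
}
skewness(mtcars$hp)          # positive: hp has a longer right tail
excess_kurtosis(mtcars$hp)
```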
Value
A list of class f_summary containing the results data frame.
Author(s)
Sander H. van Delden plantmind@proton.me
Examples
# --- Example 1: Basic Usage (data.frame notation) ---
# Summarize "hp" grouped by "cyl"; columns and group_vars can be positional
summary_mtcars <- f_summary(mtcars, columns = "hp", group_vars = "cyl")
summary_mtcars <- f_summary(mtcars, "hp", "cyl") # shorthand equivalent
print(summary_mtcars)
# --- Example 2: Multiple Columns & Groups with Custom Toggles ---
# Summarize "hp" and "disp", grouped by "cyl" and "gear", hide Q1/Q3
summary_custom <- f_summary(mtcars,
columns = c("hp", "disp"),
group_vars = c("cyl", "gear"),
show_Q1 = FALSE,
show_Q3 = FALSE)
print(summary_custom)
# --- Example 3: Formula Notation ---
# Identical result to Example 2 using formula interface
# and export output to excel
summary_formula <- f_summary(hp + disp ~ cyl + gear,
data = mtcars,
show_Q1 = FALSE,
show_Q3 = FALSE,
export_to_excel = TRUE)
print(summary_formula)
# --- Example 4: Distributional Stats & Digits ---
# Add skewness and kurtosis, control rounding
summary_dist <- f_summary(Sepal.Length + Petal.Length ~ Species,
data = iris,
show_skew = TRUE,
show_kurtosis = TRUE,
digits = 3)
print(summary_dist)
# --- Example 5: Custom Print Formatting ---
summary_iris <- f_summary(iris, "Sepal.Length", group_vars = "Species")
print(summary_iris, col_width = 10, table_width = 70)
Perform multiple t-tests with optional data transformation, inspection and visualization.
Description
Performs One-sample, Two-sample (Independent), or Paired t-tests on a given dataset
with options for (Box-Cox/BestNormalize) transformations, normality tests, and
visualization. Several response parameters can be analysed in sequence
(formula interface). Additionally, a vector interface similar to stats::t.test()
is supported.
Usage
f_t_test(x, ...)
## S3 method for class 'formula'
f_t_test(
formula,
data = NULL,
paired = FALSE,
var.equal = NULL,
conf.level = NULL,
mu = 0,
alternative = "two.sided",
norm_plots = TRUE,
transformation = TRUE,
force_transformation = NULL,
alpha = 0.05,
intro_text = TRUE,
close_generated_files = FALSE,
open_generated_files = interactive(),
output_type = "default",
save_as = NULL,
save_in_wdir = FALSE,
...
)
## Default S3 method:
f_t_test(
x,
y = NULL,
paired = FALSE,
var.equal = NULL,
conf.level = NULL,
mu = 0,
alternative = "two.sided",
norm_plots = TRUE,
transformation = TRUE,
force_transformation = NULL,
alpha = 0.05,
intro_text = TRUE,
close_generated_files = FALSE,
open_generated_files = interactive(),
output_type = "default",
save_as = NULL,
save_in_wdir = FALSE,
...
)
Arguments
x |
Numeric vector of data values (one-sample or first group for two-sample),
or a formula of the form |
... |
For the formula method: additional arguments forwarded to
the row-filtering step. The arguments |
formula |
A formula specifying the model (alternative to using x/y).
More response variables can be added using |
data |
A data frame containing the variables when using the formula interface. |
paired |
Logical. If |
var.equal |
Logical or |
conf.level |
Numeric. Confidence level. Default is |
mu |
Numeric. The true value to test against: the mean (one-sample), the mean
of differences (paired), or the difference in means (two-sample). Default is 0.
For transformed analyses, |
alternative |
Character string. |
norm_plots |
Logical. If |
transformation |
Logical or character string. If |
force_transformation |
Character vector. Names of variables to transform regardless of normality results. |
alpha |
Numeric. Significance level. Default is |
intro_text |
Logical. If |
close_generated_files |
Logical. Closes open Excel/Word files before writing.
Default |
open_generated_files |
Logical. Whether to open the generated output
files after creation. Defaults to |
output_type |
Character string specifying the output format. Default is
|
save_as |
Character. Specific path/filename for output. |
save_in_wdir |
Logical. Save in working directory. Default |
y |
Optional numeric vector (second group) for two-sample tests if using the vector interface. Ignored when a formula is supplied. |
Value
An object of class 'f_t_test', a named list with one element per
response variable. Each element contains the t-test result, normality test
results, variance diagnostic results, transformation object (if applied),
and back-transformed confidence interval (if applicable).
Author(s)
Sander H. van Delden plantmind@proton.me
References
Delacre, M., Lakens, D., & Leys, C. (2017). Why psychologists should by default use Welch's t-test instead of Student's t-test. International Review of Social Psychology, 30(1), 92-101. doi:10.5334/irsp.82
Examples
# 1. Two-sample independent Welch's t-test (default)
f_t_test(mpg ~ am, data = mtcars, output_type = "console", norm_plots = FALSE)
# 2. Multiple response variables in one call
f_t_test(mpg + hp ~ am, data = mtcars, output_type = "console", norm_plots = FALSE)
# 3. One-sample t-test: test if mean mpg equals 20
f_t_test(mpg ~ 1, data = mtcars, mu = 20,
output_type = "console", norm_plots = FALSE)
# 4. Paired t-test (sleep dataset is already in AABB order)
f_t_test(extra ~ group, data = sleep, paired = TRUE,
output_type = "console", norm_plots = FALSE)
# 5. Vector interface: two-sample independent
group_auto <- mtcars$mpg[mtcars$am == 0]
group_manual <- mtcars$mpg[mtcars$am == 1]
f_t_test(group_auto, group_manual, output_type = "console", norm_plots = FALSE)
# 6. Vector interface: one-sample
f_t_test(mtcars$mpg, mu = 20, output_type = "console", norm_plots = FALSE)
# 7. Force Student's t-test (equal variances assumed)
f_t_test(mpg ~ am, data = mtcars, var.equal = TRUE,
output_type = "console", norm_plots = FALSE)
# 8. One-sided test
f_t_test(mpg ~ am, data = mtcars, alternative = "greater",
output_type = "console", norm_plots = FALSE)
# 9. Custom significance level (alpha = 0.01 is equivalent to conf.level = 0.99)
f_t_test(mpg ~ am, data = mtcars, alpha = 0.01,
output_type = "console", norm_plots = FALSE)
# 10. Box-Cox transformation with back-transformed CI
# The back-transformed CI estimates the MEDIAN, not the arithmetic mean.
result <- f_t_test(hp ~ am, data = mtcars, transformation = TRUE,
output_type = "console", norm_plots = FALSE)
result[["hp"]]$ci_backtransformed
# 11. One-sample with non-zero mu and back-transformation
f_t_test(hp ~ 1, data = mtcars, mu = 100, transformation = TRUE,
output_type = "console", norm_plots = FALSE)
# 12. BestNormalize transformation (set seed for reproducibility)
set.seed(123)
f_t_test(hp ~ am, data = mtcars, transformation = "bestnormalize",
output_type = "console", norm_plots = FALSE)
# 13. Force transformation regardless of normality
f_t_test(mpg + hp ~ am, data = mtcars, force_transformation = "mpg",
output_type = "console", norm_plots = FALSE)
# 14. Suppress transformation (diagnostic mode)
f_t_test(hp ~ am, data = mtcars, transformation = FALSE,
output_type = "console", norm_plots = FALSE)
# 15. Access return object fields directly
result <- f_t_test(mpg + hp ~ am, data = mtcars,
output_type = "default", norm_plots = FALSE, intro_text = FALSE)
result[["mpg"]]$t_test # standard htest object
result[["hp"]]$shapiro_res # Shapiro-Wilk result
result[["hp"]]$homog_p_bartlett # Bartlett p-value (diagnostic only)
result[["hp"]]$homog_p_levene # Levene p-value (diagnostic only)
result[["hp"]]$ci_backtransformed # back-transformed CI if transformed
Apply a black or white 'RStudio' Theme and Zoom Level
Description
This comes in handy when teaching: the function allows users to apply a "black" or "white" 'RStudio' theme and adjust the zoom level in the 'RStudio' IDE. It includes error handling for invalid inputs.
Usage
f_theme(color = "black", zlevel = 0)
Arguments
color |
A character string. The theme color to apply. Must be either |
zlevel |
A numeric value. The zoom level to apply, ranging from |
Details
The function performs the following actions:
- Applies the specified 'RStudio' theme:
  - "black": applies the "Tomorrow Night 80s" dark theme.
  - "white": applies the "Textmate (default)" light theme.
- Adjusts the zoom level in 'RStudio':
  - zlevel = 0: resets to the default zoom level.
  - zlevel = 1: zooms in once.
  - zlevel = 2: zooms in twice.
  - zlevel = 3: zooms in three times.
  - zlevel = 4: zooms in four times.
- Includes error handling to ensure valid inputs:
  - color must be a character string, either "black" or "white".
  - zlevel must be a numeric integer within the range 0 to 4. If a non-integer is provided, it will be rounded to the nearest integer with a warning.
Value
None. The function is called for its side effects of changing the 'RStudio' theme or zoom level; it applies changes directly to the 'RStudio' IDE and does not return a value.
Author(s)
Sander H. van Delden plantmind@proton.me
Examples
# NOTE: This example will change your RStudio theme, hence the 'Not run' warning.
## Not run:
# Apply a dark theme with zoom level 2:
f_theme(color = "black", zlevel = 2)
# Apply a black theme with maximum zoom level:
f_theme(color = "black", zlevel = 4)
# Apply the default light theme with the default zoom level:
f_theme(color = "white", zlevel = 0)
## End(Not run)
Perform multiple Wilcoxon rank sum and signed rank tests with inspection and visualization.
Description
Performs One-sample (Wilcoxon signed rank), Two-sample independent (Wilcoxon rank sum / Mann-Whitney U), or Paired (Wilcoxon signed rank) tests on a given dataset. Several response parameters can be analysed in sequence (formula interface). Additionally, a vector interface similar to stats::wilcox.test() is supported.
Usage
f_wilcox_test(x, ...)
## S3 method for class 'formula'
f_wilcox_test(
formula,
data = NULL,
paired = FALSE,
conf.level = NULL,
mu = 0,
alternative = "two.sided",
norm_plots = TRUE,
alpha = 0.05,
intro_text = TRUE,
close_generated_files = FALSE,
open_generated_files = TRUE,
output_type = "default",
save_as = NULL,
save_in_wdir = FALSE,
...
)
## Default S3 method:
f_wilcox_test(
x,
y = NULL,
paired = FALSE,
conf.level = NULL,
mu = 0,
alternative = "two.sided",
norm_plots = TRUE,
alpha = 0.05,
intro_text = TRUE,
close_generated_files = FALSE,
open_generated_files = TRUE,
output_type = "default",
save_as = NULL,
save_in_wdir = FALSE,
...
)
Arguments
x |
Numeric vector of data values (one-sample or first group for two-sample),
or a formula of the form |
... |
For the formula method: additional arguments forwarded to
the row-filtering step. The arguments |
formula |
A formula specifying the model (alternative to using x/y).
More response variables can be added using |
data |
A data frame containing the variables when using the formula interface. |
paired |
Logical. If |
Note on factor level order: The formula interface computes differences as |
# Set levels at creation
group <- factor(group, levels = c("pre", "post"))
# Or relevel an existing factor
group <- relevel(group, ref = "pre")
A reversed level order flips the sign of the estimate and CI but does not affect the W statistic or p-value. |
conf.level |
Numeric. Confidence level of the interval.
Default is |
mu |
Numeric. The hypothesised value of the pseudo-median (one-sample) or location shift (paired/two-sample) under H0. Default is 0. |
alternative |
Character string. |
norm_plots |
Logical. If |
alpha |
Numeric. Significance level. Default is |
intro_text |
Logical. If |
close_generated_files |
Logical. Closes open Excel/Word files before writing.
Works on Windows (taskkill), macOS (pkill) and Linux (pkill/soffice).
Default |
open_generated_files |
Logical. Whether to open the generated output
files after creation. Defaults to |
output_type |
Character string specifying the output format. Default is
|
save_as |
Character. Specific path/filename for output. |
save_in_wdir |
Logical. Save in working directory. |
y |
Optional numeric vector (second group) for two-sample tests if using the vector interface. Ignored when a formula is supplied. |
Value
An object of class 'f_wilcox_test'.
Median vs Pseudo-median
By default this function calls stats::wilcox.test(conf.int = TRUE),
which bases its confidence interval and hypothesis test on the
Hodges-Lehmann estimator, not the raw sample median. This is
standard behaviour of the Wilcoxon test, not something specific to this
function. This function explicitly labels the estimator for what it is,
because it is commonly mislabelled as "CI for the median" in textbooks
and software output.
The estimator works differently depending on the test type:
One-sample: The pseudo-median is the middle value of all possible pairwise averages of your data points (including each value paired with itself). For a perfectly symmetric distribution it equals the sample median; for skewed data the two can differ.
Paired: The paired differences (observation 1 minus observation 2 within each pair) are computed first, and the pseudo-median of those differences is estimated. This is conceptually a one-sample problem applied to the differences, not a comparison of two independent groups. The CI is for the pseudo-median of the differences, not for the difference between the two separate sample medians.
Two-sample independent: The location shift is the median
of all n_1 \times n_2 pairwise differences (one value from Group 1
minus one from Group 2). It answers: by how much does a randomly
chosen value from Group 1 tend to exceed a randomly chosen value from
Group 2? When both groups have the same distributional shape it equals
the raw difference in sample medians; when shapes differ, the two values can
diverge.
In all three cases the sample median(s) are reported separately for descriptive purposes only.
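For the one-sample case, the Hodges-Lehmann pseudo-median described above can be computed by hand from the Walsh averages (a sketch for intuition only; the reported CI comes from stats::wilcox.test(conf.int = TRUE), whose estimate can differ marginally when ties force the normal approximation):

```r
x <- mtcars$hp
# Walsh averages: all pairwise means (x_i + x_j) / 2 with i <= j.
W <- outer(x, x, "+") / 2
walsh <- W[upper.tri(W, diag = TRUE)]
median(walsh)                    # Hodges-Lehmann pseudo-median
median(x)                        # raw sample median; differs for skewed data
wilcox.test(x, conf.int = TRUE)$estimate
```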
Plot an f_bestNormalize object
Description
Plots diagnostics for an object of class f_bestNormalize.
Usage
## S3 method for class 'f_bestNormalize'
plot(x, which = 1:2, ask = FALSE, ...)
Arguments
x |
An object of class |
which |
Integer determining which graph to plot. Default is |
ask |
Logical. |
... |
Further arguments passed to or from other methods. |
Details
Plot method for f_bestNormalize objects
Value
This function is called for its side effect of generating plots and does not return a useful value. It invisibly returns 'NULL'.
Plot an f_boxcox object
Description
Create diagnostic plots of an object of class f_boxcox.
Usage
## S3 method for class 'f_boxcox'
plot(x, which = 1:3, ask = FALSE, ...)
Arguments
x |
An object of class |
which |
Integer determining which graph to plot. Default is |
ask |
Logical. |
... |
Further arguments passed to or from other methods. |
Details
Plot method for f_boxcox objects
Value
This function is called for its side effect of generating plots
and does not return a useful value. It invisibly returns 1.
Plot method for f_kruskal_test objects
Description
Displays the density plot and/or boxplot stored in an f_kruskal_test
object. Plots are only available when the original call used plot = TRUE.
Usage
## S3 method for class 'f_kruskal_test'
plot(x, which = c("distributions", "Boxplot"), ...)
Arguments
x |
An object of class |
which |
Character vector indicating which plots to show. Options are
|
... |
Additional arguments (currently ignored). |
Value
Returns x invisibly.
Examples
result <- f_kruskal_test(Sepal.Width ~ Species, data = iris,
output_type = "default")
plot(result) # both plots
plot(result, which = "Boxplot") # boxplot only
Plot method for f_lmer objects
Description
Replays the four-panel diagnostic figure (residuals vs fitted, Q-Q of
residuals, Q-Q of random-effect BLUPs, scale-location) produced by
f_lmer().
Usage
## S3 method for class 'f_lmer'
plot(x, ...)
Arguments
x |
An object of class |
... |
Additional arguments (currently ignored). |
Value
Returns x invisibly.
Plot method for f_long objects
Description
Automatically runs an f_scan diagnostic plot on data created by f_long.
Usage
## S3 method for class 'f_long'
plot(x, summary = TRUE, ...)
Arguments
x |
An object of class |
summary |
Logical. If |
... |
Additional arguments passed to |
Value
Returns the output of f_scan (an object of class f_scan) invisibly.
Predict method for f_boxcox objects
Description
Applies the fitted Box-Cox transformation to new data (forward transform),
or reverses it back to the original scale (inverse transform). This is
useful for transforming hypothesis test parameters (e.g., mu) to the
transformed scale, or for back-transforming confidence intervals to the
original scale.
Usage
## S3 method for class 'f_boxcox'
predict(object, newdata, inverse = FALSE, ...)
Arguments
object |
An object of class |
newdata |
A numeric vector of values to transform. For the forward
transform ( |
inverse |
Logical. If |
... |
Further arguments passed to or from other methods (currently unused). |
Details
The forward transformation applies the standard Box-Cox formula:
y(\lambda) = \begin{cases}
\frac{y^\lambda - 1}{\lambda}, & \lambda \neq 0 \\
\log(y), & \lambda = 0
\end{cases}
The inverse transformation reverses this process to recover the original scale:
y = \begin{cases}
(y(\lambda) \cdot \lambda + 1)^{1/\lambda}, & \lambda \neq 0 \\
\exp(y(\lambda)), & \lambda = 0
\end{cases}
Note on inverse validity: When \lambda > 0, not all
transformed-scale values have a valid inverse. If
y(\lambda) \cdot \lambda + 1 \leq 0, the result is undefined and
NaN is returned with a warning.
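The two formulas can be written out in plain R to make the round trip and the undefined region concrete (an illustration of the math only; in practice use predict() on the fitted f_boxcox object, which stores lambda for you):

```r
# Forward Box-Cox transform for a given lambda.
bc_forward <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
# Inverse transform; values with lambda * z + 1 <= 0 have no valid inverse.
bc_inverse <- function(z, lambda) {
  if (lambda == 0) return(exp(z))
  base <- lambda * z + 1
  ifelse(base > 0, base^(1 / lambda), NaN)
}
z <- bc_forward(100, lambda = 0.5)   # forward transform of y = 100
bc_inverse(z, lambda = 0.5)          # round trip returns 100
bc_inverse(-5, lambda = 0.5)         # NaN: outside the invertible range
```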
Value
A numeric vector of the same length as newdata, containing
either the forward-transformed or back-transformed values.
Examples
# Assuming mtcars is available and f_boxcox is loaded
bc <- f_boxcox(mtcars$hp)
# Forward: transform a hypothesis value (mu) to the Box-Cox scale
mu <- 100
mu_transformed <- predict(bc, newdata = mu)
# Inverse: back-transform a confidence interval to the original scale
ci_transformed <- c(5.5, 6.8)
predict(bc, newdata = ci_transformed, inverse = TRUE)
# Round-trip sanity check: recovers mu (100, up to floating-point error)
predict(bc, newdata = mu_transformed, inverse = TRUE)
Print method for f_outliers objects
Description
Prints a formatted summary table to the console.
Usage
## S3 method for class 'f_outliers'
print(
x,
col_width = 6,
table_width = 90,
digits = 2,
allow_integer_decimal_mix = FALSE,
...
)
Arguments
x: Object of class f_outliers.
col_width: Integer. Maximum number of characters in a column header before a line break is inserted. Default is 6.
table_width: Integer or NULL. Maximum width of the printed table in characters. Default is 90.
digits: Integer. Number of decimal digits to use in formatting. Default is 2.
allow_integer_decimal_mix: Logical. If TRUE, integer and decimal values may be mixed within a column when formatting. Default is FALSE.
...: Additional arguments passed to the underlying print methods.
Value
Invisibly returns 1.
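For example, widening the table and increasing the precision; the f_outlier() call is assumed from the package description and its exact arguments may differ:

```r
# Assumed interface: f_outlier() scans a data frame for outliers.
out <- f_outlier(my_data)
print(out, digits = 3, table_width = 120)
```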
Print method for f_scan objects
Description
Print method for f_scan objects
Summary method for f_scan objects
Plot method for f_scan objects
Usage
## S3 method for class 'f_scan'
print(
x,
summary = TRUE,
outliers = TRUE,
boxplot = TRUE,
histogram = TRUE,
qqplot = TRUE,
main_plot = TRUE,
advice = TRUE,
digits = 3,
...
)
## S3 method for class 'f_scan'
summary(object, digits = 3, ...)
## S3 method for class 'f_scan'
plot(x, boxplot = TRUE, histogram = TRUE, qqplot = TRUE, main_plot = TRUE, ...)
Arguments
x: An f_scan object.
summary: Logical. Print the summary statistics table? Default is TRUE.
outliers: Logical. Print the outlier table? Default is TRUE.
boxplot, histogram, qqplot, main_plot: Logical. Which plots to render? Default is TRUE for each.
advice: Logical. Print statistical test recommendations? Default is TRUE.
digits: Integer. Decimal places for printed tables. Default is 3.
...: Further arguments passed to or from other methods. Currently unused by the f_scan methods.
object: An f_scan object to make a summary table from.
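A short sketch of the three methods together; f_scan() is assumed here to accept a data frame, as suggested by the package description:

```r
sc <- f_scan(mtcars)        # assumed: f_scan() accepts a data frame
print(sc, advice = FALSE)   # print everything except the recommendations
summary(sc, digits = 2)     # summary table only
plot(sc, qqplot = FALSE, histogram = FALSE)  # render a subset of the plots
```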
Print method for f_stat_wizard
Description
Print method for f_stat_wizard
Usage
## S3 method for class 'f_stat_wizard'
print(x, plots = TRUE, ...)
Arguments
x: An object of class f_stat_wizard.
plots: Logical. If TRUE, the stored plots are displayed again. Default is TRUE.
...: Additional arguments (ignored).
Print method for f_summary objects
Description
Prints a formatted summary table to the console.
Usage
## S3 method for class 'f_summary'
print(
x,
col_width = 6,
table_width = 90,
digits = 2,
allow_integer_decimal_mix = FALSE,
...
)
Arguments
x: Object of class f_summary.
col_width: Integer. Maximum number of characters in a column header before a line break is inserted. Default is 6.
table_width: Integer or NULL. Maximum width of the printed table in characters. Default is 90.
digits: Integer. Number of decimal digits to use in formatting. Default is 2.
allow_integer_decimal_mix: Logical. If TRUE, integer and decimal values may be mixed within a column when formatting. Default is FALSE.
...: Additional arguments passed to the underlying print methods.
Value
Invisibly returns 1.
Summary method for f_long objects
Description
Automatically runs the f_summary function on data created by f_long
using the attributes stored in the object.
Usage
## S3 method for class 'f_long'
summary(object, ...)
Arguments
object: An object of class f_long.
...: Additional arguments passed to f_summary.
Value
Returns the summary table (usually a data frame or tibble) produced by f_summary.