---
title: "Getting started with toolero"
author: "Erwin Lares"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with toolero}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Background and motivation

`toolero` grew out of a recurring observation made while teaching and
supporting researchers at UW-Madison: the habits that make a project
reproducible, shareable, and maintainable are easiest to adopt at the very
beginning — and hardest to retrofit once a project is already underway.

The package is heavily influenced by the workflows taught in workshops run by
[The Carpentries](https://carpentries.org/) and the
[UW-Madison Libraries](https://www.library.wisc.edu/). Those workshops
emphasize consistent project organization, version control, and reproducible
data practices as foundational skills — not advanced topics. `toolero` tries
to operationalize those principles into a small set of functions that reduce
the friction of doing the right thing from the start.

The theming and branding support in `toolero` is specifically tailored to
UW-Madison's Research Computing and Instrumentation (RCI) unit, whose
Quarto-based reporting templates are baked into the package as defaults. If
you are not at UW-Madison, the branding files are optional — the rest of the
package works independently of them.

---

## Who is this for?

`toolero` is designed for researchers and analysts who:

- Work primarily in R and use RStudio as their IDE
- Write reports or analyses in Quarto
- Want consistent, reproducible project structure without having to think
  about it every time
- May need to publish content to the UW-Madison Knowledge Base

The package is intentionally small. It does not try to be comprehensive. It
tries to make the right defaults easy to reach for from the first line of
code.

---

## Installation

You can install `toolero` from CRAN:

```{r}
#| eval: false
install.packages("toolero")
```

Or install the development version from GitHub:

```{r}
#| eval: false
pak::pak("erwinlares/toolero")
```

---

## Project setup: `init_project()` and `create_qmd()`

These two functions are designed to be used together, in order. `init_project()`
creates the scaffold; `create_qmd()` populates it with a working Quarto
document.

### Starting with `init_project()`

Starting a new R project usually means the same manual steps every time:
create a folder, set up an RStudio project, create subdirectories for data
and scripts, initialize `renv`, initialize `git`. None of these steps is hard
on its own, but skipping any of them — especially early on — tends to create
friction later.

`init_project()` handles all of this in a single call:

```{r}
#| eval: false
library(toolero)

init_project(path = "~/Documents/my-project")
```

This creates a new RStudio project at the specified path with the following
folder structure already in place:

```
my-project/
├── data/         # input data
├── data-raw/     # original, unprocessed data
├── R/            # reusable functions
├── scripts/      # analysis scripts
├── plots/        # generated visualizations
├── images/       # static images and assets
├── results/      # processed outputs and tables
└── docs/         # notes, manuscripts, Quarto documents
```
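
For reference, the layout above can be approximated by hand in base R. This is only a sketch of the manual steps that `init_project()` replaces, not its implementation. `scaffold_folders()` is a hypothetical helper, and it skips the RStudio project file, `renv`, and `git`.

```r
# Hypothetical helper: creates the default folders by hand.
# Not part of toolero; it does not create an .Rproj file or set up renv/git.
scaffold_folders <- function(path, extra_folders = character()) {
  defaults <- c("data", "data-raw", "R", "scripts",
                "plots", "images", "results", "docs")
  folders <- file.path(path, unique(c(defaults, extra_folders)))
  for (folder in folders) {
    # recursive = TRUE also creates `path` itself if it is missing
    dir.create(folder, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(folders)
}

# Try it in a temporary directory so nothing real is touched.
demo_dir <- file.path(tempdir(), "my-project")
scaffold_folders(demo_dir, extra_folders = "notebooks")
list.files(demo_dir)
```

Even this reduced version is easy to get slightly different from one project to the next, which is the inconsistency a single scaffolding call is meant to remove.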

> **Why this structure?** The folder layout is opinionated but not arbitrary.
> Separating `data/` from `data-raw/` makes it clear which files are original
> and which have been processed. Keeping `R/` distinct from `scripts/`
> encourages moving reusable logic into functions over time, which is a natural
> step toward more maintainable code.

By default, `init_project()` also initializes `renv` and `git`. This means
the project is reproducible and version-controlled from the first commit.

> **Why `renv` and `git` by default?** `renv` ensures that the packages your
> project depends on are recorded and reproducible. `git` provides a full
> history of changes. Both are much easier to set up at the start than to
> retrofit later.

If your project needs folders beyond the defaults:

```{r}
#| eval: false
init_project(
  path          = "~/Documents/my-project",
  extra_folders = c("notebooks", "presentations")
)
```

To apply UW-Madison RCI branding assets to the project:

```{r}
#| eval: false
init_project(
  path        = "~/Documents/my-project",
  uw_branding = TRUE
)
```

This creates an `assets/` folder and populates it with `styles.css`,
`header.html`, and `rci-banner.png` — the same assets used in the Quarto
template scaffolded by `create_qmd()`.

### Adding a Quarto document with `create_qmd()`

Once the project exists, `create_qmd()` adds a working Quarto document to it:

```{r}
#| eval: false
create_qmd(path = "~/Documents/my-project", filename = "analysis.qmd")
```

This creates:

- `analysis.qmd`: a Quarto document with a fully populated YAML header,
  input resolution across three execution contexts via
  `detect_execution_context()`, and a sample analysis using the Palmer
  Penguins dataset
- `data/sample.csv`: sample data to develop against immediately
- `assets/styles.css` and `assets/header.html`: UW-Madison RCI branding
- `_quarto.yml`: a project file with a post-render hook that runs `purl.R`
- `purl.R`: a script that extracts the R code from the `.qmd` into a companion
  `.R` file automatically on every render

> **Why the purl hook?** Having a plain `.R` companion to your `.qmd` is
> useful for sharing the analysis as a script, running it on a remote cluster,
> or archiving the code independently of the document. The hook runs
> automatically so you never have to remember to extract it manually.

To pre-populate the YAML header with your own metadata:

```{r}
#| eval: false
create_qmd(
  path      = "~/Documents/my-project",
  filename  = "analysis.qmd",
  yaml_data = "~/my-metadata.yml"
)
```

Where `my-metadata.yml` might look like:

```yaml
title: "My Analysis"
author:
  - name: "Your Name"
    affiliation: "UW-Madison"
    email: "you@wisc.edu"
```

Any keys present in the YAML file overwrite the corresponding placeholders in
the template. Placeholders whose keys are absent from the file are left as-is.
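
The override rule can be illustrated with plain R lists. This is a sketch of the behaviour described above, not the code `create_qmd()` runs. The placeholder values and the `apply_overrides()` helper are made up for illustration, and only top-level keys are handled.

```r
# Hypothetical helper illustrating "present keys overwrite, absent keys stay".
apply_overrides <- function(template, overrides) {
  # Replace only the top-level keys that the override list supplies.
  template[names(overrides)] <- overrides
  template
}

# Made-up placeholders standing in for the template's YAML header.
template <- list(
  title       = "Untitled",
  description = "A short summary",
  categories  = c("analysis")
)

# Stand-in for the parsed contents of my-metadata.yml.
overrides <- list(title = "My Analysis")

header <- apply_overrides(template, overrides)
header$title        # "My Analysis"
header$description  # "A short summary" (untouched placeholder)
```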

---

## Working with data: `read_clean_csv()` and `write_by_group()`

These two functions address common friction points in day-to-day data work.
They are general-purpose utilities — useful in any R project, not just ones
set up with `toolero`.

### Reading data with `read_clean_csv()`

`read_clean_csv()` combines `readr::read_csv()` and `janitor::clean_names()`
into a single call:

```{r}
#| eval: false
data <- read_clean_csv("data/my-file.csv")
```

Column names are automatically converted to snake_case: lowercase, with
spaces and punctuation replaced by underscores. The result is consistent,
predictable, and tidyverse-friendly. A column called `First Name` becomes
`first_name`, and `Q1 Revenue ($)` becomes `q1_revenue`.

By default, column type messages from `readr` are suppressed. Set
`verbose = TRUE` to see them:

```{r}
#| eval: false
data <- read_clean_csv("data/my-file.csv", verbose = TRUE)
```
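
To see what the wrapper saves you from typing, here is a rough equivalent built directly on the two packages. It is a sketch, not toolero's source: `read_clean_csv_sketch()` is a hypothetical name, and mapping `verbose` onto readr's `show_col_types` argument is an assumption about how the message is suppressed.

```r
# Hypothetical stand-in for read_clean_csv(); not toolero's actual source.
read_clean_csv_sketch <- function(file, verbose = FALSE) {
  # Assumption: `verbose` simply toggles readr's column-type message.
  raw <- readr::read_csv(file, show_col_types = verbose)
  janitor::clean_names(raw)
}

# A tiny file with untidy headers, written to a temporary location.
tmp <- tempfile(fileext = ".csv")
writeLines(c("First Name,Body Mass (g)", "Ada,3750"), tmp)

names(read_clean_csv_sketch(tmp))
```

Running this requires `readr` and `janitor` to be installed.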

### Splitting data by group with `write_by_group()`

When a data frame contains multiple groups that need to be written to separate
files, `write_by_group()` handles the split and the write in a single call:

```{r}
#| eval: false
write_by_group(
  data       = penguins,
  group_col  = "species",
  output_dir = "results/by-species"
)
```

Output filenames are derived from the group values and sanitized for use as
file names: they are converted to lowercase, and spaces and special characters
are replaced by dashes. The `Chinstrap` group is written to `chinstrap.csv`;
a group called `Palmer Penguins` would be written to `palmer-penguins.csv`.

To also write a manifest listing the output files, group values, and row
counts:

```{r}
#| eval: false
write_by_group(
  data       = penguins,
  group_col  = "species",
  output_dir = "results/by-species",
  manifest   = TRUE
)
```
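
The split, sanitize, and write steps can be sketched in base R as follows. This is an illustration of the idea under stated assumptions, not toolero's implementation: the function names, the exact sanitizing rule, and the manifest column names are all hypothetical.

```r
# Hypothetical base-R stand-in for write_by_group(); toolero's own code may differ.
sanitize_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "-", x)   # runs of spaces/special characters -> dash
  gsub("^-+|-+$", "", x)            # trim leading and trailing dashes
}

write_by_group_sketch <- function(data, group_col, output_dir) {
  dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
  pieces <- split(data, data[[group_col]], drop = TRUE)
  files  <- paste0(sanitize_name(names(pieces)), ".csv")
  for (i in seq_along(pieces)) {
    write.csv(pieces[[i]], file.path(output_dir, files[i]), row.names = FALSE)
  }
  # Manifest: one row per output file (column names are an assumption).
  data.frame(
    group  = names(pieces),
    file   = files,
    n_rows = unname(vapply(pieces, nrow, integer(1)))
  )
}

# Small made-up data frame, written to a temporary directory.
birds <- data.frame(
  species = c("Adelie", "Adelie", "Chinstrap", "Palmer Penguins"),
  mass_g  = c(3750, 3800, 3500, 4000)
)
out_dir  <- file.path(tempdir(), "by-species")
manifest <- write_by_group_sketch(birds, "species", out_dir)
manifest
```

One thing this sketch does not handle is two group values that sanitize to the same file name, in which case the later file would overwrite the earlier one.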

---

## Execution context: `detect_execution_context()`

R code often needs to behave differently depending on where it is running —
interactively in RStudio, during a `quarto render`, or as a batch `Rscript`
job on a remote cluster. `detect_execution_context()` identifies which of
these three environments is active and returns one of `"interactive"`,
`"quarto"`, or `"rscript"`.

The canonical use case is resolving input file paths portably:

```{r}
#| eval: false
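# One of "interactive", "quarto", or "rscript"
context <- detect_execution_context()

input_file <- switch(context,
  interactive = "data/sample.csv",                   # local development default
  quarto      = params$input_file,                   # `params` declared in the YAML header
  rscript     = commandArgs(trailingOnly = TRUE)[1]  # first command-line argument
)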
```

This pattern is built into the template scaffolded by `create_qmd()`, so you
get it for free without having to write it yourself.
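
For intuition, one way such a three-way check could be written is sketched below. It is not toolero's detection logic, which is not shown in this vignette. The sketch assumes that the `knitr.in.progress` option (set by knitr while it is rendering) is a good enough signal for a Quarto render, and `sketch_execution_context()` is a hypothetical name.

```r
# Hypothetical sketch; toolero's real detection logic may differ.
# Arguments default to the live session but can be overridden to explore.
sketch_execution_context <- function(
    knitting       = isTRUE(getOption("knitr.in.progress")),
    is_interactive = interactive()) {
  if (knitting) {
    "quarto"        # knitr sets this option while a document is being rendered
  } else if (is_interactive) {
    "interactive"
  } else {
    "rscript"
  }
}

sketch_execution_context(knitting = TRUE,  is_interactive = FALSE)  # "quarto"
sketch_execution_context(knitting = FALSE, is_interactive = TRUE)   # "interactive"
sketch_execution_context(knitting = FALSE, is_interactive = FALSE)  # "rscript"
```

Note that this signal cannot tell a Quarto render apart from a plain R Markdown render, since both go through knitr.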

---

## Knowledge Base export: `generate_kb_xml()`

> **This section is relevant only if you publish content to the UW-Madison
> Knowledge Base.** If you do not, you can safely skip it.

The UW-Madison Knowledge Base requires content to be submitted as XML with
all visual assets embedded in the HTML body. `generate_kb_xml()` automates
this process entirely.

```{r}
#| eval: false
generate_kb_xml(
  html_path  = "docs/analysis.html",
  output_dir = "exports"
)
```

The function:

1. Infers the source `.qmd` from the HTML path (or accepts it explicitly via
   `qmd_path`)
2. Re-renders the document with `embed-resources: true` so all CSS, images,
   and JavaScript are self-contained
3. Extracts metadata from the `.qmd` YAML header — `title` → `kb_title`,
   `description` → `kb_summary`, `categories` → `kb_keywords`
4. Produces a `.xml` file ready for direct KB import

This is why the `description` and `categories` fields in the `create_qmd()`
template matter — they flow through automatically into the KB article metadata
without any extra work.
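
Steps 1 and 3 of the list above can be pictured with a small base-R sketch. Both helpers are hypothetical and are not how `generate_kb_xml()` is implemented. In particular, joining the categories into one comma-separated string is an assumption about the keyword format.

```r
# Hypothetical helpers illustrating steps 1 and 3; not toolero's implementation.
infer_qmd_path <- function(html_path) {
  # Swap the file extension, keeping the directory and base name.
  paste0(tools::file_path_sans_ext(html_path), ".qmd")
}

map_kb_metadata <- function(meta) {
  list(
    kb_title    = meta$title,
    kb_summary  = meta$description,
    # Assumption: keywords are joined into one comma-separated string.
    kb_keywords = paste(meta$categories, collapse = ", ")
  )
}

infer_qmd_path("docs/analysis.html")   # "docs/analysis.qmd"

# Made-up metadata standing in for a parsed .qmd YAML header.
meta <- list(
  title       = "My Analysis",
  description = "Summary of the penguin data",
  categories  = c("penguins", "reproducibility")
)
kb <- map_kb_metadata(meta)
kb$kb_keywords                          # "penguins, reproducibility"
```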

> **When importing into the KB**, select the
> *Decode HTML entity in body content* option.

---

## Citation

If you use `toolero` in your work, please cite it:

```{r}
#| eval: false
citation("toolero")
```