Proper dataset documentation is crucial for reproducible research and effective data sharing. The {qtkit} package provides two main functions to help standardize and automate the documentation process:
create_data_origin()
: Creates standardized metadata about data(set) sources

create_data_dictionary()
: Generates the scaffolding for detailed variable-level documentation, or can use AI to generate descriptions to be reviewed and updated as necessary

Let's start by documenting the built-in mtcars dataset:
# Load the packages used throughout
library(qtkit)
library(fs)     # file_temp()
library(dplyr)  # glimpse(), mutate()
library(readr)  # write_csv()

# Create a temporary file for our documentation
origin_file <- file_temp(ext = "csv")
# Create the origin documentation template
origin_doc <- create_data_origin(
file_path = origin_file,
return = TRUE
)
#> Data origin file created at `file_path`.
# View the template
origin_doc |>
glimpse()
#> Rows: 8
#> Columns: 2
#> $ attribute <chr> "Resource name", "Data source", "Data sampling frame", "Da…
#> $ description <chr> "The name of the resource.", "URL, DOI, etc.", "Language, …
The template provides fields for essential metadata. You can either open the CSV file in a spreadsheet editor or fill it out programmatically, as shown below.
Here's how you might fill it out for mtcars:
origin_doc |>
mutate(description = c(
"Motor Trend Car Road Tests",
"Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.",
"US automobile market, passenger vehicles",
"1973-74",
"Built-in R dataset (.rda)",
"Single data frame with 32 observations of 11 variables",
"Public Domain",
"Citation: Henderson and Velleman (1981)"
)) |>
write_csv(origin_file)
Create a basic data dictionary without AI assistance:
# Create a temporary file for our dictionary
dict_file <- file_temp(ext = "csv")
# Generate dictionary for iris dataset
iris_dict <- create_data_dictionary(
data = iris,
file_path = dict_file
)
# View the results
iris_dict |>
glimpse()
#> Rows: 5
#> Columns: 4
#> $ variable <chr> "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Widt…
#> $ name <chr> NA, NA, NA, NA, NA
#> $ type <chr> "numeric", "numeric", "numeric", "numeric", "factor"
#> $ description <chr> NA, NA, NA, NA, NA
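Without AI assistance, the `name` and `description` columns are left as `NA` for you to complete, in the same way as the origin template. The descriptions below are illustrative wording, not output from the package:

```r
# Fill in human-readable names and descriptions, then save the dictionary
iris_dict |>
  mutate(
    name = c("Sepal Length", "Sepal Width", "Petal Length",
             "Petal Width", "Species"),
    description = c(
      "Length of the sepal in centimeters",
      "Width of the sepal in centimeters",
      "Length of the petal in centimeters",
      "Width of the petal in centimeters",
      "Iris species: setosa, versicolor, or virginica"
    )
  ) |>
  write_csv(dict_file)
```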
If you have an OpenAI API key, you can generate more detailed descriptions:
# Not run - requires API key
Sys.setenv(OPENAI_API_KEY = "your-api-key")
iris_dict_ai <- create_data_dictionary(
data = iris,
file_path = dict_file,
model = "gpt-4",
sample_n = 5
)
Example output might look like:
#> # A tibble: 2 × 4
#> variable name type description
#> <chr> <chr> <chr> <chr>
#> 1 Sepal.Length Sepal Length numeric Length of the sepal in centimeters
#> 2 Sepal.Width Sepal Width numeric Width of the sepal in centimeters
For larger datasets, you can use sampling and grouping:
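A call along these lines is one way that might look. Note that `grouping` is an assumed argument name for group-wise sampling, and `ggplot2::diamonds` is simply a convenient larger dataset; check `?create_data_dictionary` for the exact interface:

```r
# Not run - requires API key
# Sketch only: `grouping` is an assumed argument name
diamonds_dict <- create_data_dictionary(
  data = ggplot2::diamonds,         # a larger built-in dataset (~54k rows)
  file_path = file_temp(ext = "csv"),
  model = "gpt-4",
  sample_n = 10,                    # send only a sample of rows to the model
  grouping = "cut"                  # assumed: sample within levels of `cut`
)
```

Sampling keeps API costs down and avoids sending an entire dataset to the model, while still giving it representative values to describe.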
The {qtkit} package provides flexible tools for standardizing dataset documentation. By combining create_data_origin() and create_data_dictionary(), you can create comprehensive documentation that enhances reproducibility and data sharing.
For details on these and other functions in the package, see:
help(package = "qtkit")