---
title: "WDL and WIG Model Specs"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{WDL and WIG Model Specs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(rwig) |> suppressPackageStartupMessages()
```

In this vignette, I will show how to set up the control parameters
(hyper-parameters) needed for the WDL and WIG models.

`wdl_specs()` returns a list of lists with five parts:
`wdl_control`, `tokenizer_control`, `word2vec_control`,
`barycenter_control`, and `optimizer_control`.

`wig_specs()` has the same structure as `wdl_specs()`,
with one additional part, `wig_control`.

## `wig_control`

These options are needed only for `wig_specs()`. The defaults are:

```{r, eval=FALSE}
wig_control = list(
  group_unit = "month",
  svd_method = "docs",
  standardize = TRUE
)
```

1. `group_unit` sets the time unit at which the documents are grouped.
It is passed to `lubridate::floor_date()` as the `unit` argument.
The default, "month", produces a monthly time-series index;
any other value accepted by the `unit` argument
of `lubridate::floor_date()` can be used instead.
2. `svd_method` can be either "docs" or "topics".
With "docs", the truncated SVD (TSVD) is applied to the reconstructed
documents to get the index directly;
with "topics", TSVD is applied to the topics matrix
before the index is constructed.
The latter is the method originally proposed in Xie (2020).
3. `standardize`: bool, whether to standardize the resulting index
to mean 100 and standard deviation 1. The default is `TRUE`,
following Baker et al. (2016) and Xie (2020).
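
As a rough illustration of `group_unit` and `standardize`, the sketch below
calls `lubridate::floor_date()` directly and rescales a made-up index.
The dates and index values are invented, and the standardization formula
is an assumption about what "mean 100 and standard deviation 1" means in
practice, not a copy of the package internals.

```{r, eval=FALSE}
library(lubridate)

# Hypothetical document dates, for illustration only
dates <- as.Date(c("2020-01-03", "2020-01-28", "2020-02-14", "2020-03-01"))

# group_unit = "month": each date is mapped to the first day of its month,
# so documents from the same month fall into the same group
monthly <- floor_date(dates, unit = "month")
monthly

# Any other unit accepted by floor_date() works the same way
quarterly <- floor_date(dates, unit = "quarter")
quarterly

# standardize = TRUE (assumed formula): centre and scale a raw index,
# then shift it so that the mean is 100 and the standard deviation is 1
raw_index <- c(0.8, 1.3, 0.5, 1.1)  # made-up values
std_index <- 100 + (raw_index - mean(raw_index)) / sd(raw_index)
c(mean = mean(std_index), sd = sd(std_index))
```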

## `wdl_control`

These options are supplied to the WDL model
and are used by both `wdl_specs()` and `wig_specs()`.

1. `num_topics`: number of topics in the topic model
2. `batch_size`: batch size used during training
3. `epochs`: number of epochs, i.e. passes over the training data
4. `shuffle`: bool, whether to shuffle the input data randomly
5. `verbose`: bool, whether to print diagnostic information
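
To make the roles of `batch_size`, `epochs`, and `shuffle` concrete,
here is a minimal base-R sketch of a generic mini-batch loop.
The helper `make_batches()` is hypothetical and the loop body is a placeholder;
this is not how the package necessarily implements training.

```{r, eval=FALSE}
# Hypothetical helper: split document indices into mini-batches
make_batches <- function(n_docs, batch_size, shuffle = TRUE) {
  # shuffle = TRUE visits the documents in a random order
  idx <- if (shuffle) sample(n_docs) else seq_len(n_docs)
  # consecutive chunks of at most `batch_size` indices
  split(idx, ceiling(seq_along(idx) / batch_size))
}

set.seed(1)
n_docs <- 10
batch_size <- 4
epochs <- 2

for (epoch in seq_len(epochs)) {
  # one epoch is one full pass over all documents
  batches <- make_batches(n_docs, batch_size, shuffle = TRUE)
  for (b in batches) {
    # a real training loop would compute the loss and gradient on documents `b`
    message("epoch ", epoch, ": documents ", paste(b, collapse = ", "))
  }
}
```

With 10 documents and a batch size of 4, each epoch consists of three batches,
the last of which holds the remaining 2 documents.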

## `tokenizer_control`

Arguments for `tokenizers::tokenize_word_stems()`.
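
As a sketch of what such a control list might contain, the example below uses
two arguments of `tokenizers::tokenize_word_stems()` itself
(`language` and `stopwords`). The sample sentence and the stop-word list are
made up, and forwarding the list with `do.call()` is only an illustration,
not necessarily how the package passes it on.

```{r, eval=FALSE}
library(tokenizers)

# Illustrative control list; names match tokenize_word_stems() arguments
tokenizer_control <- list(
  language = "english",
  stopwords = c("the", "of", "and")
)

# Made-up document
docs <- c("The uncertainty of economic policies and markets")

# Forward the control list as named arguments
tokens <- do.call(tokenize_word_stems, c(list(x = docs), tokenizer_control))
tokens[[1]]
```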

## `word2vec_control`

Arguments for `word2vec::word2vec()`,
but with the following default parameters:

```{r, eval=FALSE}
type = "cbow"
dim = 10
min_count = 1
```

## `barycenter_control`

Identical to the `barycenter_control` argument of the `barycenter()` function,
but with the default

```{r, eval=FALSE}
with_grad = TRUE
```

## `optimizer_control`

Parameters to control the optimizer (SGD, Adam, AdamW).

```{r, eval=FALSE}
optimizer_control = list(
  optimizer = "adamw",
  lr = .005,
  decay = .01,
  beta1 = .9,
  beta2 = .999,
  eps = 1e-8
)
```

The default optimizer is AdamW ("adamw"), but you can also choose vanilla
SGD ("sgd") or vanilla Adam ("adam").
You can also tune the learning rate `lr` as part of a hyper-parameter search.
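
One way to prepare such a search, sketched in base R only: start from the
defaults listed above and override `lr` with `modifyList()`.
The grid values are arbitrary, and how each candidate list is then handed to
the model is left out here.

```{r, eval=FALSE}
# Defaults as listed above
default_optimizer_control <- list(
  optimizer = "adamw",
  lr = .005,
  decay = .01,
  beta1 = .9,
  beta2 = .999,
  eps = 1e-8
)

# Arbitrary learning rates to try
lr_grid <- c(.001, .005, .01)

# One control list per learning rate; all other entries keep their defaults
candidates <- lapply(lr_grid, function(lr) {
  modifyList(default_optimizer_control, list(lr = lr))
})

sapply(candidates, function(ctrl) ctrl$lr)
```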

The other defaults should be left untouched in most cases,
unless you know exactly what you are doing.
For reference, see Section 7.1 in Xie (2025)
and the references therein.
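
For readers wondering what `decay`, `beta1`, `beta2`, and `eps` do,
the following is a sketch of a single AdamW update in its commonly stated
textbook form (Adam with decoupled weight decay). The function `adamw_step()`
is hypothetical and written for illustration; the package's own implementation
may differ in its details.

```{r, eval=FALSE}
# One AdamW update, textbook form; illustration only
adamw_step <- function(theta, grad, m, v, t,
                       lr = .005, decay = .01,
                       beta1 = .9, beta2 = .999, eps = 1e-8) {
  m <- beta1 * m + (1 - beta1) * grad    # running mean of gradients
  v <- beta2 * v + (1 - beta2) * grad^2  # running mean of squared gradients
  m_hat <- m / (1 - beta1^t)             # bias correction at step t
  v_hat <- v / (1 - beta2^t)
  # adaptive gradient step plus decoupled weight decay
  theta <- theta - lr * (m_hat / (sqrt(v_hat) + eps) + decay * theta)
  list(theta = theta, m = m, v = v)
}

# First step from zero-initialised moments, on a made-up parameter and gradient
state <- adamw_step(theta = 1, grad = 2, m = 0, v = 0, t = 1)
state$theta
```

Here `beta1` and `beta2` control how quickly the two running averages forget
old gradients, `eps` guards against division by zero, and `decay` shrinks the
parameters toward zero independently of the gradient.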


## See Also

See also `vignette("wdl-model")`, `vignette("wig-model")`.

## References

Baker, S. R., Bloom, N., & Davis, S. J. (2016). 
Measuring economic policy uncertainty. 
*The Quarterly Journal of Economics*, 131(4), 1593–1636. 
https://doi.org/10.1093/qje/qjw024

Peyré, G., & Cuturi, M. (2019). Computational Optimal Transport:
With Applications to Data Science.
*Foundations and Trends® in Machine Learning*, 11(5–6), 355–607.
https://doi.org/10.1561/2200000073

Schmitz, M. A., Heitz, M., Bonneel, N., Ngolè, F., Coeurjolly, D.,
Cuturi, M., Peyré, G., & Starck, J.-L. (2018).
Wasserstein dictionary learning:
Optimal transport-based unsupervised nonlinear dictionary learning.
*SIAM Journal on Imaging Sciences*, 11(1), 643–678.
https://doi.org/10.1137/17M1140431

Xie, F. (2020). Wasserstein index generation model: Automatic generation of
time-series index with application to economic policy uncertainty.
*Economics Letters*, 186, 108874.
https://doi.org/10.1016/j.econlet.2019.108874

Xie, F. (2025). Deriving the Gradients of Some Popular Optimal
Transport Algorithms (No. arXiv:2504.08722). *arXiv*.
https://doi.org/10.48550/arXiv.2504.08722