---
title: "From Rolling Quarters to Monthly Estimates: SIDRA Mensalization Guide"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{From Rolling Quarters to Monthly Estimates: SIDRA Mensalization Guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

<!--
MAINTAINER NOTE:
This vignette uses eval=FALSE for all code chunks.
Pre-computed figures are generated by:
  code/precompute_sidra_mensalization.R

To regenerate figures and verify code:
  source("code/precompute_sidra_mensalization.R")

Plot code in the vignette is wrapped in <details> blocks and matches
the generation script exactly. Look for VIGNETTE CODE markers in the script.
-->

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE,
  purl = FALSE
)
```

## Overview

Brazil's Continuous National Household Sample Survey (PNADC) publishes labor
market indicators as **rolling (moving) quarters** — 3-month moving averages
where each published "quarter" shares 2 months with its neighbors. This
smoothing hides short-term dynamics: turning points are delayed, seasonal
patterns are distorted, and international comparison becomes difficult.

The PNADCperiods package includes a SIDRA mensalization module that recovers
**exact monthly estimates** from rolling quarter data. This vignette explains
how to use it.

### Why Rolling Quarters Are Problematic

Each published "quarter" is actually a 3-month moving average:

- "2019-Q1" = average of Jan, Feb, Mar 2019
- "2019-Q2" = average of Feb, Mar, Apr 2019
- "2019-Q3" = average of Mar, Apr, May 2019

![Rolling quarters overlap: each 'quarter' shares 2 months with its neighbors](figures/sidra-mensalization/fig1_rolling_schematic.png){width=100%}

When unemployment jumps sharply in a single month, the rolling quarter spreads
that spike across multiple overlapping periods. The mensalization algorithm
inverts this averaging process to recover the true monthly values.
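
The smoothing effect can be seen on a small made-up series. The sketch below uses base R only; the numbers are illustrative, not PNADC data, and the rolling quarter is built with a plain trailing 3-month mean.

```{r rolling-spike-sketch, eval=FALSE}
# Made-up monthly unemployment rates with a one-month spike in month 5
y <- c(10, 10, 10, 10, 13, 10, 10, 10)

# Rolling quarter ending in month t = mean of months t-2, t-1, t
x <- as.numeric(stats::filter(y, rep(1 / 3, 3), sides = 1))

# Side-by-side view: the first two rolling quarters are NA (not enough history)
data.frame(month = seq_along(y), monthly = y, rolling_quarter = x)
```

The 3-point spike in month 5 appears as a 1-point rise in each of the three rolling quarters ending in months 5, 6, and 7, so neither its size nor its timing is visible in the published series.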

---

## Quick Start

```{r quickstart, eval=FALSE}
library(PNADCperiods)

# Step 1: Fetch rolling quarter data from SIDRA API
rolling_quarters <- fetch_sidra_rolling_quarters()

# Step 2: Convert to monthly estimates
monthly <- mensalize_sidra_series(rolling_quarters)

# Step 3: Use your monthly data!
head(monthly[, .(anomesexato, m_popocup, m_taxadesocup)])
```

That's it! You now have monthly estimates starting from January 2012. Here is
what each step did:

1. **`fetch_sidra_rolling_quarters()`** downloaded 86+ economic indicators
   from IBGE's SIDRA API

2. **`mensalize_sidra_series()`** applied the mensalization formula using
   pre-computed starting points (bundled with the package)

3. The result is a `data.table` with one row per month and `m_*` columns
   for each mensalized series

---

## Understanding the Output

The mensalized output contains:

- `anomesexato`: Month identifier (YYYYMM format, e.g., 201903 = March 2019)
- `m_*` columns: Mensalized (monthly) estimates for each series
- Price indices: `ipca100dez1993`, `inpc100dez1993` (passed through for deflation)

**Key series include:**

| Column | Description | Unit |
|--------|-------------|------|
| `m_populacao` | Total population | Thousands |
| `m_pop14mais` | Population 14+ years | Thousands |
| `m_popocup` | Employed population | Thousands |
| `m_popdesocup` | Unemployed population | Thousands |
| `m_taxadesocup` | Unemployment rate | Percent |
| `m_taxapartic` | Labor force participation rate | Percent |
| `m_massahabnominaltodos` | Total nominal wage bill | Millions R$ |

Rate series (like `m_taxadesocup`) are **derived** from mensalized level
series when `compute_derived = TRUE` (the default). They are computed as
ratios of the mensalized levels, not directly mensalized from the rolling
quarter rates.
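
As a rough illustration of what "derived from levels" means, the sketch below computes an unemployment rate from two level columns. The figures are made up, and the formula (unemployed divided by employed plus unemployed) is the standard definition, assumed here rather than taken from the package source.

```{r derived-rate-sketch, eval=FALSE}
# Made-up mensalized levels, in thousands of people
monthly_levels <- data.frame(
  anomesexato  = c(202001L, 202002L, 202003L),
  m_popocup    = c(94000, 93500, 92000),
  m_popdesocup = c(12000, 12300, 13000)
)

# Labor force = employed + unemployed (assumed definition)
monthly_levels$labor_force <- monthly_levels$m_popocup + monthly_levels$m_popdesocup

# Rate as a ratio of levels, in percent
monthly_levels$rate <- 100 * monthly_levels$m_popdesocup / monthly_levels$labor_force
monthly_levels
```

Working from levels keeps the rate consistent with the mensalized numerator and denominator in every month.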

### Discovering Available Series

Use `get_sidra_series_metadata()` to explore all 86+ available series:

```{r metadata, eval=FALSE}
meta <- get_sidra_series_metadata()

# View series organized by theme
meta[, .N, by = .(theme, theme_category)]

# Filter to specific theme categories
meta[theme_category == "employment_type", .(series_name, description)]
```

The metadata uses a hierarchical taxonomy: `theme` (top level, e.g.,
"labor_market"), `theme_category` (e.g., "employment_type"), and optionally
`subcategory` (e.g., "levels", "rates").

---

## Data Flow

The mensalization process follows a three-step pipeline:

![Data flow from SIDRA to monthly estimates](figures/sidra-mensalization/fig2_data_flow.png){width=100%}

### Step 1: Fetching Rolling Quarter Data

`fetch_sidra_rolling_quarters()` downloads data from five SIDRA tables:

| Table | Content |
|-------|---------|
| 4093 | Population and labor force |
| 6390 | Income (nominal and real) |
| 6392 | Real income by occupation |
| 6399 | Employment by sector |
| 6906 | Underutilization indicators |

```{r fetch-inspect, eval=FALSE}
rq <- fetch_sidra_rolling_quarters(verbose = TRUE)

# Inspect structure
dim(rq)
names(rq)[1:20]
```

Key columns: `anomesfinaltrimmovel` (end month of rolling quarter, YYYYMM),
`mesnotrim` (month position 1/2/3), plus one column per series.
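
The month position can be derived from the YYYYMM code alone. The helper below is a hypothetical illustration, not a package function; it assumes positions follow the calendar-quarter layout (1 for Jan/Apr/Jul/Oct, 2 for Feb/May/Aug/Nov, 3 for Mar/Jun/Sep/Dec).

```{r mesnotrim-sketch, eval=FALSE}
# Hypothetical helper: month position (1, 2, 3) from a YYYYMM integer
month_position <- function(yyyymm) {
  month <- yyyymm %% 100L      # last two digits are the calendar month
  (month - 1L) %% 3L + 1L      # Jan -> 1, Feb -> 2, Mar -> 3, Apr -> 1, ...
}

month_position(c(201901L, 201902L, 201903L, 201904L, 201912L))
```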

### Step 2: The Mensalization Transform

```{r mensalize-inspect, eval=FALSE}
monthly <- mensalize_sidra_series(rq, verbose = TRUE)

# Compare dimensions
cat("Rolling quarters:", nrow(rq), "rows\n")
cat("Monthly data:", nrow(monthly), "rows\n")
```

The row count is approximately the same (one per month), but the meaning
changes from "rolling quarter ending in month X" to "exact estimate for
month X".

### Step 3: Using Monthly Estimates

<details>
<summary>Show plotting code</summary>

```{r plot-unemployment, eval=FALSE}
# --- VIGNETTE CODE: plot-unemployment ---
library(ggplot2)

monthly[, date := as.Date(paste0(substr(anomesexato, 1, 4), "-",
                                  substr(anomesexato, 5, 6), "-01"))]

ggplot(monthly, aes(x = date, y = m_taxadesocup)) +
  geom_line(color = "#1976D2", linewidth = 0.8) +
  labs(title = "Monthly Unemployment Rate",
       x = NULL, y = "Unemployment Rate (%)")
```

</details>

### Population Data for Weighting

For analyses requiring monthly population estimates separately:

```{r pop-data, eval=FALSE}
pop <- fetch_monthly_population()
head(pop)
```

`fetch_monthly_population()` returns a `data.table` with `ref_month_yyyymm`
and `m_populacao` columns.

---

## Working with Series

### Fetching by Theme

Instead of fetching all 86+ series, filter by theme or theme category:

```{r by-theme, eval=FALSE}
# Only employment type series
employment <- fetch_sidra_rolling_quarters(theme_category = "employment_type")

# Only wage mass series
wages <- fetch_sidra_rolling_quarters(theme_category = "wage_mass")

# Only labor market theme (includes participation, unemployment, employment types, etc.)
labor <- fetch_sidra_rolling_quarters(theme = "labor_market")
```

### Fetching Specific Series

For maximum efficiency, request only the series you need:

```{r specific-series, eval=FALSE}
# Only unemployment-related series
unemp <- fetch_sidra_rolling_quarters(
  series = c("popdesocup", "taxadesocup", "popnaforca")
)
```

### Excluding Derived Series

Some series are rates computed from other series. To fetch only "base" series:

```{r exclude-derived, eval=FALSE}
# Exclude computed rates (only population and income levels)
base_only <- fetch_sidra_rolling_quarters(exclude_derived = TRUE)
```

### Selecting Output Columns

After mensalization, select columns as needed:

```{r select-columns, eval=FALSE}
monthly <- mensalize_sidra_series(rq)

# Select specific series
labor_market <- monthly[, .(
  anomesexato,
  employed = m_popocup,
  unemployed = m_popdesocup,
  unemp_rate = m_taxadesocup,
  participation = m_taxapartic
)]
```

---

## The Mensalization Methodology

*This section can be skipped by users who just need results.*

### The Core Concept

Rolling quarters are 3-month moving averages. If we denote the true monthly
value for month $t$ as $y_t$, then the rolling quarter value $x_t$ is:

$$x_t = \frac{y_{t-2} + y_{t-1} + y_t}{3}$$

The mensalization algorithm inverts this relationship to recover $y_t$ from
the sequence of $x_t$ values.

### The Mensalization Formula

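**Step 1: Compute scaled first differences**

Subtracting consecutive rolling quarters cancels the two months they share, so
$x_t - x_{t-1} = (y_t - y_{t-3})/3$. Multiplying by 3 gives the change between
two months that are three months apart:

$$d3_t = 3\,(x_t - x_{t-1}) = y_t - y_{t-3}$$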

**Step 2: Identify month position (`mesnotrim`)**

Each month has a position within its quarter:

- Position 1: Jan, Apr, Jul, Oct
- Position 2: Feb, May, Aug, Nov
- Position 3: Mar, Jun, Sep, Dec

**Step 3: Cumulative sum by position**

For each position separately, compute the cumulative sum of the $d3$ values,
starting from a calibrated "starting point" $y_0$ (one per series and position):

$$y_t = y_0 + \sum_{s \in \text{same position}, s \leq t} d3_s$$

![Mensalization process: rolling quarters (blue) vs monthly estimates (red)](figures/sidra-mensalization/fig3_mensalization_process.png){width=100%}
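
The three steps can be checked end to end on synthetic data. The sketch below uses base R only: it simulates a "true" monthly series, builds its rolling quarters, and inverts them. Each difference is scaled by 3 because, under the moving-average definition, $x_t - x_{t-1} = (y_t - y_{t-3})/3$. As a simplification, the first true value of each position is used as $y_0$; the package instead calibrates starting points against microdata.

```{r mensalize-toy-sketch, eval=FALSE}
set.seed(42)
n <- 24
y_true <- 100 + cumsum(rnorm(n))   # synthetic "true" monthly values

# Rolling quarter ending in month t: mean of months t-2, t-1, t
x <- as.numeric(stats::filter(y_true, rep(1 / 3, 3), sides = 1))

# Step 1: scaled differences, d3_t = 3 * (x_t - x_{t-1}) = y_t - y_{t-3}
d3 <- c(NA, 3 * diff(x))

# Step 2: month position (1, 2, 3), assuming the series starts in position 1
pos <- (seq_len(n) - 1) %% 3 + 1

# Step 3: cumulative sum within each position, anchored at y0
y_hat <- rep(NA_real_, n)
for (p in 1:3) {
  later <- which(pos == p & seq_len(n) > 3)  # months where d3 is defined
  y0 <- y_true[p]                            # simplification: true first value
  y_hat[p] <- y0
  y_hat[later] <- y0 + cumsum(d3[later])
}

# Largest gap between recovered and true values (floating-point noise only)
max(abs(y_hat - y_true))
```

With an exact $y_0$ the inversion reproduces the monthly series. Any error in $y_0$ shifts every month of that position by the same amount, which is why the calibration of starting points matters.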

### The Role of Starting Points ($y_0$)

The starting point $y_0$ is crucial. It determines the **level** of all
subsequent monthly estimates. The package includes pre-computed starting points
for 53 series, calibrated during the stable 2013-2019 period.

Starting points are computed by:

1. Processing PNADC microdata to get "true" monthly aggregates ($z$ values)
2. Comparing these to rolling quarters
3. Finding the $y_0$ that makes $y_0 + \text{cumsum}(d3)$ match the microdata
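
Step 3 can be read as fitting an intercept. The sketch below shows one way to do this for a single series and month position, using a least-squares criterion on made-up data. The criterion is an assumption chosen for illustration; the package's actual calibration rule may differ.

```{r y0-least-squares-sketch, eval=FALSE}
set.seed(1)
n <- 12                                     # months of one position in the calibration window
cum_d3 <- cumsum(rnorm(n))                  # made-up cumulative sum of d3 for this position
true_y0 <- 50                               # value the sketch tries to recover
z <- true_y0 + cum_d3 + rnorm(n, sd = 0.2)  # noisy "microdata" aggregates

# Least-squares intercept: minimises sum((z - (y0 + cum_d3))^2),
# which is the mean gap between the microdata and the cumulative sum
y0_hat <- mean(z - cum_d3)
y0_hat
```

Averaging over the whole calibration window dampens month-to-month sampling noise in the microdata aggregates.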

### Assumptions and Limitations

- Monthly values within each position evolve smoothly
- The calibration period (2013-2019) reflects "normal" conditions
- Cannot recover intra-month variation
- Starting points are calibrated to national totals (not regional breakdowns)

---

## Practical Considerations

### API Caching

The package caches SIDRA API responses in memory during your R session:

```{r cache, eval=FALSE}
# First call: fetches from API (~10 seconds)
rq1 <- fetch_sidra_rolling_quarters()

# Second call with use_cache = TRUE: uses cached data (instant)
rq2 <- fetch_sidra_rolling_quarters(use_cache = TRUE)

# Clear all cached data (force fresh fetch on next call)
clear_sidra_cache()
```

The cache persists until you call `clear_sidra_cache()` or restart R.
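
For readers unfamiliar with session-level caching, the general pattern can be sketched with an R environment. This is a generic illustration with hypothetical names (`sidra_cache_demo`, `cached_fetch()`, `clear_cache_demo()`), not the package's internal implementation.

```{r cache-sketch, eval=FALSE}
# In-memory store that lives for the duration of the R session
sidra_cache_demo <- new.env(parent = emptyenv())

# Return the cached value for `key` if present, otherwise call fetch_fun() and store it
cached_fetch <- function(key, fetch_fun, use_cache = TRUE) {
  if (use_cache && exists(key, envir = sidra_cache_demo, inherits = FALSE)) {
    return(get(key, envir = sidra_cache_demo))
  }
  value <- fetch_fun()
  assign(key, value, envir = sidra_cache_demo)
  value
}

# Drop everything so the next call fetches again
clear_cache_demo <- function() {
  rm(list = ls(sidra_cache_demo, all.names = TRUE), envir = sidra_cache_demo)
}

# Stand-in for a slow API call; counts how often it really runs
n_calls <- 0
slow_fetch <- function() {
  n_calls <<- n_calls + 1
  data.frame(value = 1:3)
}

a <- cached_fetch("table_demo", slow_fetch)   # runs slow_fetch()
b <- cached_fetch("table_demo", slow_fetch)   # served from the cache
clear_cache_demo()
d <- cached_fetch("table_demo", slow_fetch)   # runs slow_fetch() again
n_calls
```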

### Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| "Series not found" | Misspelled series name | Check `get_sidra_series_metadata()` |
| "API timeout" | SIDRA server slow | Retry; use `use_cache = TRUE` |
| "No starting points" | Custom series | See Custom Starting Points below |

```{r error-handling, eval=FALSE}
# Check if series exists
meta <- get_sidra_series_metadata()
"taxadesocup" %in% meta$series_name  # TRUE
```
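
When several names are requested at once, it can help to list the unknown ones together with a likely correction. The sketch below uses base R only; the `available` vector is an illustrative stand-in for `meta$series_name`, and the suggestion logic is an assumption of this example, not a package feature.

```{r series-check-sketch, eval=FALSE}
# Illustrative stand-in for meta$series_name
available <- c("popocup", "popdesocup", "taxadesocup", "popnaforca")
requested <- c("taxadesocup", "popdesocupp")   # second name has a typo

# Names that are not in the catalog
unknown <- setdiff(requested, available)

# Suggest the closest valid name by edit distance
dists <- utils::adist(unknown, available)
suggestion <- available[apply(dists, 1, which.min)]

data.frame(unknown = unknown, did_you_mean = suggestion)
```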

### Data Quality Notes

**COVID-19 disruptions (2020):** IBGE suspended in-person interviews during
the pandemic. Some indicators show unusual patterns in 2020-Q2.

**CNPJ series availability:** Series based on CNPJ registration
(`empregadorcomcnpj`, `contapropriacomcnpj`, etc.) are only available from
October 2015, when variable `V4019` was introduced.

---

## Custom Starting Points

*For users with calibrated PNADC microdata.*

Use the bundled starting points (default) unless:

1. **Your series isn't bundled** — Custom variable definitions
2. **Different calibration period** — Non-standard reference period
3. **Regional breakdown** — State or metro-area mensalization

### Option A: All-in-One Function

```{r custom-y0-allinone, eval=FALSE}
# Load your stacked PNADC microdata (with pnadc_apply_periods weights)
stacked <- readRDS("my_calibrated_pnadc.rds")

# Compute starting points
custom_y0 <- compute_starting_points_from_microdata(
  data = stacked,
  calibration_start = 201301L,
  calibration_end = 201912L,
  verbose = TRUE
)

# Use custom starting points
monthly <- mensalize_sidra_series(rq, starting_points = custom_y0)
```

### Option B: Step-by-Step

```{r custom-y0-stepbystep, eval=FALSE}
# Step 1: Build crosswalk and calibrate
crosswalk <- pnadc_identify_periods(stacked)
calibrated <- pnadc_apply_periods(
  stacked, crosswalk,
  weight_var = "V1028",
  anchor = "quarter",
  calibration_unit = "month"
)

# Step 2: Compute z_ aggregates (monthly totals from microdata)
z_agg <- compute_z_aggregates(calibrated)

# Step 3: Fetch rolling quarters for comparison
rq <- fetch_sidra_rolling_quarters()

# Step 4: Compute starting points
y0 <- compute_series_starting_points(
  monthly_estimates = z_agg,
  rolling_quarters = rq,
  calibration_start = 201301L,
  calibration_end = 201912L
)

# Step 5: Use custom starting points
result <- mensalize_sidra_series(rq, starting_points = y0)
```

CNPJ-based series automatically use a later calibration period (2016-2019)
when `use_series_specific_periods = TRUE` (the default in
`compute_series_starting_points()`).

### Validating Custom Starting Points

```{r validate-y0, eval=FALSE}
bundled <- pnadc_series_starting_points

# Merge and compare
comp <- merge(custom_y0, bundled,
              by = c("series_name", "mesnotrim"),
              suffixes = c("_custom", "_bundled"))

comp[, rel_diff := abs(y0_custom - y0_bundled) / abs(y0_bundled) * 100]
comp[rel_diff > 1]  # Flag series with >1% difference
```

---

## Case Study: COVID-19 Unemployment

How quickly did unemployment rise when COVID-19 hit Brazil? Rolling quarter
data obscures these dynamics. Monthly estimates reveal the exact timing.

<details>
<summary>Show analysis code</summary>

```{r covid-analysis, eval=FALSE}
# --- VIGNETTE CODE: covid-analysis ---
# Fetch all series and mensalize
rq <- fetch_sidra_rolling_quarters()
monthly <- mensalize_sidra_series(rq)

# Filter to COVID period
covid_period <- monthly[anomesexato >= 201901 & anomesexato <= 202212]

# Create date column
covid_period[, date := as.Date(paste0(
  substr(anomesexato, 1, 4), "-",
  substr(anomesexato, 5, 6), "-01"
))]

# Find peak
peak_month <- covid_period[which.max(m_taxadesocup)]
cat("Peak unemployment:", peak_month$m_taxadesocup, "% in",
    format(peak_month$date, "%B %Y"), "\n")
```

</details>

![Monthly vs rolling quarter unemployment rate (2019-2023)](figures/sidra-mensalization/fig4_monthly_vs_quarterly.png){width=100%}

![COVID-19 impact on Brazilian unemployment](figures/sidra-mensalization/fig5_covid_case_study.png){width=100%}

**Key findings from monthly estimates:**

1. **Exact peak timing**: Monthly data pinpoints the peak month, while
   rolling quarters show only a gradual rise

2. **Speed of impact**: The monthly series reveals a sharp spike that
   rolling quarters smooth over 3+ months

3. **Recovery dynamics**: Monthly estimates show pauses and reversals
   in recovery that are hidden in rolling quarter averages

---

## Series Naming Conventions

| Pattern | Meaning | Example |
|---------|---------|---------|
| `m_` | Mensalized monthly estimate | `m_popocup` |
| `pop*` | Population count | `populacao`, `pop14mais` |
| `*comcart` | With formal contract | `empregprivcomcart` |
| `*semcart` | Without formal contract | `empregprivsemcart` |
| `*comcnpj` | With CNPJ registration | `empregadorcomcnpj` |
| `taxa*` | Rate (percent) | `taxadesocup` |
| `nivel*` | Level/ratio (percent) | `nivelocup` |
| `rend*` | Income (rendimento) | `rendhabnominaltodos` |
| `massa*` | Wage bill (massa salarial) | `massahabnominaltodos` |
| `*hab*` | Usually received (habitual) | `rendhabnominaltodos` |
| `*efet*` | Actually received (efetivo) | `rendefetnominaltodos` |
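
The `m_` prefix makes it easy to separate mensalized columns from identifiers and pass-through columns. A minimal sketch in base R, using an illustrative subset of column names and assuming that stripping the prefix yields the corresponding rolling quarter series name:

```{r naming-sketch, eval=FALSE}
# Column names as they might appear in the mensalized output (illustrative subset)
cols <- c("anomesexato", "m_popocup", "m_taxadesocup", "ipca100dez1993")

mensalized <- grep("^m_", cols, value = TRUE)   # columns carrying the m_ prefix
base_names <- sub("^m_", "", mensalized)        # prefix removed

# Named vector: mensalized column -> base series name
setNames(base_names, mensalized)
```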

For the complete catalog, use `get_sidra_series_metadata()`:

```{r programmatic-access, eval=FALSE}
meta <- get_sidra_series_metadata()

# Filter by theme category
meta[theme_category == "employment_type", .(series_name, description)]

# Filter by theme and pattern
meta[theme == "labor_market" & grepl("taxa|nivel", series_name),
     .(series_name, description)]
```

---

## Function Reference

| Function | Purpose |
|----------|---------|
| `fetch_sidra_rolling_quarters()` | Download rolling quarter data from SIDRA API |
| `fetch_monthly_population()` | Get monthly population estimates |
| `mensalize_sidra_series()` | Convert rolling quarters to monthly estimates |
| `get_sidra_series_metadata()` | Explore available series and metadata |
| `clear_sidra_cache()` | Clear cached API data |
| `compute_z_aggregates()` | Compute monthly aggregates from calibrated microdata |
| `compute_series_starting_points()` | Compute $y_0$ values from aggregates |
| `compute_starting_points_from_microdata()` | All-in-one $y_0$ computation |

**Bundled data:** `pnadc_series_starting_points`, with pre-computed $y_0$ for
53 series × 3 month positions (calibration period: 2013-2019).

---

## References

- HECKSHER, Marcos. "Valor Impreciso por Mês Exato: Microdados e Indicadores Mensais Baseados na Pnad Contínua". IPEA - Nota Técnica Disoc, n. 62. Brasília, DF: IPEA, 2020. <https://portalantigo.ipea.gov.br/portal/index.php?option=com_content&view=article&id=35453>
- HECKSHER, Marcos. "Cinco meses de perdas de empregos e simulação de um incentivo a contratações". IPEA - Nota Técnica Disoc, n. 87. Brasília, DF: IPEA, 2020.
- HECKSHER, Marcos. "Mercado de trabalho: A queda da segunda quinzena de março, aprofundada em abril". IPEA - Carta de Conjuntura, v. 47, p. 1-6, 2020.
- BARBOSA, Rogerio J.; HECKSHER, Marcos. PNADCperiods: Identify Reference Periods in Brazil's PNADC Survey Data. R package version 0.1.0, 2026. <https://github.com/antrologos/PNADCperiods>

---

## Further Reading

- `vignette("getting-started")` — Setting up PNADC microdata analysis
- `vignette("how-it-works")` — The period identification algorithm
- `vignette("applied-examples")` — Applied research examples
- **IBGE SIDRA API**: https://sidra.ibge.gov.br/
- **Package repository**: https://github.com/antrologos/PNADCperiods
