---
title: "Indexing and Query Optimization"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Indexing and Query Optimization}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

When a dataset has millions of rows spread across hundreds of row groups, a naive
scan reads every byte from disk even if the query only touches a handful of
rows. vectra's optimizer exists to prevent that. It applies several techniques
automatically (zone-map pruning, predicate pushdown, column pruning) and
exposes one manual mechanism: hash indexes.

The goal in every case is the same: skip work. Skip row groups whose statistics
prove they contain nothing relevant. Skip columns the query never references.
Skip disk reads entirely when a hash index can identify the target row groups
in constant time.

This vignette walks through each optimization, starting from the ones that
happen automatically and moving toward the ones that require explicit setup.
We will build a synthetic ecological dataset, write it to `.vtr`, then run
queries with and without indexes to see what changes. The `explain()` function
is our primary diagnostic tool throughout: it prints the query plan and
annotates which optimizations fired.

Understanding these mechanisms is not strictly necessary for using vectra. The
engine applies everything it can without intervention. But knowing what the
optimizer does (and what it cannot do) helps when designing file layouts, choosing
row group sizes, and deciding whether an index is worth the disk space.

We will start by creating a dataset to work with.

```{r create-data}
library(vectra)

set.seed(42)
n <- 50000

sites <- paste0("site_", sprintf("%03d", 1:200))
species <- paste0("sp_", sprintf("%03d", 1:80))

eco <- data.frame(
  site     = sample(sites, n, replace = TRUE),
  year     = sample(2000:2023, n, replace = TRUE),
  species  = sample(species, n, replace = TRUE),
  value    = round(runif(n, 0, 100), 2),
  quality  = sample(c("good", "moderate", "poor"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
```

We write this to a `.vtr` file with a relatively small row group size so that
pruning effects are visible in the plan output. In production, larger row groups
(100k to 500k rows) are typical; we use 5000 here so that 50,000 rows split into
10 row groups.

```{r write-vtr}
f <- tempfile(fileext = ".vtr")
write_vtr(eco, f, batch_size = 5000)
```

## Zone-map pruning

Every row group in a v3+ `.vtr` file carries per-column min/max statistics in
its header. When a `filter()` sits directly above a scan, the engine checks
each row group's statistics against the predicate before reading any data. If
a row group's min/max range for the filtered column does not overlap the
predicate's range, the entire row group is skipped.

The min/max statistics are computed at write time. When `write_vtr()` flushes
each row group, it scans every column and records the smallest and largest
values in the row group header. For numeric columns (integer and double), the
comparison is straightforward. For string columns, the engine stores the
lexicographically smallest and largest strings, which lets it answer range
and equality predicates using byte-wise comparison. Boolean columns store
min/max too, but since the domain is just TRUE/FALSE, pruning only helps when
an entire row group is uniformly one value.

Not all column types benefit equally. Numeric columns with natural ordering
produce the tightest zone maps: timestamps that increase over time, sensor
readings that cluster by location, sequential IDs. String columns can also
prune well if the values have a predictable lexicographic distribution (e.g.,
ISO country codes or zero-padded identifiers like `"site_042"`), but
free-text strings with arbitrary prefixes tend to span the full alphabet in
every row group, making the min/max bounds too wide to exclude anything.

This works for numeric comparisons: `>`, `<`, `>=`, `<=`, `==`, and
combinations that define a range (`!=` can only prune a row group whose min
and max both equal the excluded value). It also works for string columns where
lexicographic ordering makes the min/max bounds meaningful.
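
The pruning decision itself is a simple interval-overlap test. The sketch
below shows that logic in plain R (`survives()` is illustrative, not part of
vectra's API): a row group survives a filter like `value > 95` only if its
max exceeds 95.

```{r zonemap-overlap-sketch}
# Conceptual overlap test for a predicate defining the open range (lo, hi).
# A row group with stats [rg_min, rg_max] can be skipped when the ranges
# do not intersect.
survives <- function(rg_min, rg_max, lo = -Inf, hi = Inf) {
  rg_max > lo && rg_min < hi
}
survives(rg_min = 7.3, rg_max = 92.8, lo = 95)   # FALSE: skip the group
survives(rg_min = 88.1, rg_max = 99.9, lo = 95)  # TRUE: must read it
```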

```{r zonemap-explain}
tbl(f) |>
  filter(value > 95) |>
  explain()
```

The plan output shows `v4 stats` on the ScanNode, indicating that zone-map
statistics are available and the engine will use them. With our uniform random
data, roughly 5% of rows have `value > 95`. Some row groups will have a max
below 95 and be skipped entirely; others will be read and filtered
conventionally.

The effectiveness of zone-map pruning depends on two things: the distribution
of data within row groups, and the size of those row groups. If data arrives
sorted on the filter column, each row group covers a narrow range and most
can be eliminated. If data is randomly shuffled, every row group's range spans
nearly the full domain, and zone maps provide little benefit.
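
We can see this on our own data by computing each 5000-row group's
min-to-max spread for the shuffled original versus a sorted copy (a quick
diagnostic in plain R, outside vectra; `rg` and `spread()` are illustrative
helpers):

```{r zonemap-range-widths}
rg <- (seq_len(n) - 1L) %/% 5000L
spread <- function(x) tapply(x, rg, max) - tapply(x, rg, min)
summary(spread(eco$value))        # shuffled: every group spans nearly the full domain
summary(spread(sort(eco$value)))  # sorted: each group covers about a tenth of it
```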

```{r zonemap-sorted}
eco_sorted <- eco[order(eco$value), ]
f_sorted <- tempfile(fileext = ".vtr")
write_vtr(eco_sorted, f_sorted, batch_size = 5000)
```

When the data is sorted by `value`, each row group covers a contiguous slice
of the value range. A filter for `value > 95` now only needs the last row
group or two.

```{r zonemap-sorted-explain}
tbl(f_sorted) |>
  filter(value > 95) |>
  explain()
```

Row group size is a trade-off. Smaller row groups give finer-grained pruning
boundaries but increase the number of zone-map entries in the header and the
number of read operations. Each row group carries its own set of column
min/max values, validity bitmaps, and byte offsets, so splitting the same
data into 1000 groups instead of 10 multiplies the metadata a hundredfold. On the
read side, each row group that survives pruning triggers a separate seek and
read. Many small reads can be slower than fewer large reads, especially on
spinning disks or network-mounted storage where seek latency dominates.

Larger row groups amortize per-group overhead but make pruning coarser.
A single row group of 500k rows either gets read or skipped as a whole; if
the predicate matches even one row in that group, the engine reads all 500k
and filters in memory. For most workloads, 100k to 500k rows per group is a
reasonable default. If queries frequently filter on a column whose values
cluster spatially (timestamps, sorted IDs, geographic coordinates), smaller
groups around 10k to 50k can pay off because each group covers a narrow value
range and most groups fall cleanly outside the predicate.

The relationship between sorting and row group size is multiplicative. Sorted
data with small row groups is the best-case scenario for zone maps: each
group covers a contiguous, non-overlapping slice of the value domain, and
the engine can binary-search the row group index to find the relevant range.
Unsorted data with small row groups is the worst case: every group spans
nearly the full domain, none get pruned, and we pay the overhead of many
small reads with no pruning benefit.

Zone-map pruning is fully automatic. There is nothing to configure and no
sidecar file to maintain. It composes with every other optimization described
below.

## Hash indexes

Zone maps work well for range predicates on columns with spatial locality, but
they do nothing for equality lookups on high-cardinality string columns. If we
filter for a specific site name, the engine must check every row group's string
min/max bounds, which often span the entire alphabet. This is where hash
indexes help.

A hash index is a `.vtri` sidecar file that maps column values to the row
groups that contain them. Internally, the `.vtri` file is a serialized
open-addressing hash table using FNV-1a as the hash function. Each slot in
the table stores a column value (or its hash) and a bitmap of row group IDs
that contain that value. Open addressing means there are no linked lists or
secondary structures: when a hash collision occurs, the engine probes forward
through adjacent slots until it finds an empty one. At query time, "probing
the index" means hashing the filter value, looking up the corresponding slot,
walking past any collisions, and reading the row group bitmap directly. The
whole operation is O(1) in the expected case, with a small constant factor
that depends on how full the table is (the load factor stays below 70% to
keep probe chains short).
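
Functionally, the table behaves like a map from each distinct value to the
set of row groups that contain it. A toy model in plain R makes the idea
concrete (a named list stands in for the open-addressing table; `toy_index`
is illustrative, not the on-disk format):

```{r index-toy-map}
rg <- (seq_len(nrow(eco)) - 1L) %/% 5000L  # 5000-row groups, matching batch_size
toy_index <- tapply(rg, eco$site, function(g) sort(unique(g)))
toy_index[["site_042"]]  # the row groups a probe for "site_042" would return
```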

This design favors high-cardinality columns. When a column has thousands of
distinct values spread across hundreds of row groups, each value maps to a
small subset of groups, and the index eliminates most of the file from the
scan. A column with only 5 distinct values (like `quality` in our dataset)
maps each value to nearly every row group, so the bitmap is almost fully set
and the index saves little work. Zone maps handle that case adequately on
their own.

Once built, the scan node can identify the relevant row groups in constant
time, skipping all others without reading their data.

```{r index-no-index}
tbl(f) |>
  filter(site == "site_042") |>
  explain()
```

Without an index, the plan shows a FilterNode above a ScanNode with predicate
pushdown and zone-map stats, but no index annotation. The engine reads every
row group and filters in memory.

Now we create an index on the `site` column.

```{r create-index}
create_index(f, "site")
has_index(f, "site")
```

```{r index-with-index}
tbl(f) |>
  filter(site == "site_042") |>
  explain()
```

The plan now shows `hash index` (or equivalent annotation) on the ScanNode.
Instead of scanning all 10 row groups, the engine probes the index to find
which row groups contain `"site_042"` and reads only those. With 200 sites
spread across 10 row groups, most row groups will contain at least one
occurrence, but in real data where sites cluster geographically or
temporally, the savings can be dramatic.

We can see the difference in timing on a moderately sized dataset. The effect
is most pronounced when the predicate is highly selective (few row groups
match) and the file has many row groups.

```{r index-timing}
# Without index -- drop the existing one first
unlink(paste0(f, ".site.vtri"))

t_no_idx <- system.time({
  for (i in 1:50) {
    tbl(f) |> filter(site == "site_042") |> collect()
  }
})

# With index
create_index(f, "site")

t_idx <- system.time({
  for (i in 1:50) {
    tbl(f) |> filter(site == "site_042") |> collect()
  }
})

cat("Without index:", t_no_idx["elapsed"], "s\n")
cat("With index:   ", t_idx["elapsed"], "s\n")
```

The magnitude of the speedup depends on how many row groups the index can
eliminate. On this small dataset with only 10 row groups and uniformly
distributed site names, the difference may be modest. On a file with 1000 row
groups where the target site appears in only 3 of them, the index turns a
full scan into three targeted reads.

The index file sits alongside the data file with the naming convention
`<original>.<column>.vtri`. It is read automatically by `tbl()` whenever a
filter predicate matches the indexed column.

## Composite indexes

When queries filter on multiple columns simultaneously, a composite index can
prune more aggressively than individual single-column indexes. A composite
index hashes the combined values of two or more columns together using FNV-1a
hash combining, mapping compound keys to row groups.

```{r composite-create}
create_index(f, c("site", "year"))
has_index(f, c("site", "year"))
```

This index accelerates AND-combined equality predicates on both columns at once.

```{r composite-explain}
tbl(f) |>
  filter(site == "site_042", year == 2015) |>
  explain()
```

The engine detects that both predicates in the AND-combination match the
composite index columns and probes the index with the combined hash. This is
more selective than probing two single-column indexes independently, because
the composite index encodes the co-occurrence of values within row groups.
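
We can verify the co-occurrence point directly on the in-memory data.
Intersecting two independent single-column probes keeps any group where the
site and the year each appear somewhere; the compound key keeps only groups
where they appear in the same row (plain R, outside vectra):

```{r composite-cooccurrence}
rg <- (seq_len(nrow(eco)) - 1L) %/% 5000L
both_somewhere <- intersect(unique(rg[eco$site == "site_042"]),
                            unique(rg[eco$year == 2015]))
same_row <- unique(rg[eco$site == "site_042" & eco$year == 2015])
length(both_somewhere)  # candidate groups from two single-column probes
length(same_row)        # groups that actually contain the compound key
```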

When should we prefer a composite index over two single-column indexes? The
composite wins when the query always filters on both columns together. If
sometimes we filter on `site` alone and sometimes on `site` and `year`
together, we need a single-column index on `site` in addition to the
composite. A composite index on `(site, year)` does not accelerate a filter
on just `site` or just `year`.

The composite index is stored as a v2-format `.vtri` file. The naming
convention encodes both columns: `<file>.site_year.vtri` (or similar). Like
single-column indexes, composite indexes are detected and loaded automatically
by the scan node when the predicate structure matches.

The FNV-1a hash combining means that `(site = "A", year = 2020)` and
`(site = "B", year = 2020)` produce different hashes even though they share
the year value. There is no prefix-matching behavior as in B-tree indexes.
The composite index is purely an equality lookup on the full compound key.

## Case-insensitive indexes

Ecological datasets often contain species names, location names, or observer
codes with inconsistent casing. `"Quercus robur"`, `"quercus robur"`, and
`"QUERCUS ROBUR"` might all appear in the same column. A standard hash index
treats these as distinct values, so a filter for one spelling misses rows
stored under a different case.

The `ci = TRUE` option builds a case-insensitive index that normalizes all
values to lowercase before hashing.

```{r ci-index}
# Create a dataset with mixed-case species
eco_mixed <- eco
eco_mixed$species[1:100] <- toupper(eco_mixed$species[1:100])
f_mixed <- tempfile(fileext = ".vtr")
write_vtr(eco_mixed, f_mixed, batch_size = 5000)

create_index(f_mixed, "species", ci = TRUE)
```

With this index in place, a filter for `"sp_001"` will match rows regardless
of whether the stored value is `"sp_001"`, `"SP_001"`, or `"Sp_001"`. The
case normalization happens at index creation time, so there is no runtime cost
during queries beyond the normal hash probe.
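
The plan should show the index annotation on the ScanNode just as it does
for a standard hash index:

```{r ci-explain}
tbl(f_mixed) |>
  filter(species == "sp_001") |>
  explain()
```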

Case-insensitive indexes are useful whenever the filter values might not match
the stored casing exactly. Reference tables, user-entered location names, and
taxonomic databases with inconsistent capitalization all benefit from this
option.

## `%in%` acceleration

Hash indexes accelerate single-value equality predicates and also
`%in%` predicates. When we filter a column against a set of values, the engine
probes the index for each value in the set and takes the union of matching row
groups.

```{r in-explain}
target_sites <- c("site_010", "site_042", "site_100", "site_150")

tbl(f) |>
  filter(site %in% target_sites) |>
  explain()
```

The scan node probes the site index four times and builds a bitmap of row
groups that contain at least one of the target values. Only those row groups
are read from disk.
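
A minimal sketch of that union, modeled on the in-memory data with our
5000-row groups (`probe()` is an illustrative helper, not vectra API):

```{r in-union-sketch}
rg <- (seq_len(nrow(eco)) - 1L) %/% 5000L
probe <- function(val) unique(rg[eco$site == val])
sort(Reduce(union, lapply(target_sites, probe)))  # row groups the scan must read
```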

This matters most when the set is small relative to the number of distinct
values. Filtering 4 sites out of 200 in a file with 1000 row groups might
reduce the scan to 20 row groups. Filtering 190 out of 200 sites offers
less benefit since most row groups will match anyway.

The `%in%` acceleration works with both single-column and composite indexes.
For composite indexes, each element of the filter set must be a complete
compound key. In practice, `%in%` on composite indexes is less common;
the typical use case is `%in%` on a single high-cardinality column like
species, site, or sample ID.

```{r in-timing}
t_in_idx <- system.time({
  for (i in 1:50) {
    tbl(f) |>
      filter(site %in% c("site_010", "site_042")) |>
      collect()
  }
})

cat("With index, %in% filter:", t_in_idx["elapsed"], "s\n")
```

Without an index, the same query reads all row groups and filters in memory.
The difference grows with file size.

## Column pruning

Every query touches only a subset of the columns in a file. vectra's optimizer
walks the plan tree top-down, collects the set of columns each node references,
and propagates that information to the scan node. Columns that no node in the
tree needs are excluded from disk reads entirely.

```{r col-pruning}
tbl(f) |>
  filter(value > 90) |>
  select(site, value) |>
  explain()
```

The ScanNode annotation shows something like `2/5 cols (pruned)` or `3/5 cols
(pruned)`, depending on how many columns the filter and select together
require. The `year`, `species`, and `quality` columns are never read from disk.

Column pruning is especially effective on wide tables. A dataset with 50
columns where a query touches 3 of them avoids reading 94% of the column data.
Since `.vtr` files store columns independently within each row group, skipping
a column means skipping its entire byte range on disk.

```{r col-pruning-wide}
wide <- data.frame(
  id = seq_len(1000),
  matrix(rnorm(1000 * 20), ncol = 20,
         dimnames = list(NULL, paste0("v", 1:20)))
)
f_wide <- tempfile(fileext = ".vtr")
write_vtr(wide, f_wide)

tbl(f_wide) |>
  select(id, v1, v2) |>
  explain()
```

The plan shows that only 3 of 21 columns are read. The remaining 18 columns
never touch memory. This optimization is always active and requires no user
action.

## Predicate pushdown

When a `filter()` node sits above a scan node, the engine pushes the predicate
down into the scan itself. This allows the scan to apply zone-map pruning and
hash index lookups before any data is materialized.

```{r pushdown-basic}
tbl(f) |>
  filter(value > 50) |>
  explain()
```

The plan annotates the ScanNode with `predicate pushdown`, indicating that the
filter condition has been absorbed into the scan. The separate FilterNode still
appears in the plan (it handles any remaining rows that pass the zone-map check
but fail the actual predicate), but the scan does the heavy lifting of
eliminating row groups.

Predicate pushdown works when the filter is directly above the scan or when only
"transparent" nodes (like select/rename) sit between them. If a mutate or
join intervenes, the predicate cannot be pushed past it because the filter
might reference columns that the intermediate node creates.

```{r pushdown-blocked}
tbl(f) |>
  mutate(scaled = value / 100) |>
  filter(scaled > 0.9) |>
  explain()
```

Here the filter references `scaled`, which is created by the mutate. The
predicate stays on the FilterNode above the ProjectNode. The ScanNode has no
predicate pushdown annotation. To recover pushdown in this situation,
restructure the query so the filter on `value` happens before the mutate:

```{r pushdown-rewrite}
tbl(f) |>
  filter(value > 90) |>
  mutate(scaled = value / 100) |>
  explain()
```

Now the filter on `value` pushes down to the scan, and the mutate operates
only on the surviving rows. Query structure matters.

## Reading `explain()` output

The `explain()` function prints the query plan as an indented tree. Each line
shows a node type, its execution mode, and any optimization annotations. Reading
this output is the fastest way to understand what the engine will do.

```{r explain-complex}
tbl(f) |>
  filter(site == "site_042", value > 50) |>
  select(site, year, value) |>
  mutate(decade = year - (year %% 10)) |>
  explain()
```

The tree reads bottom-up (scan at the bottom, output at the top). Each node
pulls batches from its child. The annotations on each line tell us:

**Node types:**

- `ScanNode`: reads from disk. Annotations include column count, pruning info,
  predicate pushdown, zone-map stats (`v4 stats`), hash index usage.
- `FilterNode`: applies a predicate row-by-row. When pushdown succeeds, this
  node still appears but does less work.
- `ProjectNode`: handles `select()`, `mutate()`, and `rename()`. Always
  streaming.
- `SortNode`: materializes all input, sorts, then streams output.
- `GroupAggNode`: hash-based grouping and aggregation. Materializes groups.
- `JoinNode`: hash join. Materializes the build side (right table).
- `WindowNode`: window functions. May materialize partitions.
- `TopNNode`: combined sort + limit. More efficient than separate sort and
  limit for small N.
- `ConcatNode`: union of multiple inputs (`bind_rows()`). Streaming.

**Annotations:**

- `streaming`: the node processes one batch at a time without buffering the
  full input. Memory usage is bounded by the batch size.
- `materializing`: the node must consume all input before producing output.
  Sorts and some aggregations are materializing.
- `predicate pushdown`: filter condition was absorbed into the scan.
- `v4 stats`: zone-map statistics are available for row group pruning.
- `hash index`: a `.vtri` index is being used for row group identification.
- `N/M cols (pruned)`: only N of M columns in the file are actually read.

When diagnosing a slow query, start at the ScanNode and work upward. The scan
is where data enters the pipeline, so any inefficiency there propagates
through every downstream node.

If the ScanNode lacks a `predicate pushdown` annotation but the query has a
`filter()`, the most common cause is an intermediate node sitting between the
filter and the scan. A `mutate()` that creates a column referenced by the
filter blocks pushdown, because the scan cannot evaluate an expression that
depends on a column that does not exist on disk. The fix is to rewrite the
query so that filters on raw columns happen before any mutate. Sometimes two
filters are needed: one on raw columns (pushed down) and one on computed
columns (applied later). This is a valid pattern and the engine handles it
correctly.
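
For example, a coarse filter on the raw column can push down into the scan
while the precise filter on the computed column runs after the mutate:

```{r pushdown-two-filters}
tbl(f) |>
  filter(value > 50) |>            # raw column: pushed into the scan
  mutate(scaled = value / 100) |>
  filter(scaled > 0.9) |>          # computed column: applied after the mutate
  explain()
```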

Another common surprise: the plan shows `predicate pushdown` but the query
is still slow. This can happen when the predicate is not selective enough to
prune row groups. A filter like `value > 0` on a column where every row
group's min is below 0 and max is above 0 will push down successfully but
prune nothing. The `explain()` output confirms that the machinery is in
place; actual pruning depends on the data distribution and row group
boundaries.
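
Our uniform data makes this easy to demonstrate: every row group's `value`
range spans nearly the full 0 to 100 domain, so the predicate below pushes
down yet prunes nothing.

```{r pushdown-nonselective}
tbl(f) |>
  filter(value > 0) |>
  explain()
```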

If the ScanNode reads all columns, check whether adding a `select()` before
a heavy operation (join, sort, aggregation) could shrink the column set. On
wide tables this can cut I/O by an order of magnitude. If the ScanNode lacks
a hash index annotation for an equality filter, consider creating one.

A useful diagnostic workflow: run `explain()`, note the ScanNode annotations,
make a structural change (reorder verbs, add an index, sort the file), then
run `explain()` again and compare. The annotations change immediately; there
is no caching or stale state to worry about.

```{r explain-agg}
tbl(f) |>
  filter(year >= 2020) |>
  group_by(site) |>
  summarise(avg = mean(value), n = n()) |>
  explain()
```

This plan shows a GroupAggNode above a FilterNode above a ScanNode. The filter
pushes down, zone maps may prune row groups where all years are below 2020, and
the aggregation hashes on `site`. Column pruning eliminates `species` and
`quality` since neither the filter nor the aggregation references them.

## Materialized blocks

Sometimes we have a reference table that we need to look up against repeatedly:
a taxonomy backbone, a site metadata table, a species trait database. Joining
this table into every query works but re-reads the reference data each time.
`materialize()` loads a query result into memory once and keeps it there as
a `vectra_block` for repeated O(1) lookups.

```{r materialize-block}
# Build a small reference table
ref <- data.frame(
  species  = species,
  common   = paste("Common name for", species),
  family   = sample(paste0("fam_", 1:10), 80, replace = TRUE),
  stringsAsFactors = FALSE
)
f_ref <- tempfile(fileext = ".vtr")
write_vtr(ref, f_ref)

blk <- materialize(tbl(f_ref))
blk
```

The block builds an internal hash table lazily on first lookup. Subsequent
lookups on the same column are immediate.

```{r block-lookup}
hits <- block_lookup(blk, "species", c("sp_010", "sp_042", "sp_001"))
hits
```

`block_lookup()` returns a data.frame with a `query_idx` column indicating
which query key each row matched, plus all columns from the block. This is
equivalent to a semi-join filtered to the query keys, but without rebuilding
a hash table each time.

For approximate matching, `block_fuzzy_lookup()` computes string distances
between query keys and the block column. This is useful for reconciling species
names across databases where spelling varies.

```{r block-fuzzy}
# Deliberately misspelled queries
fuzzy_hits <- block_fuzzy_lookup(
  blk, "species",
  keys = c("sp_10", "sp_42"),
  method = "dl",
  max_dist = 0.3
)
fuzzy_hits
```

The `method` argument accepts `"dl"` (Damerau-Levenshtein, the default),
`"levenshtein"`, or `"jw"` (Jaro-Winkler). The `max_dist` threshold controls
how permissive the matching is; lower values require closer matches.

When the reference table has a blocking column (e.g., genus) that can narrow
the search space, use `block_col` and `block_keys` to restrict comparisons to
rows sharing the same block value. This turns an O(N * M) fuzzy search into
O(N * m) where m is the average block size.

```{r block-fuzzy-blocking}
# Add a genus column for blocking
ref2 <- ref
ref2$genus <- substr(ref2$species, 1, 5)
f_ref2 <- tempfile(fileext = ".vtr")
write_vtr(ref2, f_ref2)

blk2 <- materialize(tbl(f_ref2))

fuzzy_blocked <- block_fuzzy_lookup(
  blk2, "species",
  keys       = c("sp_10", "sp_42"),
  method     = "dl",
  max_dist   = 0.3,
  block_col  = "genus",
  block_keys = c("sp_01", "sp_04")
)
fuzzy_blocked
```

A block differs from a hash index in a fundamental way: it holds the entire
table in RAM, not just a mapping from values to row group IDs. This makes
lookups faster (no disk I/O at all), but the memory cost is the full
uncompressed size of the table. A reference table with 10,000 rows and 5
columns is negligible. A table with 10 million rows might consume hundreds of
megabytes, and at that scale a streaming hash join against a `.vtr` file is
the better approach.

Blocks are also ephemeral. They live in the R session and are not persisted
to disk. When the session ends, the block and its internal hash table
disappear. This is by design: blocks are caches, not storage. If we need
the reference table again in a later session, we re-materialize from the
`.vtr` file, which takes a fraction of a second for small tables. There is
no `.vtri`-style sidecar to maintain or worry about going stale.

When a block is better than an index: the reference table is small (under
100k rows), lookups happen many times in the same session (e.g., inside a
loop or across multiple queries), and the lookup pattern is exact-match or
fuzzy-match on a key column. The block's internal hash table builds lazily
on the first `block_lookup()` call for a given column and stays resident for
all subsequent calls on that column. If the workflow involves a single lookup
and then discarding the result, the setup cost of materializing the block may
not be worth it compared to a simple join.
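
The lazy build can be seen with a crude timing comparison: the first lookup
on a column pays the construction cost, later lookups do not. On a table this
small both calls are nearly instant, so treat the numbers as illustrative.

```{r block-lazy-timing}
system.time(block_lookup(blk, "common", "Common name for sp_001"))  # builds the table
system.time(block_lookup(blk, "common", "Common name for sp_002"))  # reuses it
```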

## Practical guidance

The optimizations described above interact in predictable ways. Here is a
decision framework for choosing which ones to invest in.

**When to create a hash index.**
Indexes pay off when three conditions hold: the column has high cardinality
(hundreds or thousands of distinct values), the filter is an equality predicate
or `%in%` check, and the query runs more than once. A column with 5 distinct
values benefits little from an index because zone maps already identify the
relevant row groups in most layouts. A column filtered with range predicates
(`>`, `<`, `between`) gets no benefit because the index only supports equality
lookups.

**When NOT to create an index.**
Small files (under 100k rows) are fast enough to scan entirely. Files that
are always read in full (no filtering) gain nothing. Columns used exclusively
in range predicates should rely on zone maps and sorted layouts instead.

One-off queries also don't justify an index. Building the `.vtri` file
requires a full pass over the data, which is roughly as expensive as the
scan we're trying to avoid. If the query runs once, we've spent the same
I/O budget twice (once to build, once to query) for no net gain. Indexes
amortize their creation cost over many queries. As a rough heuristic, if
the query will run fewer than 5 times, a full scan is likely cheaper in
total wall time.

Disk space is another consideration. Each `.vtri` file stores the full set
of distinct values for the indexed column plus a bitmap per value. For a
column with 100,000 distinct strings averaging 20 bytes, the index file
will be a few megabytes. For a column with 10 million distinct values, the
index can rival the size of the data file itself. Composite indexes are
slightly larger because they store compound hashes. `has_index()` confirms
whether an index exists; checking the file size directly shows its cost.
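
For our small file the index is tiny, but the check is one line, using the
sidecar naming convention described earlier:

```{r index-size}
file.size(paste0(f, ".site.vtri"))
```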

Each index is a sidecar file that must be maintained; if the `.vtr` file
changes via `append_vtr()`, the index must be recreated to include the new
row groups.

**Row group sizing for zone maps.**
Smaller row groups give zone maps finer-grained pruning boundaries. If the
data is sorted on the primary filter column, 10k to 50k rows per group lets the
engine skip precisely. If the data is unsorted, small groups add overhead
without improving pruning. For unsorted data, 100k to 500k rows per group is a
safer default that balances scan overhead against I/O efficiency.

**Sort order matters.**
If one column dominates filter predicates, sorting the data on that column
before writing dramatically improves zone-map pruning. The engine can also
apply binary search within row groups on sorted columns. A common pattern is
to sort ecological observation data by timestamp or site ID, since most queries
filter on one of these.

```{r sort-write}
eco_by_site <- eco[order(eco$site), ]
f_by_site <- tempfile(fileext = ".vtr")
write_vtr(eco_by_site, f_by_site, batch_size = 5000)

tbl(f_by_site) |>
  filter(site == "site_042") |>
  explain()
```

When data is sorted by `site`, each site occupies a contiguous range of row
groups. Zone maps on string columns prune all row groups outside that range,
and an index can pinpoint the exact groups.

**Index maintenance.**
Indexes are not updated automatically when data changes. After calling
`append_vtr()` to add new row groups, the existing index does not cover the
appended data. The engine will still use the stale index for the row groups
it knows about, but rows in the appended groups are invisible to the index;
they surface only when a query bypasses the index and takes the full-scan
path. An indexed query can therefore silently miss recent data until the
index is rebuilt.

The rebuild itself is fast: `create_index()` does a single sequential pass
over the file, hashing each value and recording its row group membership.
On a file with a few million rows, this takes well under a second. The
practical workflow is to batch appends (accumulate a day's or week's worth
of new data), then rebuild the index once. Rebuilding after every single
append is unnecessary overhead for most pipelines.

Recreate the index to include everything:

```{r index-rebuild}
append_vtr(eco[1:100, ], f)

# Old index is now stale -- recreate it
create_index(f, "site")
```

This is a deliberate design choice: updating the index automatically on every
append would add write-time overhead that most append workflows do not need.

**Decision tree.**
For a new `.vtr` file that will be queried repeatedly:

1. Choose a sort column based on the most common filter predicate. Sort the
   data before writing.
2. Pick a row group size. Start with 100k rows. If queries filter on the sort
   column, try 10k to 50k and benchmark.
3. If queries filter on equality predicates on a non-sort column, create a hash
   index on that column.
4. If queries filter on two columns simultaneously (AND-combined equality),
   consider a composite index.
5. Use `explain()` after each change to verify the optimizer is picking up the
   new structure.

**Composing optimizations.**
All optimizations compose. A single query can benefit from column pruning (fewer
columns read), zone-map pruning (fewer row groups read), hash index pushdown
(constant-time row group identification), and predicate pushdown (filtering
inside the scan). The engine applies every applicable technique automatically.
The only manual step is creating indexes.

```{r compose-all}
tbl(f) |>
  filter(site == "site_042", value > 80) |>
  select(site, year, value) |>
  explain()
```

This plan uses the hash index on `site` to identify candidate row groups,
zone maps on `value` to further prune within those groups, reads only 3
of 5 columns, and pushes both predicates into the scan.

```{r cleanup, include = FALSE}
unlink(c(
  f, paste0(f, ".site.vtri"), paste0(f, ".site_year.vtri"),
  f_sorted, f_mixed, paste0(f_mixed, ".species.vtri"),
  f_wide, f_ref, f_ref2, f_by_site
))
```
