---
title: "Partial Reading"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Partial Reading}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

When working with large datasets, loading an entire HDF5 object into R's memory isn't always feasible, or even necessary. `h5lite` provides an efficient "partial reading" feature via the `start` and `count` parameters of `h5_read()`.

This vignette explains why partial reading is vastly more efficient than R's standard indexing for large data, and how to use the "smart" `start` parameter across different data structures.

## Why Partial Reading Matters

If you are working with small datasets, partial reading isn't strictly necessary. By default, `h5lite` chunks data in 1 MB blocks. For objects smaller than this, reading the whole dataset into memory and subsetting it in R is perfectly fine.

However, when dealing with datasets that span gigabytes and exceed your system's RAM, partial reading becomes essential.

### The HDF5 Storage Model: Chunking and Compression

To understand why `start` and `count` are designed the way they are, it helps to understand how HDF5 stores data.

Unlike a standard CSV, HDF5 datasets are divided into "chunks" which are compressed individually. When you want to read a specific piece of data, the HDF5 library must locate the chunk containing that data, decompress the *entire* chunk into memory, and then extract your requested values.

If you request a contiguous block of data, HDF5 only needs to decompress a handful of chunks. This is incredibly fast. 

However, with a typical random-access pattern (for example, extracting a single column from a massive, row-oriented HDF5 matrix), the library has to decompress almost every chunk in the dataset just to piece together that one column. In cases like this, it is often faster to read the entire dataset into R first and then subset it.
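To put rough numbers on this, here is a back-of-the-envelope sketch in plain R. The matrix size and chunk shape below are invented for illustration; they are not `h5lite` defaults:

```{r chunk-math}
# Hypothetical layout: a 100,000 x 10,000 matrix stored in 1,000 x 100 chunks,
# so each chunk holds 100,000 values and the file contains 100 x 100 chunks.
chunk_elems <- 1000 * 100

# Contiguous request: 1,000 rows (one chunk-row of 100 chunks).
# Every decompressed value is one we asked for.
row_chunks <- 10000 / 100                # 100 chunks touched
row_decomp <- row_chunks * chunk_elems   # 10,000,000 values decompressed
row_wanted <- 1000 * 10000               # 10,000,000 values requested

# Strided request: 1 column needs a chunk from every chunk-row,
# and each of those chunks must be decompressed in full.
col_chunks <- 100000 / 1000              # 100 chunks touched
col_decomp <- col_chunks * chunk_elems   # 10,000,000 values decompressed
col_wanted <- 100000                     # only 100,000 values requested

row_decomp / row_wanted   # 1: all decompression work is useful
col_decomp / col_wanted   # 100: 99% of the decompression is wasted
```

Both requests decompress the same ten million values, but the column read keeps only one percent of them.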

### Designing for Partial Reading

If you are the one designing and writing the HDF5 file, you should actively consider optimizing your data storage for partial reading. Well-designed HDF5 files lay out large datasets in such a way that users can extract useful subsets while only decompressing a minimal number of internal chunks. For instance, if you anticipate that users will primarily extract data row-by-row, your data should be oriented so that rows are kept contiguous.

The "smart" `start` parameter is purposefully designed to work seamlessly with datasets that are arranged optimally in this way, ensuring that the most efficient access patterns are also the easiest to type.

### Memory Efficiency of `start` and `count`

Another massive benefit of partial reading is the memory footprint of the request itself. 

In standard R, if you want to extract the first ten million elements of a vector, you might write `vec[1:10000000]`. Unless R can keep that index in its compact ALTREP form, `1:10000000` is materialized as an actual vector of ten million 32-bit integers, and that index alone consumes nearly 40 MB of RAM just to be passed as an argument.

In `h5lite`, fetching those same ten million elements looks like this: `h5_read(file, "vec", start = 1, count = 10000000)`. Those two arguments are passed as simple numeric values, consuming just 16 bytes.
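You can verify the index-vector cost with base R alone; no `h5lite` code is involved. `rep()` is used below instead of `1:10000000` because modern R stores plain integer sequences in a compact ALTREP form, and exact byte counts vary slightly by platform:

```{r index-size}
# A fully materialized index of ten million integer positions
idx <- rep(1L, 1e7)
object.size(idx)   # on the order of 40 MB (10 million x 4 bytes, plus header)

# The equivalent partial-read request is two scalar arguments
object.size(1) + object.size(10000000)   # a couple hundred bytes at most
```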

---

## The "Smart" `start` Parameter

The `start` parameter is designed to spare you complex index math. Assuming your HDF5 file is well-designed and stores data in the way it will most often be retrieved, **90% of the time you only need to provide a single integer to `start`**.

When you provide a single integer, `start` automatically applies itself to the most meaningful dimension of the dataset:

* **1D Vector:** `start` specifies the **element**.
* **2D Matrix:** `start` specifies the **row**.
* **2D Data Frame:** `start` specifies the **row**.
* **3D Array:** `start` specifies the **2D matrix**.

The `count` parameter is an optional single integer that simply says, "Starting from `start`, how many of these structural units do you want to read?" 

### Single-Value Examples

Here is how this intuitive behavior looks in practice across different shapes of data when fetching a block of units:

```{r single-value}
library(h5lite)
file <- tempfile(fileext = ".h5")

# --- 1. Vectors (Element-level targeting) ---
h5_write(seq(10, 100, by = 10), file, "my_vector")

# Start at the 4th element, read 3 elements
h5_read(file, "my_vector", start = 4, count = 3)

# --- 2. Matrices (Row-level targeting) ---
mat <- matrix(1:50, nrow = 10, ncol = 5)
h5_write(mat, file, "my_matrix")

# Start at row 5, read 3 complete rows (automatically spans all columns)
h5_read(file, "my_matrix", start = 5, count = 3)

# --- 3. Data Frames (Row-level targeting) ---
h5_write(mtcars, file, "my_mtcars")

# Start at row 10, read 5 complete rows
h5_read(file, "my_mtcars", start = 10, count = 5)

# --- 4. 3D Arrays (Matrix-level targeting) ---
arr <- array(1:24, dim = c(2, 3, 4)) 
h5_write(arr, file, "my_array")

# Start at the 2nd matrix, read 2 complete matrices
h5_read(file, "my_array", start = 2, count = 2)
```

### Dimension Simplification (Exact vs. Range Indexing)

`h5lite` mimics R's native subsetting behavior when it comes to preserving or dropping dimensions. This behavior is controlled entirely by whether you include the `count` argument.

**Exact Indexing (Omitting `count`):** If you provide `start` but omit `count`, `h5lite` assumes you are requesting an exact point index. It will read one unit and **drop** the targeted dimension to simplify the resulting data structure.

```{r exact-index}
# Read exactly row 5 of the matrix. 
# The row dimension is dropped, returning a 1D vector.
row_vec <- h5_read(file, "my_matrix", start = 5)
row_vec

class(row_vec)
```

**Range Indexing (Providing `count`):** If you explicitly provide `count` (even `count = 1`), `h5lite` assumes you are reading a range, and the dataset's original dimensions are **preserved**. This is especially useful in dynamic code: it guarantees that your matrix remains a matrix even when a batch loop happens to fetch only a single row.

```{r range-index}
# Read row 5, but signal a range request by setting count = 1.
# The original geometry is preserved, returning a 1x5 matrix.
row_mat <- h5_read(file, "my_matrix", start = 5, count = 1)
row_mat

class(row_mat)
```
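That dimension guarantee makes fixed-geometry batch loops straightforward. A minimal sketch, reusing the 10-row `my_matrix` written earlier; the row count is assumed known, and the loop clamps `count` itself since out-of-range behavior is not covered here:

```{r batch-loop}
batch_size <- 4
n_rows <- 10  # the known extent of my_matrix

for (offset in seq(1, n_rows, by = batch_size)) {
  n <- min(batch_size, n_rows - offset + 1)  # clamp the final batch
  batch <- h5_read(file, "my_matrix", start = offset, count = n)
  # Because count is supplied, batch is a matrix even when n == 1
  stopifnot(is.matrix(batch))
}
```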

### Drilling Down: Multi-Value `start` and N-Dimensional Arrays

While the single-value form covers most use cases, `start` is flexible enough to target individual inner dimensions for unusual or highly specific extractions.

If you need to extract a specific contiguous block *inside* a matrix or array, you can pass a vector of integers to `start`. When you do this, the `count` dropping rules apply to the **last** dimension you specify, while all preceding dimensions are treated as exact point indices and dropped unconditionally.

To make this intuitive, `start` maps its values to the dataset's dimensions in a specific priority order, targeting the "outermost" structural blocks first, and the specific rows/columns last. For any N-dimensional array, the mapping order is:

* **Priority Order:** `Dimension N, Dimension N-1, ..., Dimension 3, Dimension 1 (Rows), Dimension 2 (Cols)`

For a 3D array, this means the first value targets the matrix, the second targets the row, and the third targets the column.

```{r multi-value}
# Matrix: Start at row 5, column 2, and read 3 elements along that row.
# The row is an exact point index (dropped). The columns are a range (preserved).
# Returns a 1D vector of length 3.
h5_read(file, "my_matrix", start = c(5, 2), count = 3)

# Matrix: Extract exactly row 5, column 2. 
# Because count is omitted, the final dimension is also dropped.
# Returns an unnamed scalar value.
h5_read(file, "my_matrix", start = c(5, 2))

# 3D Array: Target matrix 2, row 1.
# The matrix and row are exact point indices (dropped). 
# Returns a 1D vector containing the columns of that specific row.
h5_read(file, "my_array", start = c(2, 1))
```

*(Note: Data frames are a special case. Because HDF5 stores data frames as 1-dimensional lists of compound records, they do not have columns in the same structural way a matrix does. Therefore, `start` for a data frame must always be a single integer targeting the row. To get specific columns, read the rows you need first, then subset the columns in R.)*
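For instance, reusing the `my_mtcars` dataset written earlier:

```{r df-columns}
# Partial read of 5 rows, then ordinary column subsetting in R
batch <- h5_read(file, "my_mtcars", start = 10, count = 5)
batch[, c("mpg", "cyl")]
```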

```{r cleanup, include=FALSE}
unlink(file)
```
