---
title: "Reading SNPs from Format 5 Files"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Reading SNPs from Format 5 Files}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, echo = FALSE}
library(BinaryDosage)
```

# Introduction

The BinaryDosage package provides three functions for reading individual SNPs
from Format 5 binary dosage files. They return the same data but differ in how
they manage memory and file connections, which matters when reading many SNPs
in a loop.

| Function | Allocates output? | Opens file per call? |
|---|---|---|
| `getbd5snp` | Yes — returns a new list | Yes |
| `getbd5snp_buf` | No — writes into caller vectors | Yes |
| `getbd5snp_con` | No — writes into caller vectors | No — reuses open connection |

All three functions require the `"genetic-info"` list returned by `getbdinfo`.

# Setup

The examples below use the small bgzipped VCF file included with the package.
First, convert it to a Format 5 file pair and load the metadata.

```{r setup_files, message = FALSE, warning = FALSE}
bdose_file <- file.path(tempdir(), "set1a.bdose")

if (requireNamespace("vcfppR", quietly = TRUE)) {
  vcftobd(vcffile    = system.file("extdata", "set1a.vcf.gz", package = "BinaryDosage"),
          bdose_file = bdose_file)
} else {
  updatebd(bdfiles    = system.file("extdata", "vcf1a.bdose", package = "BinaryDosage"),
           bdose_file = bdose_file)
}
bd5 <- getbdinfo(bdose_file)

n_snps <- nrow(bd5$snps)
n_samp <- nrow(bd5$samples)
cat("SNPs:", n_snps, " Samples:", n_samp, "\n")
```

---

# getbd5snp

`getbd5snp` is the simplest interface. It opens the `.bdose` file, seeks to
the requested SNP's compressed block, decompresses it, decodes the values, and
returns them as a new list. The file is opened and closed on every call.

## Parameters

- `bd5info` — object returned by `getbdinfo`
- `snp` — 1-based integer index or character SNP ID

## Return value

A list with four numeric vectors of length *n_samples*:

- `dosage` — DS values in \[0, 2\]; `NA` = missing
- `p0` — Pr(*g*=0) in \[0, 1\]; `NA` = missing
- `p1` — Pr(*g*=1) in \[0, 1\]; `NA` = missing
- `p2` — Pr(*g*=2) in \[0, 1\]; `NA` = missing

## Reading a SNP by index

```{r getbd5snp_index}
snp1 <- getbd5snp(bd5info = bd5, snp = 1L)

knitr::kable(
  data.frame(
    SampleID = head(bd5$samples$sid, 10),
    Dosage   = round(head(snp1$dosage, 10), 4),
    P_00     = round(head(snp1$p0,     10), 4),
    P_01     = round(head(snp1$p1,     10), 4),
    P_11     = round(head(snp1$p2,     10), 4)
  ),
  caption = paste("SNP", bd5$snps$snpid[1], "— first 10 subjects")
)
```

## Reading a SNP by ID

```{r getbd5snp_id}
snp3 <- getbd5snp(bd5info = bd5, snp = "1:12000:T:C")

knitr::kable(
  data.frame(
    SampleID = head(bd5$samples$sid, 10),
    Dosage   = round(head(snp3$dosage, 10), 4),
    P_00     = round(head(snp3$p0,     10), 4),
    P_01     = round(head(snp3$p1,     10), 4),
    P_11     = round(head(snp3$p2,     10), 4)
  ),
  caption = "SNP 1:12000:T:C — first 10 subjects"
)
```

## Using getbd5snp in a loop

`getbd5snp` is convenient but allocates a new list and opens the file on every
call. For a small number of SNPs this is fine. The following calculates the
alternate allele frequency for every SNP.

```{r getbd5snp_loop}
aaf <- numeric(n_snps)
for (i in seq_len(n_snps)) {
  aaf[i] <- mean(getbd5snp(bd5, i)$dosage, na.rm = TRUE) / 2
}

knitr::kable(
  data.frame(snpid = bd5$snps$snpid, aaf = round(aaf, 4)),
  caption = "Alternate allele frequencies via getbd5snp"
)
```

---

# getbd5snp_buf

`getbd5snp_buf` eliminates the per-call list allocation by writing results
directly into four numeric vectors that the caller pre-allocates once before
the loop. This avoids repeated garbage collection pressure when reading many
SNPs. The file is still opened and closed on every call.

## Parameters

- `bd5info` — object returned by `getbdinfo`
- `snp` — 1-based integer index or character SNP ID
- `dosage`, `p0`, `p1`, `p2` — pre-allocated `numeric(n_samples)` vectors

## Return value

`NULL` invisibly. The four output vectors are updated in place.

## Important: R copy-on-modify semantics

The output vectors **must not have more than one R binding** at the call site.
If another variable also points to the same vector, R's copy-on-modify rule
will copy the vector before writing, so the caller's variable will not be
updated. Always use plain, freshly allocated vectors:

```r
# Correct
dosage <- numeric(n_samp)
getbd5snp_buf(bd5, 1L, dosage, p0, p1, p2)

# Wrong — dosage2 creates a second binding; the update may not be visible
dosage2 <- dosage
getbd5snp_buf(bd5, 1L, dosage, p0, p1, p2)
```

## Example

```{r getbd5snp_buf_loop}
dosage <- numeric(n_samp)
p0     <- numeric(n_samp)
p1     <- numeric(n_samp)
p2     <- numeric(n_samp)

aaf_buf <- numeric(n_snps)
for (i in seq_len(n_snps)) {
  getbd5snp_buf(bd5info = bd5, snp = i, dosage = dosage,
                p0 = p0, p1 = p1, p2 = p2)
  aaf_buf[i] <- mean(dosage, na.rm = TRUE) / 2
}

knitr::kable(
  data.frame(snpid = bd5$snps$snpid, aaf = round(aaf_buf, 4)),
  caption = "Alternate allele frequencies via getbd5snp_buf"
)
```

---

# getbd5snp_con

`getbd5snp_con` is the highest-performance option. It combines the
pre-allocated vector approach of `getbd5snp_buf` with a persistent open file
connection, so the `.bdose` file is opened once before the loop and kept open
for all reads. The C-level read uses zlib directly rather than R's `memDecompress`.

Use this when reading a large number of SNPs sequentially and minimizing
elapsed time matters.

## Workflow

1. Call `openbd5con(bd5info)` to open the file and get a `"bd5con"` object.
2. Call `getbd5snp_con` inside the loop, passing the `"bd5con"` object.
3. Call `closebd5con` when finished to release the file handle promptly.
   (The connection is also closed automatically when the `"bd5con"` object
   is garbage-collected or when R exits, so the explicit close is optional
   but recommended.)

## openbd5con

Opens a persistent connection to the `.bdose` file.

**Parameters**

- `bd5info` — object returned by `getbdinfo`

**Return value**

An object of class `"bd5con"` to be passed to `getbd5snp_con` and
`closebd5con`.

## closebd5con

Explicitly closes the connection.

**Parameters**

- `bd5con` — object returned by `openbd5con`

**Return value**

`NULL` invisibly.

## getbd5snp_con

**Parameters**

- `bd5info` — object returned by `getbdinfo`
- `snp` — 1-based integer index or character SNP ID
- `dosage`, `p0`, `p1`, `p2` — pre-allocated `numeric(n_samples)` vectors
- `bd5con` — object returned by `openbd5con`

**Return value**

`NULL` invisibly. The four output vectors are updated in place.

The same copy-on-modify constraint as `getbd5snp_buf` applies.

## Example

```{r getbd5snp_con_loop}
dosage <- numeric(n_samp)
p0     <- numeric(n_samp)
p1     <- numeric(n_samp)
p2     <- numeric(n_samp)

con <- openbd5con(bd5)

aaf_con <- numeric(n_snps)
for (i in seq_len(n_snps)) {
  getbd5snp_con(bd5info = bd5, snp = i, dosage = dosage,
                p0 = p0, p1 = p1, p2 = p2, bd5con = con)
  aaf_con[i] <- mean(dosage, na.rm = TRUE) / 2
}

closebd5con(con)

knitr::kable(
  data.frame(snpid = bd5$snps$snpid, aaf = round(aaf_con, 4)),
  caption = "Alternate allele frequencies via getbd5snp_con"
)
```

---

# Choosing a reader

- **`getbd5snp`** — use when reading a small number of SNPs, or when
  simplicity matters more than speed.
- **`getbd5snp_buf`** — use in loops where allocation overhead is measurable
  but the file seek cost is acceptable.
- **`getbd5snp_con`** — use when reading all or most SNPs sequentially and
  elapsed time is important. The persistent connection avoids both the
  allocation and the file open/close on every iteration.

All three functions produce identical results, as confirmed below.

```{r verify}
all(aaf == aaf_buf)
all(aaf == aaf_con)
```

```{r cleanup, include = FALSE}
unlink(c(bdose_file, paste0(bdose_file, ".bdi")))
```
