Getting started with openVA

Zehang Richard Li, Jason Thomas, Eungang Choi, Tyler H. McCormick, Samuel J Clark



Verbal autopsy (VA) is a well-established approach to ascertaining the cause of death (COD) when medical certification and full autopsies are not feasible or practical (Garenne 2014; Taylor et al. 1983). After a death is identified, a specially-trained fieldworker interviews the caregivers (usually family members) of the decedent. A typical VA interview includes a set of structured questions with categorical or quantitative responses and a narrative section that records the “story” of the death from the respondent’s point of view (World Health Organization 2012).

This vignette aims to provide a quick introduction to the openVA package using simple case studies. For more detailed discussion of the package and related VA methods, please refer to (Zehang R. Li et al. 2021)

Overview of openVA

The openVA package (Zehang Richard Li et al. 2022) addresses this issue through an open-source toolkit. The openVA suite comprises a collection of R packages for the analysis of verbal autopsy data. The goal of this package is to provide researchers with an open-source tool that supports flexible data input formats and allows easier data manipulation and model fitting. The openVA suite consists of four core packages that are on CRAN, InterVA4 (Zehang R. Li et al. 2019), InterVA5 (Thomas et al. 2021), InSilicoVA (Zehang R. Li, McCormick, and Clark 2022), and Tariff (Zehang R. Li, McCormick, and Clark 2018), and an optional package nbc4va (Wen et al. 2018). Each of these packages implements one coding algorithm. For three of these algorithms – namely, InterVA-4, InterVA-5 and Tariff – there are also compiled software programs distributed by the original authors.

The openVA package is hosted on CRAN and can be installed with the following commands. For the analysis in this paper, we also install the nbc4va package separately from GitHub. The versions of the supporting packages can be checked in R using the openVA_status() function.


The common modeling framework for VA data consists of first converting the survey responses into a series of binary variables, and then assigning a COD to each death based on the binary input variables. Typically, the target of inference consists of two parts: the individual cause-of-death assignment, and the population-level cause-specific mortality fractions (CSMF), i.e., the fraction of deaths due to each cause. In this section, we formally compare the modeling approaches utilized by each algorithm for these two targets. We adopt the following notations. Consider \(N\) deaths, each with \(S\) binary indicators of symptoms. Let \(s_{ij}\) denote the indicator for the presence of \(j\)-th symptom in the \(i\)-th death, which can take values 0, 1, or NA (for missing data). We consider a pre-defined set of causes of size \(C\). For the \(i\)-th death, denote the COD by \(y_i \in \{1, ..., C\}\) and the probability of dying from cause \(k\) is denoted by \(P_{ik}\). For the population, the CSMF of cause \(k\) is denoted as \(\pi_k\), with \(\sum_{k=1}^C \pi_k = 1\).

In addition to the different model specifications underlying each algorithm, there is also a major conceptual difference in handling missing symptoms across the algorithms. Missing symptoms could arise from different stages of the data collection process. For example, the respondent may not know whether certain symptoms existed or may refuse to answer a question. From a statistical point of view, knowing that a symptom does not exist provides some information to the possible cause assignment, while a missing symptom does not. Although in theory, most of the VA algorithms could benefit from distinguishing ‘missing’ from ‘absent’, InSilicoVA is the only algorithm that has been implemented to acknowledge missing data. Missing indicators are assumed to be equivalent to ‘absence’ in InterVA, NBC, and Tariff.

Data preparation

In the openVA package, we consider two main types of standardized questionnaire: the WHO instrument and the IHME questionnaire. In this section, we focus on describing these two data formats and tools to clean and convert data. Pre-processing the raw data collected from the survey instrument (usually with Open Data Toolkit) is usually performed with additional packages and software outside of the analysis pipeline in R. We briefly mention software for data pre-processing towards the end of this paper.

The WHO standard format

For users familiar with InterVA software and the associated data processing steps, the standard input format from the WHO 2012 and 2016 instruments is usually well understood. For the 2012 instrument, the data expected by the InterVA-4 software are organized into a data frame where each row represents one death and the corresponding VA information is contained in \(246\) fields, starting from the first item being the ID of the death. The \(245\) items following the ID each represent one binary variable of symptom/indicator, where ‘presence’ is coded by ‘Y’, and ‘absence’ is coded by an empty cell.

To accommodate updates for the WHO 2016 instrument (D’Ambruoso et al. 2017), the InterVA-5 software accepts a data frame with \(354\) columns that include \(353\) columns of symptom/indicators followed by an additional column for the record ID. It should be noted that the R package InterVA5 retains the format with the record ID residing in the first column. Another important update with InterVA-5 is that it acknowledges the difference between “Yes” and “No” (or “Y/y” and “N/n”, which is different from the coding scheme in InterVA-4), both of which are processed as relevant responses, while all other responses are treated as missing values and ignored. With respect to the list of causes of death, InterVA-5 utilizes the WHO 2016 COD categories, which is nearly identical to the WHO 2012 COD categories (used by InterVA-4) except that hemorrhagic fever and dengue fever are two separate categories in the 2016 COD categories.

The same input format is inherited by the openVA package, except for one modification. We further distinguish ‘missing’ and ‘absent’ in the input data frame explicitly. We highly recommend that users pre-process all the raw data so that a ‘missing’ value in the data spreadsheet is coded as a ‘.’ (following the Stata practice familiar to many VA practitioners), and an ‘absent’ value is indicated by an empty cell, as in the standard InterVA-4 software. For WHO 2016 data, both ‘.’ and ‘-’ (following the default coding scheme of InterVA-5 software) are interpreted as missing values. For methods other than InSilicoVA, ‘missing’ and ‘absent’ will be considered the same internally and thus will not introduce a compatibility problem.

The PHMRC format

The Population Health Metrics Research Consortium (PHMRC) gold standard VA data (Murray et al. 2011) consist of three datasets corresponding to adult, child, and neonatal deaths, respectively. All deaths occurred in health facilities and gold-standard causes are determined based on laboratory, pathology and medical imaging findings. These datasets can be downloaded directly using the link returned by the function getPHMRC_url(). For example, we can read the adult VA dataset using the following command.

PHMRC_adult <- read.csv(getPHMRC_url("adult"))

Although the data are publicly accessible, a major practical challenge for studies involving the PHMRC gold standard dataset is that the pre-processing steps described from the relevant publications are not clear enough nor easy to implement. The openVA package internally cleans up the PHMRC gold standard data when calling the codeVA() function on the PHMRC data. The procedure follows the steps described in the supplement material of McCormick et al. (2016). Notice that the original PHMRC data are useful for comparing and validating new methods, as well as for using as training data, but the cleaning functions only require that the columns are exactly the same as the PHMRC gold standard datasets, so they could also be used for new data that are pre-processed into the same format.

Customized format

In addition to the two standard questionnaires discussed previously, researchers might also be interested in including customized dichotomous symptoms in their analysis. The openVA package also supports customized inputs as long as they are dichotomous. In such case, neither the built-in conditional probability matrix of InterVA nor the PHMRC gold standard dataset could be used to learn the relationship between training and testing data, thus different training data with known causes of death are necessary for all three algorithms. The ConvertData() function can be used to convert data with customized coding schemes into the format recognized by the openVA package.

Modeling data collected with WHO 2012 questionnaire

In the first example, we demonstrate fitting InterVA and InSilicoVA on a random sample of \(1,000\) deaths from the ALPHA network without cause-of-death labels collected with the WHO 2012 instrument. The dataset is already included in the openVA package as a dataset RandomVA1 and can be loaded directly.

## [1] 1000  246
head(RandomVA1[, 1:10])
##   ID elder midage adult child under5 infant neonate male female
## 1 d1     Y                                             Y       
## 2 d2     Y                                                    Y
## 3 d3            Y                                      Y       
## 4 d4                  Y                                       Y
## 5 d5                  Y                                Y       
## 6 d6                  Y                                       Y

The codeVA() function provides a standardized syntax to fit different VA models. Internally, the codeVA() function organizes the input data according to the specified data type, checks for incompatibility of the data and specified model, and calls the corresponding model fitting functions. It returns a classed object of the specified model class. In this example, we use version 4.03 of the InterVA software, which is the latest release of the original software compatible with the WHO 2012 instrument. Any additional model-specific parameters can be passed through the arguments of codeVA(). Here we specify the HIV and malaria prevalence levels required by the InterVA model to be ‘high’. Guidelines on how to set these parameters can be found in Byass et al. (2012).

fit_inter_who <- codeVA(data = RandomVA1, data.type = "WHO2012", 
                        model = "InterVA", version = "4.03", 
                        HIV = "h", Malaria = "h")
## InterVA-4 fitted on 1000 deaths
## CSMF calculated using reported causes by InterVA-4 only
## The remaining probabilities are assigned to 'Undetermined'
## Top 5 CSMFs:
##  cause                     likelihood
##  Undetermined              0.154     
##  HIV/AIDS related death    0.122     
##  Stroke                    0.072     
##  Reproductive neoplasms MF 0.058     
##  Pulmonary tuberculosis    0.055

We can implement InSilicoVA method with similar syntax. We use the default parameters and run the MCMC for \(10,000\) iterations. Setting the auto.length argument to FALSE specifies that the algorithm does not automatically increase the length of the chain when convergence failed. The InSilicoVA algorithm is implemented using a Metropolis-Hastings within Gibbs sampler. The acceptance rate is printed as part of the message as the model samples from the posterior distribution.

fit_ins_who <- codeVA(RandomVA1, data.type = "WHO2012", model = "InSilicoVA",
                    Nsim = 10000, auto.length = FALSE)
## InSilicoVA Call: 
## 1000 death processed
## 10000 iterations performed, with first 5000 iterations discarded
##  250 iterations saved after thinning
## Fitted with re-estimated conditional probability level table
## Data consistency check performed as in InterVA4
## Top 10 CSMFs:
##                                    Mean Std.Error Lower Median Upper
## Other and unspecified infect dis  0.266    0.0168 0.235  0.265 0.301
## HIV/AIDS related death            0.102    0.0091 0.085  0.102 0.119
## Renal failure                     0.101    0.0108 0.084  0.101 0.123
## Other and unspecified neoplasms   0.062    0.0089 0.046  0.061 0.080
## Other and unspecified cardiac dis 0.058    0.0076 0.044  0.058 0.075
## Digestive neoplasms               0.050    0.0077 0.033  0.050 0.065
## Acute resp infect incl pneumonia  0.048    0.0073 0.034  0.049 0.063
## Pulmonary tuberculosis            0.039    0.0068 0.025  0.039 0.054
## Stroke                            0.038    0.0061 0.027  0.038 0.052
## Other and unspecified NCD         0.034    0.0089 0.018  0.034 0.052

Modeling the PHMRC data

In the second example, we consider a prediction task using the PHMRC adult dataset. We first load the complete PHMRC adult dataset from its on-line repository, and organize it into training and test datasets. We treat all deaths from Andhra Pradesh, India as the test dataset.

PHMRC_adult <- read.csv(getPHMRC_url("adult"))
is.test <- which(PHMRC_adult$site == "AP")
test <- PHMRC_adult[is.test, ]
train <- PHMRC_adult[-is.test, ]
## [1] 1554  946
## [1] 6287  946

In order to fit the models on the PHMRC data, we specify data.type = "PHMRC" and phmrc.type = "adult" to indicate the data input is collected using the PHMRC adult questionnaire. We also specify the column of the causes-of-death label in the training data. The rest of the syntax is similar to the previous example.

When the input consists of both training and testing data, the InterVA and InSilicoVA algorithms estimate the conditional probabilities of symptoms using the training data, instead of using the built-in values. In such case, the version argument for the InterVA algorithm is suppressed. There are several ways to map the conditional probabilities of symptoms given causes in the training dataset to a letter grade system, specified by the convert.type argument. The convert.type = "quantile" performs the mapping so that the percentile of each rank stays the same as the original \(P_{s|c}\) matrix in InterVA software. Alternatively we can also use the original fixed values of translation, and assign letter grades closest to each entry in \(\hat{P}_{s|c}\). This conversion is specified by convert.type = "fixed", and is more closely aligned to the original InterVA and InSilicoVA setting. Finally, we can also directly use the values in the \(\hat{P}_{s|c}\) without converting them to ranks and re-estimating the values associated with each rank. This can be specified by convert.type = "empirical". In this demonstration, we assume the fixed value conversion.

fit_inter <- codeVA(data = test, data.type = "PHMRC", model = "InterVA", 
                     data.train = train, causes.train = "gs_text34", 
                     phmrc.type = "adult", convert.type = "fixed")
fit_ins <- codeVA(data = test, data.type = "PHMRC", model = "InSilicoVA",
                    data.train = train, causes.train = "gs_text34", 
                    phmrc.type = "adult", convert.type = "fixed", 
                    Nsim=10000, auto.length = FALSE)

The NBC and Tariff method can be fit using similar syntax.

fit_nbc <- codeVA(data = test, data.type = "PHMRC", model = "NBC", 
                   data.train = train, causes.train = "gs_text34", 
                   phmrc.type = "adult")
fit_tariff <- codeVA(data = test, data.type = "PHMRC", model = "Tariff",
                     data.train = train, causes.train = "gs_text34", 
                     phmrc.type = "adult")

Notice that we do not need to transform the PHMRC data manually. Data transformations are performed automatically within the codeVA() function.

Summarizing results

In this section we demonstrate how to summarize results, extract output, and visualize and compare fitted results. All the fitted object returned by codeVA() are S3 objects, for which a readable summary of model results can be obtained with the summary() function as shown in the previous section. In addition, several other metrics are commonly used to evaluate and compare VA algorithms at either the population or individual levels. In the rest of this section, we show how to easily calculate and visualize some of these metrics with the openVA package.

CSMF Accuracy

We can extract the CSMFs directly using the getCSMF() function. The function returns a vector of the point estimates of the CSMFs, or a matrix of posterior summaries of the CSMF for the InSilicoVA algorithm.

csmf_inter <- getCSMF(fit_inter)
csmf_ins <- getCSMF(fit_ins)
csmf_nbc <- getCSMF(fit_nbc)
csmf_tariff <- getCSMF(fit_tariff)

One commonly used metric to evaluate the CSMF estimates is the so-called CSMF accuracy, defined as \[ CSMF_{acc} = 1 - \frac{\sum_j^C CSMF_i - CSMF_j^{(true)}}{2(1 - \min CSMF^{(true)})} \] The CSMF accuracy can be readily computed using functions in openVA as the codes below shows.

csmf_true <- table(c(test$gs_text34, unique(PHMRC_adult$gs_text34))) - 1
csmf_true <- csmf_true / sum(csmf_true)
c(getCSMF_accuracy(csmf_inter, csmf_true, undet = "Undetermined"), 
  getCSMF_accuracy(csmf_ins[, "Mean"], csmf_true), 
  getCSMF_accuracy(csmf_nbc, csmf_true), 
  getCSMF_accuracy(csmf_tariff, csmf_true))
## [1] 0.53 0.74 0.77 0.68

We use the empirical distribution in the test data to calculate the true CSMF distribution, i.e., $CSMF_i^{(true)} = $. Then we evaluate the CSMF accuracy using the getCSMF_accuracy() function. As discussed previously, the default CSMF calculation is slightly different for diCfferent methods. For example, the InterVA algorithm creates the additional category of Undetermined by default, which is not in the true CSMF categories and needs to be specified. The creation of the undetermined category can also be suppressed by interVA.rule = FALSE in the getCSMF() function call. For the InSilicoVA algorithm, we use the posterior mean to calculate the point estimates of the CSMF accuracy.

Individual COD summary

At the individual level, we can extract the most likely cause-of-death assignment from the fitted object using the getTopCOD() function.

cod_inter <- getTopCOD(fit_inter)
cod_ins <- getTopCOD(fit_ins)
cod_nbc <- getTopCOD(fit_nbc)
cod_tariff <- getTopCOD(fit_tariff)

With the most likely COD assignment, other types of metrics based on individual COD assignment accuracy can be similarly constructed by users. The summary methods can also be called for each death ID. For example, using the Tariff method, we can extract the fitted rankings of causes for the death with ID 6288 by

summary(fit_inter, id = "6288")
## InterVA-4 fitted top 5 causes for death ID: 6288
##  Cause                     Likelihood
##  Stroke                    0.509     
##  Pneumonia                 0.318     
##  COPD                      0.081     
##  Other Infectious Diseases 0.064     
##  Renal Failure             0.013

The summary() function for InSilcoVA does not provide uncertainty estimates for individual COD assignments by default. This is because, in practice, the calculation of individual posterior probabilities of COD distribution can be memory-intensive when the dataset is large. To obtain individual-level uncertainty measurements, we can either run the MCMC chain with the additional argument indiv.CI = 0.95 when calling codeVA(), or update the fitted object directly with the saved posterior draws.

fit_ins <- updateIndiv(fit_ins, CI = 0.95)
summary(fit_ins, id = "6288")
## InSilicoVA fitted top  causes for death ID: 6288
## Credible intervals shown: 95%
##                               Mean  Lower Median  Upper
## Stroke                      0.5043 0.3485 0.5083 0.6361
## Pneumonia                   0.4116 0.2615 0.4083 0.5834
## Other Infectious Diseases   0.0660 0.0411 0.0642 0.0966
## Epilepsy                    0.0099 0.0064 0.0097 0.0142
## COPD                        0.0053 0.0031 0.0052 0.0079
## Malaria                     0.0007 0.0005 0.0007 0.0011
## Diabetes                    0.0005 0.0003 0.0005 0.0009
## Acute Myocardial Infarction 0.0004 0.0003 0.0004 0.0006
## Falls                       0.0004 0.0001 0.0004 0.0013
## Renal Failure               0.0004 0.0002 0.0003 0.0005

For \(N\) deaths, \(C\) causes, the posterior mean of individual COD distributions returned by the InSilicoVA model, along with median and with credible intervals can be represented by a \((N \times C \times 4)\)-dimensional array. The function getIndivProb() extracts this summary in the form of a list of \(4\) matrices of dimension \(N\) by \(C\), which can then be saved to other formats to facilitate further analysis. For other methods, the point estimates of individual COD distribution are returned as the \(N\) by \(C\) matrix.

fit_prob <- getIndivProb(fit_inter)
## [1] 1554   34


The previous sections discuss how results could be extracted and examined in R. In this subsection, we show some visualization tools provided in the openVA package for presenting these results. The fitted CSMFs for the top causes can be easily visualized by the plotVA() function. The default graph components are specific to each algorithm and individual package implementations, with options for further customization. For example, the figure below shows the estimated CSMF from the InterVA algorithm in the PHMRC data example.

plotVA(fit_inter, title = "InterVA")
Top 10 CSMFs estimated by InterVA.

Top 10 CSMFs estimated by InterVA.

The CSMFs can also be aggregated for easier visualization of groups of causes. For the InterVA-4 cause list, we included an example grouping built into the package, so the aggregated CSMFs can be compared directly. In practice, the grouping of causes of deaths often needs to be determined according to context and the research question of interest. Changing the grouping can be easily achieved by modifying the grouping argument in the stackplotVA() function. For example, to facilitate the new category of Undetermined returned by InterVA, we first modify the grouping matrix to include it as a new cause and visualize the aggregated CSMF estimates in the figure below.

grouping <- SampleCategory
grouping[,1] <- as.character(grouping[,1])
grouping <- rbind(grouping, c("Undetermined", "Undetermined"))
compare <- list(InterVA4 = fit_inter_who,
                InSilicoVA = fit_ins_who)
stackplotVA(compare, xlab = "", angle = 0,  grouping = grouping)
Estimated aggregated CSMFs for InterVA-4 and InSilicoVA, adding undetermined category.

Estimated aggregated CSMFs for InterVA-4 and InSilicoVA, adding undetermined category.

The ordering of the stacked bars can also be changed to reflect the structures within the aggregated causes, as demonstrated in the figure below.

group_order <- c("TB/AIDS",  "Communicable", "NCD", "External", "Maternal",
            "causes specific to infancy", "Undetermined") 
stackplotVA(compare, xlab = "", angle = 0, grouping = grouping, 
            group_order = group_order)
Estimated aggregated CSMFs for InterVA-4 and InSilicoVA, with the causes reordered.

Estimated aggregated CSMFs for InterVA-4 and InSilicoVA, with the causes reordered.

Incorporating additional information in the InSilicoVA algorithm

Among the VA methods discussed in this paper, the InSilicoVA algorithm (McCormick et al. 2016) allows for more flexible modifications to the Bayesian hierarchical model structure when additional information is available. In this section, we illustrate two features unique to the InSilicoVA method: jointly estimating CSMFs from multiple populations, and incorporating partial and potentially noisy physician codings into the algorithm.

Sub-population specific CSMFs

In practice researchers may want to estimate and compare CSMFs for different regions, time periods, or demographic groups in the population. Running separate models on subsets of data can be inefficient and does not allow parameter estimation to borrow information across different groups. The generative framework adopted by InSilicoVA allows the specification of sub-populations in analyzing VA data. Consider an input dataset with \(G\) different sub-populations. We can estimate different CSMFs \(\pi^{(g)}\) for \(g = 1, ..., G\) for each sub-population, while assuming the same conditional probability matrix, \(P_{s|c}\) and other hyperpriors. As an example, we show how to estimate different CSMFs for sub-populations specified by sex and age groups, using a randomly sampled ALPHA dataset with additional columns specifying the sub-population each death belongs to.

head(RandomVA2[, 244:248])
##   stradm smobph scosts   sex age
## 1      .      .      .   Men 60+
## 2      .      .      . Women 60-
## 3      .      .      . Women 60-
## 4      .      .      . Women 60+
## 5      .      .      . Women 60-
## 6      .      .      . Women 60-

Then we can fit the model with one or multiple additional columns specifying sub-population membership for each observation.

fit_sub <- codeVA(RandomVA2, model = "InSilicoVA", 
              subpop = list("sex", "age"),  indiv.CI = 0.95,
              Nsim = 10000, auto.length = FALSE)

Functions discussed in the previous sections work in the same way for the fitted object with multiple sub-populations. Additional visualization tools are also available. The figure below plots the CSMFs for two sub-populations on the same plot by specify type = "compare".

plotVA(fit_sub, type = "compare", title = "Comparing CSMFs", top = 3)
Estimated CSMFs for different sub-populations. The points indicate posterior means of the CSMF and the error bars indicate 95\% credible intervals of the CSMF.

Estimated CSMFs for different sub-populations. The points indicate posterior means of the CSMF and the error bars indicate 95% credible intervals of the CSMF.

By default, the comparison plots will select all the CODs that are included in the top \(K\) for each of the sub-populations. We can also plot only subsets of them by specifying the causes of interest, as shown in the figure below.

plotVA(fit_sub, type = "compare", title = "Comparing CSMFs",
                causelist = c("HIV/AIDS related death", 
                              "Pulmonary tuberculosis", 
                              "Other and unspecified infect dis", 
                              "Other and unspecified NCD"))