Hapi: an R package for chromosome-length haplotype inference using genotypic data of single gamete cells

Ruidong Li and Zhenyu Jia
Department of Plant and Botany Sciences, Univeristy of California, Riverside

Last update: 22 July, 2018

1 Introduction

The knowledge of complete rather than fragmental haplotypes is critical in many areas of genetics studies and will advance personalized medicine including imputation of low-frequent variants, characterization of variant-disease associations, prediction of the clinical outcomes in tranplantation, drug response, and susceptibility to diseases, etc.

Population-based computational phasing methods cannot infer high-resolution haplotypes at chromosome-length and are not able to phase de novo mutations and rare variants, whereas existing experimental phasing methods usually require specialized equipment to separate single chromosomes, which is time-consuming and expensive. Whole-chromosome haplotype phasing with single diploid cells, either require a large number of sequencing libraries or need the assistance of long-read sequencing technologies to phase dense haplotypes at high accuracy thus makes it infeasible for large-scale applications. Using an individual’s genomic data of single haploid gamete cells provides a promising approach to overcome limitations in other methods for chromosome-length haplotype phasing.

Hapi a novel easy-to-use and high-efficient algorithm that only requires 3 to 5 gametes to reconstruct accurate and high-resolution haplotypes of an individual. The gamete genotype data may be generated from various platforms including genotyping arrays and next generation sequencing even with low-coverage. Hapi simply takes genotype data of known hetSNPs in single gamete cells as input and report the high-resolution haplotypes as well as the confidence level of each phased hetSNPs. The package also includes a module allowing downstream analyses and visualization of identified crossovers in the gametes.


2 Hapi package installation

Hapi can be easily installed from Github by running the following command in R:

### Install dependencies ahead
install.packages('devtools')
install.packages('HMM')

devtools::install_github('Jialab-UCR/Hapi')

If the installation fails with the ERROR: object ‘enexprs’ is not exported by ‘namespace:rlang’, please install the developmental version of rlang package first.

devtools::install_github("tidyverse/rlang", build_vignettes = TRUE)

Hapi can also be downloaded and installed locally. The download link is here.

install.packages('HMM')

install.packages('Hapi_0.0.1.tar.gz', repos = NULL, type='source')
library(Hapi)

3 Haplotype phasing module

3.1 Quick start (automatic phasing mode)

The core algorithm of Hapi to phase a chromosome consists of three main steps:

  • Data preprocessing.
  • Draft haplotype inference.
  • High-resolution haplotype assembly.

Haplotype phasing can be completed by running a single function hapiAutoPhase() with well-setting parameters.

library(HMM)
#library(DT)

### load example data
data(gmt)
rownames(gmt) <- gmt$pos
head(gmt)
##         chr     pos ref alt gmt1 gmt2 gmt3 gmt4 gmt5
## 960055    1  960055   T   C <NA> <NA>    C <NA> <NA>
## 1078815   1 1078815   C   T    C <NA> <NA>    C <NA>
## 1079001   1 1079001   T   A    T <NA> <NA>    T    T
## 1079015   1 1079015   G   A    G <NA> <NA>    G    G
## 1079040   1 1079040   C   T <NA>    T    T    C <NA>
## 1079112   1 1079112   A   G <NA> <NA>    G    A <NA>
### automatic haplotype phasing
hapOutput <- hapiAutoPhase(gmt = gmt, code = 'atcg')
head(hapOutput)
##         chr     pos ref alt gmt1 gmt2 gmt3 gmt4 gmt5 hap1 hap2 total rate
## 960055    1  960055   T   C <NA> <NA>    C <NA> <NA>    T    C     1    1
## 1078815   1 1078815   C   T    C <NA> <NA>    C <NA>    C    T     2    1
## 1079001   1 1079001   T   A    T <NA> <NA>    T    T    T    A     3    1
## 1079015   1 1079015   G   A    G <NA> <NA>    G    G    G    A     3    1
## 1079040   1 1079040   C   T <NA>    T    T    C <NA>    C    T     3    1
## 1079112   1 1079112   A   G <NA> <NA>    G    A <NA>    A    G     2    1
##         confidence
## 960055           L
## 1078815          L
## 1079001          H
## 1079015          H
## 1079040          H
## 1079112          L

3.2 Haplotype phasing step by step

3.2.1 Data preprocessing

3.2.1.1 Convert genotype coding style

To take full advantage of matrix manipulation in R language, A/T/C/G coded genotypes will be first converted to 0/1 coded format with the base2num() function by providing reference and alternative alleles.

### covert A/T/C/G to 0/1
hetDa <- gmt[,1:4]
ref <- hetDa$ref
alt <- hetDa$alt

gmtDa <- gmt[,-(1:4)]
gmtDa <- base2num(gmt = gmtDa, ref = ref, alt = alt)
head(gmtDa)
##         gmt1 gmt2 gmt3 gmt4 gmt5
## 960055    NA   NA    1   NA   NA
## 1078815    0   NA   NA    0   NA
## 1079001    0   NA   NA    0    0
## 1079015    0   NA   NA    0    0
## 1079040   NA    1    1    0   NA
## 1079112   NA   NA    1    0   NA

3.2.1.2 Filter out genotyping error

hetSNPs with potential genotyping errors can be detected and filtered out by a Hidden MarKov Model (HMM) in the hapiErrorFitler() function. The transition and emission probabilities can be defined by users according to genotyping error rate and recombination frequency.

### define HMM probabilities
hmm = initHMM(States=c("S","D"), Symbols=c("s","d"), 
            transProbs=matrix(c(0.99999,0.00001,0.00001,0.99999),2),
            emissionProbs=matrix(c(0.99,0.01,0.01,0.99),2), 
            startProbs = c(0.5,0.5))
hmm
## $States
## [1] "S" "D"
## 
## $Symbols
## [1] "s" "d"
## 
## $startProbs
##   S   D 
## 0.5 0.5 
## 
## $transProbs
##     to
## from       S       D
##    S 0.99999 0.00001
##    D 0.00001 0.99999
## 
## $emissionProbs
##       symbols
## states    s    d
##      S 0.99 0.01
##      D 0.01 0.99
### filter out genotyping errors
gmtDa <- hapiFilterError(gmt = gmtDa, hmm = hmm)
## 35 hetSNPs with potential genotyping errors are filtered out !

3.2.1.3 Select high-quality hetSNPs to form a framework

For low coverage sequencing, hetSNPs that are genotyped in at least 3 gametes can be selected to form a ‘precursor’ framework for the draft haplotype construction. For dataset with high genotyping rate, hetSNPs that are observed in more gametes may be selected.

### select a subset of high-quality markers
gmtFrame <- hapiFrameSelection(gmt = gmtDa, n = 3) ###
## Number of hetSNPs in the framework: 21074

3.2.1.4 Imputation of missing data in the framework

Missing genotypes in each gamete are iteratively imputed by observed genotypes in other gametes to faciliate the draft haplotype inference. In the hapiImpute() function, nSPT specifies the minumum number of supporting gametes to impute a hetSNP. By default, nSPT=2, which means a hetSNP in a gamete can be imputed only if imputations are supported by more than 2 gametes and no imputation conflict from different supporting gametes. allowNA is the maximum number of gametes with NA values at a locus that are allowed after imputation. This parameter is useful only when the missing genotype rate is extremely high (eg., >80%) thus few markers can be completely imputed acroass all gamete cells. By default, allowNA=0. The markers, usually of a small number, with missing data that cannot be fully resolved by imputation are eliminated from the framework.

### imputation
imputedFrame <- hapiImupte(gmt = gmtFrame, nSPT = 2, allowNA = 0)
## Number of hetSNPs after imputation: 20994
head(imputedFrame)
##         gmt1 gmt2 gmt3 gmt4 gmt5
## 1135959    0    1    1    0    0
## 1136439    0    1    1    0    0
## 1136733    0    1    1    0    0
## 1137140    0    1    1    0    0
## 1137167    0    1    1    0    0
## 1137267    0    1    1    0    0

3.2.2 Draft haplotype inference

3.2.2.1 Majority voting for draft haplotype inference

With the imputed genotype data, draft haplotypes can be constructed based upon majority voting strategy implemented in the hapiPhase() function. The majority voting strategy is used to infer draft haplotypes by counting the number of links between two adjacent markers. Because recombination frequency is very low, the appearance of links that represents true haplotypes (hap-links) should be much more frequent than those generated by crossovers (cv-links). Draft haplotypes can be determined through voting for the major links in sliding windows of 2 hetSNPs along the whole chromosome.

### majority voting
draftHap <- hapiPhase(gmt = imputedFrame) ###
head(draftHap)
##         hap haplink cvlink tlink ratio
## 1135959   0       5      0     5     0
## 1136439   0       5      0     5     0
## 1136733   0       5      0     5     0
## 1137140   0       5      0     5     0
## 1137167   0       5      0     5     0
## 1137267   0       5      0     5     0
### check positions with cv-links
draftHap[draftHap$cvlink>=1,]
##           hap haplink cvlink tlink ratio
## 2665203     0       4      1     5  0.25
## 4685329     0       4      1     5  0.25
## 11730915    0       4      1     5  0.25
## 29414096    0       4      1     5  0.25
## 29465722    0       4      1     5  0.25
## 35178660    0       4      1     5  0.25
## 35322900    0       4      1     5  0.25
## 133530288   0       4      1     5  0.25
## 141518185   0       4      1     5  0.25
## 144823941   0       4      1     5  0.25

3.2.2.2 Proofreading of draft haplotypes

The first step of proofreading is to filter out small regions (eg. < 1M) with multiple cv-links. Erroneous/ambiguous genotypes in complex genomic regions will largely affect the accuracy of imputation and thus impact the determination of hap-links by majority voting. If multiple cv-links are identified in a very small region, hetSNPs in this region will be eliminated from the frame and new draft haplotypes will be inferred again.

### identification of clusters of cv-links
cvCluster <- hapiCVCluster(draftHap = draftHap, cvlink = 2)
cvCluster
## [1] left  right
## <0 rows> (or 0-length row.names)
### determine hetSNPs in small regions involving multiple cv-links
filter <- c()
for (i in 1:nrow(cvCluster)) {
    filter <- c(filter, which (rownames(draftHap) >= cvCluster$left[i] & 
        rownames(draftHap) <= cvCluster$right[i]))
}

length(filter)
## [1] 0
### filter out hetSNPs in complex regions and infer new draft haplotypes
if (length(filter) > 0) {
    imputedFrame <- imputedFrame[-filter, ]
    draftHap <- hapiPhase(imputedFrame)
} 

In the second step of proofreading, a Maximum Parsimony of Recombination (MPR) strategy is adopted to proofread draft haplotypes at genomic positions where there are potential incorrect cv-links. By MPR, we believe that the most likely haplotype should be the one generating less number of crossovers across all gamete cells. The cvlink parameter is used to determine the positions for dividing draft haplotypes into blocks for proofreading. With 5 or more gametes, positions that harboring more than 2 cv-links across all the gametes will be used by default. With 3 or 4 gametes, probably all the positions with cv-links should be proofread.

finalDraft <- hapiBlockMPR(draftHap = draftHap, gmtFrame = gmtFrame, cvlink = 1)
## Size of blocks: 132,119,600,3957,53,1340,75,13412,555,264,487
## 2 blocks are removed !
## Number of crossovers given haplotype 1/2: 1/4
## Number of crossovers given haplotype 1/2: 1/4
## Number of crossovers given haplotype 1/2: 1/4
## Number of crossovers given haplotype 1/2: 0/5
## Number of crossovers given haplotype 1/2: 0/5
## Number of crossovers given haplotype 1/2: 1/4
## Number of crossovers given haplotype 1/2: 1/4
## Number of crossovers given haplotype 1/2: 1/4
head(finalDraft)
## 1135959 1136439 1136733 1137140 1137167 1137267 
##       0       0       0       0       0       0

3.2.3 High-resolution haplotype assembly

Each gamete chromosome is compared to the draft haplotypes to deduce gamete-specific haplotypes, with the non-framework markers being phased. Consensus high-resolution haplotypes are eventually determined with these gamete-specific haplotypes by the hapiAssembl() function.

consensusHap <- hapiAssemble(draftHap = finalDraft, gmt = gmtDa)
## Good consistence between draft haplotype andgmt1: 1
## Good consistence between draft haplotype andgmt2: 1
## Good consistence between draft haplotype andgmt3: 1
## Good consistence between draft haplotype andgmt4: 1
## Good consistence between draft haplotype andgmt5: 1
head(consensusHap)
##         hap1 hap2 total rate confidence
## 960055     0    1     1    1          L
## 1078815    0    1     2    1          H
## 1079001    0    1     3    1          H
## 1079015    0    1     3    1          H
## 1079040    0    1     3    1          H
## 1079112    0    1     2    1          H

When a crossover occurs at the end of a gamete chromosome where hetSNPs are not enclosed in the framework, it becomes very challenging to correctly infer the haplotypes for this region. Hapi employs an additional capping strategy to polish two ends of chromosomal haplotypes in the hapiAssembleEnd() function.

consensusHap <- hapiAssembleEnd(gmt = gmtDa, draftHap = finalDraft, 
                                consensusHap = consensusHap, k = 300)
## Number of reference pollens: 5
## 
## Number of reference pollens: 5
### Haplotype 1
hap1 <- sum(consensusHap$hap1==0)
### Haplotype 2
hap2 <- sum(consensusHap$hap1==1)
### Number of unphased hetSNPs
hap7 <- sum(consensusHap$hap1==7)

### Accuracy
max(hap1, hap2)/sum(hap1, hap2)
## [1] 0.9999259

After inference of the high-resolution haplotypes, genotypes are then converted back to A/T/C/G coded format.

### find hetSNP overlaps
snp <- which(rownames(hetDa) %in% rownames(consensusHap))

ref <- hetDa$ref[snp]
alt <- hetDa$alt[snp]

### convert back to A/T/C/G
consensusHap <- num2base(hap = consensusHap, ref = ref, alt = alt)
head(consensusHap)
##         hap1 hap2 total rate confidence
## 960055     T    C     1    1          L
## 1078815    C    T     2    1          L
## 1079001    T    A     3    1          H
## 1079015    G    A     3    1          H
## 1079040    C    T     3    1          H
## 1079112    A    G     2    1          L
### output all the information
hapOutput <- data.frame(gmt[snp,], consensusHap)
head(hapOutput)
##         chr     pos ref alt gmt1 gmt2 gmt3 gmt4 gmt5 hap1 hap2 total rate
## 960055    1  960055   T   C <NA> <NA>    C <NA> <NA>    T    C     1    1
## 1078815   1 1078815   C   T    C <NA> <NA>    C <NA>    C    T     2    1
## 1079001   1 1079001   T   A    T <NA> <NA>    T    T    T    A     3    1
## 1079015   1 1079015   G   A    G <NA> <NA>    G    G    G    A     3    1
## 1079040   1 1079040   C   T <NA>    T    T    C <NA>    C    T     3    1
## 1079112   1 1079112   A   G <NA> <NA>    G    A <NA>    A    G     2    1
##         confidence
## 960055           L
## 1078815          L
## 1079001          H
## 1079015          H
## 1079040          H
## 1079112          L

3.2.4 Visualization of haplotypes in single gamete cells

The visualization function hapiGameteView() is provided here to view haplotypes and crossovers in a single gamete cell.

### load haplotypes in each gamete cell
data(gamete11)
head(gamete11)
##   chr     pos hap
## 1   1 3787130  NA
## 2   1 3793818  NA
## 3   1 3798306   0
## 4   1 3823395  NA
## 5   1 4253891   0
## 6   1 4259253   0
### load chromosome information of the genome
data(hg19)
head(hg19)
##         length  cenStart    cenEnd
## chr1 249250621 121535434 124535434
## chr2 243199373  92326171  95326171
## chr3 198022430  90504854  93504854
## chr4 191154276  49660117  52660117
## chr5 180915260  46405641  49405641
## chr6 171115067  58830166  61830166
### view gamete cells 
hapiGameteView(chr = hg19, hap = gamete11)

4 Crossover analysis module

Mapping recombinations at the individual level is a very important by-product of gamete-based phasing. The crossover analysis module in Hapi provides an ideal tool for investigating recombination events in single gamete cells, which has the potential to be applied in clinical labs to manage human diseases that are associated with abnormal recombination. In addition, it can also be used to monitor the crossovers on plant genomes to facilitate more rapid introgression of target genes or to break up undesirable linkages for crop improvement.

4.1 Identification of crossovers

With the high-resolution chromosome-length haplotypes, crossovers in each gamete can be easily identified by the hapiIdentifyCV() function which also adopts a HMM.

### haplotypes
hap <- hapOutput[,10:11]
head(hap)
##         hap1 hap2
## 960055     T    C
## 1078815    C    T
## 1079001    T    A
## 1079015    G    A
## 1079040    C    T
## 1079112    A    G
### gametes
gmt <- hapOutput[,5:9]
head(gmt)
##         gmt1 gmt2 gmt3 gmt4 gmt5
## 960055  <NA> <NA>    C <NA> <NA>
## 1078815    C <NA> <NA>    C <NA>
## 1079001    T <NA> <NA>    T    T
## 1079015    G <NA> <NA>    G    G
## 1079040 <NA>    T    T    C <NA>
## 1079112 <NA> <NA>    G    A <NA>
### identify crossover
cvOutput <- hapiIdentifyCV(hap = hap, gmt = gmt)
cvOutput
##     gmt     start       end       pos    res
## 1  gmt1 144704855 144823941 144764398 119085
## 2  gmt2   4670029   4685329   4677679  15299
## 3  gmt2  11605193  11730555  11667874 125361
## 4  gmt2 141005431 141518185 141261808 512753
## 5  gmt4  29385454  29414096  29399775  28641
## 6  gmt4  29459037  29465722  29462379   6684
## 7  gmt4  35153363  35178660  35166011  25296
## 8  gmt4  35236017  35322900  35279458  86882
## 9  gmt5   2488318   2665203   2576760 176884
## 10 gmt5 133333150 133441645 133387397 108494

4.2 Crossover visualization

Resolution of recombination events, distances between adjacent crossovers, and the distribution of recombinations along each chromosome can be visualized by functions provided in the crossover analysis module.

### load crossover table
data(crossover)
head(crossover)

4.2.1 Visualization of crossover resolution

hapiCVResolution() function can generate a histogram of resolution of crossovers.

hapiCVResolution(cv = crossover)

4.2.2 Visualization of crossover distances

Distribution of distances between adjacent crossovers on the same chromosome can be visualized by the hapiCVDistance() function,and may also be used to study coexisting crossovers as well as the crossover interference phenomenon.

hapiCVDistance(cv = crossover)

4.2.3 Visualization of crossover map

hapiCVMap() generates a map of recombination events in an interval of a specified size (eg., 5M) across all the gamete cells.

hapiCVMap(chr = hg19, cv = crossover, step = 5)

5 sessionInfo

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] HMM_1.0    Hapi_0.0.3
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17     bindr_0.1.1      knitr_1.20       magrittr_1.5    
##  [5] tidyselect_0.2.4 munsell_0.5.0    colorspace_1.3-2 R6_2.2.2        
##  [9] rlang_0.2.1.9000 dplyr_0.7.6      stringr_1.3.1    plyr_1.8.4      
## [13] tools_3.4.3      grid_3.4.3       gtable_0.2.0     htmltools_0.3.6 
## [17] assertthat_0.2.0 yaml_2.1.19      lazyeval_0.2.1   rprojroot_1.3-2 
## [21] digest_0.6.15    tibble_1.4.2     crayon_1.3.4     bindrcpp_0.2.2  
## [25] purrr_0.2.5      ggplot2_3.0.0    prettydoc_0.2.1  glue_1.3.0      
## [29] evaluate_0.11    rmarkdown_1.10   stringi_1.2.4    compiler_3.4.3  
## [33] pillar_1.3.0     scales_0.5.0     backports_1.1.2  pkgconfig_2.0.1