1 Introduction

umiAnalyzer provides tools for analyzing UMIErrorCorrect output. UMIErrorCorrect is a Python package for processing targeted high-throughput sequencing data containing unique molecular identifiers (UMIs). UMIErrorCorrect is available at: https://github.com/stahlberggroup/umierrorcorrect.

1.1 Using the umiVisualizer shiny app

The shiny app can be run using the following command:

umiAnalyzer::runUmiVisualizer()

1.2 How to make your own UMIexperiment object

Define a variable containing the path to the directory with all the UMIErrorCorrect output folders belonging to your experiment. umiAnalyzer comes with raw test data generated with UMIErrorCorrect that you can import if you don’t have any of your own.

Call the createUmiExperiment to create your UMIexperiment object.

The UMIexperiment object always maintains your raw data, however you can create as many filters as you like, which will be saved as separate objects to access. You can filter the consensus table of UMIexperiment object with filterUmiObject. The only mandatory arguments are the object to be filtered and a user defined name. You can use that name to retrieve a filtered table using getFilteredData.

library(umiAnalyzer)

main <- system.file("extdata", package = "umiAnalyzer")

simsen <- createUmiExperiment(main)

simsen <- mergeAssays(
  object = simsen,
  name = "new",
  assay.list = c("PIK3CA_123", "PIK3CA_234")
)

createUmiExperiment has an optional flag importBam, which is set to FALSE by default. If this is set to TRUE it will automatically call parseBamFiles and store the read data in the ‘simsen’ object which can then be passed directly to plotFamilyHistogram, without having to run parseBamFiles again. Note that for large experiments, especially if consDepth is set lower than 10, the size of the experiment object may become too large.

reads <- parseBamFiles(main, consDepth = 10)

BarcodeFamilyHistogram(reads)
#> Warning: Removed 1787 rows containing non-finite values (stat_bin).
#> Warning: Removed 4 rows containing missing values (geom_bar).

Next we generate Quality Control plots and filter the UMIexperiment object to select for consensus 3 reads.

QCplot(simsen)

Most downstream functions use filtered data, each each filtered data set receives a name, making it possible to apply multiple filters on the same object as the original data is maintained and can always be retrieved with saveConsData (see below). Each filtered data set can be retrieved using getFilteredData and assigned to a new variable or be saved as a csv file with the optional parameter save = TRUE, which is set to FALSE by

simsen <- filterUmiObject(simsen)

Optionally, the beta-binomial variant caller can be run on the data using callVariants.

# This is optional
simsen <- callVariants(simsen)

Retrieve filters using getFilteredData:

myfilter <- getFilteredData(simsen)
myfilter
#> # A tibble: 562 x 18
#>    `Sample Name`   Contig Position Name  Reference     A     C     G     T     I
#>    <chr>           <chr>     <dbl> <chr> <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 VAF-1-5ng-1-10x chr17   7673746 TP53~ T             0     0     0  2809     0
#>  2 VAF-1-5ng-1-10x chr17   7673747 TP53~ C             0  2809     0     0     0
#>  3 VAF-1-5ng-1-10x chr17   7673748 TP53~ T             1     0     0  2808     0
#>  4 VAF-1-5ng-1-10x chr17   7673749 TP53~ T             0     0     0  2809     0
#>  5 VAF-1-5ng-1-10x chr17   7673750 TP53~ G             0     0  2809     0     0
#>  6 VAF-1-5ng-1-10x chr17   7673751 TP53~ C             0  2810     0     0     0
#>  7 VAF-1-5ng-1-10x chr17   7673752 TP53~ G             0     0  2810     0     0
#>  8 VAF-1-5ng-1-10x chr17   7673753 TP53~ G             0     0  2810     0     0
#>  9 VAF-1-5ng-1-10x chr17   7673754 TP53~ A          2810     0     0     0     0
#> 10 VAF-1-5ng-1-10x chr17   7673755 TP53~ G             0     0  2809     1     0
#> # ... with 552 more rows, and 8 more variables: D <dbl>, N <dbl>,
#> #   Coverage <dbl>, Consensus group size <dbl>, Max Non-ref Allele Count <dbl>,
#> #   Max Non-ref Allele Frequency <dbl>, Max Non-ref Allele <chr>, sample <chr>

Error correction depends on UMI depth, i.e. how many reads per barcode are required to form a barcode family. We can plot the number of available reads for different consensus depths.

UmiCountsPlot(simsen)

Next we generate plots for the amplicons and samples in the UMIexperiment object.

If generateAmpliconPlots is called and the number of amplicons and samples is too large to plot all of them individually in a single plot the data is shown in summarized form. The user can specify amplicons and or samples optionally to plot only the selection.

1.2.1 All data

AmpliconPlot(simsen)

1.2.2 Selection

AmpliconPlot(
  object = simsen,
  amplicons = 'KIT_125',
  samples = 'VAF-1-5ng-1-10x'
)

1.3 Merging technical replicates

In the above analyses potential replicates are treated as individual samples, but it is possible to merge replicate information for statistical analyses and more convenient plotting.

Merging data requires the user to supply a file with meta data using the importDesign function. This will create a metadata attribute called “design”, which can be retrieved using getMetaData.

metaData <- system.file("extdata", "metadata.txt", package = "umiAnalyzer")

simsen <- importDesign(
  object = simsen,
  file = metaData
)

design <- getMetaData(
  object = simsen, 
  attributeName = "design"
)

design

1.4 Time series data

We can generate time course data using the analyzeTimeSeries function. This requires some information from the meta data that was uploaded above. Most importantly a time variable needs to be available in the meta data object called “time” by default. However the variable can have any name, which can be specified using the time.var parameter. If time.var has a date format that will be used for plotting, otherwise it will convert time.var to a categorical. If you have time.var in date format but want the plot to use categorical instead set the categorical parameter to TRUE.

We can also make use of replicate data to remove background noise. If you specify a group.by variable from the meta data only variants that occur at least twice per group will be used. If we specify ‘replicate’ that means only variants in at least 2 out of 3 replicates will be considered. The default value for group.by is NULL and each sample will be analyzed and plotted independently.

simsen <- umiAnalyzer::analyzeTimeSeries(
  object = simsen,
  time.var = 'time',
  group.by = 'replicate'
)

We can now use the mergeTechnicalReplicates function to create a summarized data table. Merging replicates also performs coverage normalization. This can remove bias incurred when comparing samples sequenced at different read depths, because background noise is expected to be higher in a sample that was sequenced to a greater depth. A drawback is that it is not possible to scale a count of zero variant reads one can set all 0 to a small value, such as 0.5, first and then scale the data.

simsen <- mergeTechnicalReplicates(
  object = simsen,
  group.by = 'replicate',
  option = 'Set1', 
  amplicons = c('new', 'TP53_1', 'TP53_7')
)

We can also have a look at the merged amplicon plots which now show the average maximum alternate allele count and standard deviation.

vizMergedData(
  simsen,
  amplicons = c('new')
)

2 Calling variants

umiAnalyzer comes with a build-in UMIexperiment object to explore, which was generated using the code above, so it can be used without creating the it first if so desired.

In order to call variants using the umiAnalyzer variant caller simply load the package and test data and use the callVariants function. You can then filter the resulting consensus data (cons.data) within the object, e.g. for significant variants.

simsen <- callVariants(simsen)

simsen <- filterVariants(simsen)

simsen <- generateAmpliconPlots(
  object = simsen
)

3 Other functions (experimental)

3.1 Writing a Variant Call File (VCF)

generateVCF(object = simsen, outFile = 'simsen.vcf', printAll = FALSE, save = FALSE)

3.2 Saving consensus data as a csv file

It is possible to retrieve consensus data as a tibble using saveConsData. This function can also be used to to save the data as a csv file with delimiters either ‘,’, ‘;’ or tab. This requires the user to set the save parameter to TRUE and to specify an output directory (default is the working directory), filename (default is ‘consensus_data.csv’) and delimiter (default is ‘;’).

consensus_data <- saveConsData(object = simsen)

outDir <- getwd()
saveConsData(object = simsen, outDir = outDir, delim = ';', save = TRUE)

4 System info

sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                    LC_CTYPE=English_Sweden.1252   
#> [3] LC_MONETARY=English_Sweden.1252 LC_NUMERIC=C                   
#> [5] LC_TIME=English_Sweden.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] umiAnalyzer_1.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] viridis_0.6.2          httr_1.4.2             sass_0.4.0            
#>  [4] shinyFiles_0.9.1       tidyr_1.1.3            bit64_4.0.5           
#>  [7] vroom_1.5.6            viridisLite_0.4.0      jsonlite_1.7.2        
#> [10] bslib_0.3.1            shiny_1.7.1            assertthat_0.2.1      
#> [13] highr_0.9              BiocManager_1.30.16    stats4_4.1.0          
#> [16] GenomeInfoDbData_1.2.6 Rsamtools_2.8.0        yaml_2.2.1            
#> [19] pillar_1.6.4           glue_1.4.2             digest_0.6.27         
#> [22] GenomicRanges_1.44.0   RColorBrewer_1.1-2     promises_1.2.0.1      
#> [25] XVector_0.32.0         colorspace_2.0-2       htmltools_0.5.2       
#> [28] httpuv_1.6.3           pkgconfig_2.0.3        pheatmap_1.0.12       
#> [31] zlibbioc_1.38.0        purrr_0.3.4            xtable_1.8-4          
#> [34] scales_1.1.1           later_1.3.0            tzdb_0.1.2            
#> [37] BiocParallel_1.26.1    tibble_3.1.2           farver_2.1.0          
#> [40] generics_0.1.1         IRanges_2.26.0         ggplot2_3.3.5         
#> [43] ellipsis_0.3.2         DT_0.20                BiocGenerics_0.38.0   
#> [46] lazyeval_0.2.2         cli_3.1.0              magrittr_2.0.1        
#> [49] crayon_1.4.2           mime_0.12              evaluate_0.14         
#> [52] fs_1.5.0               fansi_0.5.0            forcats_0.5.1         
#> [55] shinydashboard_0.7.2   tools_4.1.0            data.table_1.14.2     
#> [58] hms_1.1.1              lifecycle_1.0.1        stringr_1.4.0         
#> [61] plotly_4.10.0          S4Vectors_0.30.0       munsell_0.5.0         
#> [64] Biostrings_2.60.1      compiler_4.1.0         jquerylib_0.1.4       
#> [67] GenomeInfoDb_1.28.1    rlang_0.4.12           grid_4.1.0            
#> [70] RCurl_1.98-1.3         rstudioapi_0.13        htmlwidgets_1.5.4     
#> [73] labeling_0.4.2         bitops_1.0-7           rmarkdown_2.11        
#> [76] shinyWidgets_0.6.2     gtable_0.3.0           DBI_1.1.1             
#> [79] R6_2.5.1               gridExtra_2.3          knitr_1.36            
#> [82] dplyr_1.0.7            bit_4.0.4              fastmap_1.1.0         
#> [85] utf8_1.2.1             readr_2.1.0            stringi_1.6.2         
#> [88] parallel_4.1.0         Rcpp_1.0.7             vctrs_0.3.8           
#> [91] tidyselect_1.1.1       xfun_0.24