psSubpathway: a software package for flexible identification of phenotype specific subpathways in the cancer progression

Xudong Han
Junwei Han
Qingfei Kong

2024-03-18

Introduction

psSubpathway is a method designed to discover the relationship between subpathway and multiple sample phenotype.psSubpathway consists of four parts:

  1. Extracting subpathways from canonical pathways. We used k-clique method in social network analysis to extract the subpathways and eliminated the smaller subpathways that had an overlap above 80%; The subpathway data is stored in a list structure.
  2. Inferring subpathway activity profiles. The smaple-specific subpathway activity profiles was inferred by using sample-based gene set enrichment method (GSVA or ssGSEA);
  3. Identifying subtype-specific subpathways. We developed a Subtype Set Enrichment Analysis (SubSEA) method to identify subtype-specific subpathways across multi-subtype groups of cancer samples;
  4. Identifying the dynamic changed subpathways. We developed a Dynamic Changed Subpathway Analysis (DCSA) method to identify the dynamic changed subpathways associated with the developmental stage of cancer.

Its capabilities render psSubpathway could find the specific dysregulated subpathways in multi-phenotypes samples, and fill the gap in recent subpathway identification methods which generally found the differentially expressed subpathways between normal and cancer samples. psSubpathway may identify more precision biomarkers and phenotype-related disease mechanisms to help to tailored treatment for patients with cancer.

SubSEA:identifying subtype-specific subpathways

The function SubSEA can identify subtype-specific subpathways across multi-subtype groups of cancer samples. For each subpathway, we ranked the N samples in the dataset to form a sample list L=<s1, s2…sN> according to decreasing subpathway activity. The samples in a given subtype were mapped to the sample list L. If the samples in the subtype significantly cluster at the top or bottom of the entire ranked list L, the subpathway will be associated with the specific subtype. We used the weighted Kolmogorov-Smirnov statistic to calculate an sample enrichment score (SES), which reflects the degree to which the samples in a subtype is overrepresented toward the extremes (top or bottom) of the sample list L. The SES is calculated by walking down the list L, increasing a running-sum statistic when we encounter a sample in the subtype and decreasing it when we encounter a sample not in the subtype.

This function requires user input the gene expression profile identified and the corresponding phenotype file of the sample (CLS format).In addition, this function requires input of subpathway list data,which has been extracted from KEGG pathway data and saved into the package environment variables. Of course, users can also define their own dataset list data as input, as long as the gene identification and gene expression profile of gene sets are maintained.

We selected the breast cancer gene expression profile data in the GDC TCGA database and pm50 phenotype file containing four breast cancer subtypes for SubSEA and obtained the results of subpathways related to each subtype of breast cancer. The commands are as follows.

# load depend package.
 require(GSVA)
#> Loading required package: GSVA
 require(parallel)
#> Loading required package: parallel
# get breast cancer disease subtype gene expression profile.
 Bregenematrix<-get("Subgenematrix")
# get path of the sample disease subtype files.
 Subtypelabels<- system.file("extdata", "Sublabels.cls", package = "psSubpathway")
# perform the SubSEA method.
#SubSEAresult<-SubSEA(Bregenematrix,
#                     input.cls=Subtypelabels,
#                     nperm=50,                     # Number of random permutations
#                     fdr.th=0.01,                  # Cutoff value for fdr.when fdr.th=1,get all subpathway
#                     parallel.sz=2)                # Number of processors to use  
 
# get the result of the SubSEA function
 SubSEAresult<-get("Subspwresult")
 str(SubSEAresult)
#> List of 5
#>  $ Basal    :'data.frame':   15 obs. of  6 variables:
#>   ..$ SubpathwayID  : Factor w/ 15 levels "00120_19","00120_9",..: 2 1 3 4 5 6 7 8 9 10 ...
#>   ..$ PathwayName   : Factor w/ 14 levels "Apoptosis","C-type lectin receptor signaling pathway",..: 10 10 3 13 8 6 4 11 1 14 ...
#>   ..$ SubpathwayGene: Factor w/ 15 levels "AMACR ACOX2 HSD17B4 SCP2",..: 1 13 6 7 9 3 15 12 2 4 ...
#>   ..$ SES           : num [1:15] -0.885 -0.829 -0.848 -0.767 -0.665 ...
#>   ..$ Pvalue        : num [1:15] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..$ FDR           : num [1:15] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ Her2     :'data.frame':   25 obs. of  6 variables:
#>   ..$ SubpathwayID  : Factor w/ 25 levels "00010_6","00020_4",..: 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ PathwayName   : Factor w/ 24 levels "Arginine and proline metabolism",..: 12 3 8 1 16 11 24 22 5 17 ...
#>   ..$ SubpathwayGene: Factor w/ 25 levels "ADCY1 ADCY2 ADCY3 ADCY5 ADCY6 ADCY7 ADCY8 ADCY9 ADCY4 PRKACA PRKACB PRKACG CREB1 ATF2 CREM ATF1 ATF3 ATF4 XBP1 "| __truncated__,..: 3 18 2 23 13 22 12 24 7 16 ...
#>   ..$ SES           : num [1:25] 0.704 0.681 0.725 0.817 0.711 ...
#>   ..$ Pvalue        : num [1:25] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..$ FDR           : num [1:25] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ LumA     :'data.frame':   49 obs. of  6 variables:
#>   ..$ SubpathwayID  : Factor w/ 49 levels "00052_2","00340_5",..: 1 2 3 5 4 6 7 8 9 10 ...
#>   ..$ PathwayName   : Factor w/ 36 levels "Adipocytokine signaling pathway",..: 15 18 32 11 11 30 13 21 7 14 ...
#>   ..$ SubpathwayGene: Factor w/ 49 levels "ADCY1 ADCY2 ADCY3 ADCY5 ADCY6 ADCY7 ADCY8 ADCY9 ADCY4 PRKACA PRKACB PRKACG CREB1 ATF2 CREM ATF1 ATF3 ATF4 XBP1 "| __truncated__,..: 3 23 14 17 12 49 16 42 41 10 ...
#>   ..$ SES           : num [1:49] -0.731 0.773 -0.554 0.677 0.703 ...
#>   ..$ Pvalue        : num [1:49] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..$ FDR           : num [1:49] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ LumB     :'data.frame':   22 obs. of  6 variables:
#>   ..$ SubpathwayID  : Factor w/ 22 levels "00350_2","00563_1",..: 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ PathwayName   : Factor w/ 22 levels "Amyotrophic lateral sclerosis (ALS)",..: 22 7 18 15 6 4 12 8 2 13 ...
#>   ..$ SubpathwayGene: Factor w/ 22 levels "ADAM17 HBEGF ADAM10 CXCL8 CXCR1 CXCR2 EGFR",..: 6 17 7 11 10 14 12 18 19 21 ...
#>   ..$ SES           : num [1:22] 0.576 0.559 0.608 0.627 -0.658 ...
#>   ..$ Pvalue        : num [1:22] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..$ FDR           : num [1:22] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ spwmatrix: num [1:725, 1:67] 0.1892 0.0467 0.1907 0.0874 0.1993 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:725] "00010_1" "00010_2" "00010_5" "00010_6" ...
#>   .. ..$ : chr [1:67] "Her2" "LumA" "LumB" "LumA" ...
 head(SubSEAresult$Basal)
#>          SubpathwayID
#> 00120_9       00120_9
#> 00120_19     00120_19
#> 00232_1       00232_1
#> 00280_10     00280_10
#> 00510_8       00510_8
#> 00601_8       00601_8
#>                                                         PathwayName
#> 00120_9                              Primary bile acid biosynthesis
#> 00120_19                             Primary bile acid biosynthesis
#> 00232_1                                         Caffeine metabolism
#> 00280_10                 Valine, leucine and isoleucine degradation
#> 00510_8                                       N-Glycan biosynthesis
#> 00601_8  Glycosphingolipid biosynthesis - lacto and neolacto series
#>                                                                                                                                                                           SubpathwayGene
#> 00120_9                                                                                                                                                         AMACR ACOX2 HSD17B4 SCP2
#> 00120_19                                                                                                                                             SLC27A5 CYP27A1 AMACR ACOX2 HSD17B4
#> 00232_1                                                                                                                                                      CYP1A2 XDH NAT2 NAT1 CYP2A6
#> 00280_10                                                     EHHADH HADH ACADSB AOX1 ALDH2 ALDH1B1 ALDH9A1 ALDH3A2 ALDH7A1 ALDH6A1 ABAT HIBADH HIBCH ECHS1 HADHA ACADM ACADS ACAD8 AGXT2
#> 00510_8                                                                                                   FUT8 MGAT3 MGAT2 MAN2A2 MAN2A1 MAN1A2 MAN1B1 MAN1A1 MAN1C1 MGAT4B MGAT4A MGAT1
#> 00601_8  B4GALT4 B4GALT3 B4GALT2 FUT4 FUT7 FUT3 FUT5 FUT6 ST8SIA1 ST3GAL6 FUT9 B3GNT3 B3GNT2 B3GNT4 B4GALT1 ABO FUT1 FUT2 B3GNT5 B3GALT1 B3GALT2 B3GALT5 ST3GAL3 ST3GAL4 A4GALT B3GALNT1
#>               SES Pvalue FDR
#> 00120_9  -0.88462      0   0
#> 00120_19 -0.82945      0   0
#> 00232_1  -0.84790      0   0
#> 00280_10 -0.76735      0   0
#> 00510_8  -0.66495      0   0
#> 00601_8   0.74271      0   0

DCSA: Identifying the dynamically changed subpathways associated with the phenotype

The function DCSA used the information-theoretic measure of statistical dependence, mutual information (MI), to estimate the dynamically changed subpathways associated with the phenotypic change.

This function requires loading dependent mpmi packages. We selected adrenocortical cancer (ACC) gene expression profile and the disease stage phenotypes from the GDC TCGA database as an example of DCSA function. The sample contains four disease stages(StageI~StageIV). The following commands will perform DCSA functions to obtain dynamic change subpathways related to changes in the breast cancer development stage.

# load depend package.
 require(mpmi)
# get ACC disease stage gene expression profiling.
 ACCgenematrix<-get("DCgenematrix")
# get path of the sample disease stage phenotype files.
 Stagelabels<-system.file("extdata", "DClabels.cls", package = "psSubpathway")
# perform the DCSA method.
#DCSAresult<-DCSA(ACCgenematrix,input.cls=Stagelabels,nperm=50,fdr.th=0.01,parallel.sz=2)
# get the result of the SubSEA function
 DCSAresult<-get("DCspwresult")
 str(DCSAresult)
#> List of 2
#>  $ DCSA     :'data.frame':   13 obs. of  6 variables:
#>   ..$ SubpahwayID   : Factor w/ 13 levels "00220_2","00520_2",..: 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ PahwayName    : Factor w/ 12 levels "Amino sugar and nucleotide sugar metabolism",..: 2 1 4 12 6 3 3 10 11 5 ...
#>   ..$ SubpathwayGene: Factor w/ 13 levels "CDK1 CCNB2 ANAPC10 CDC26 ANAPC13 ANAPC2 ANAPC4 ANAPC5 ANAPC7 ANAPC11 ANAPC1 CDC23 CDC16 CDC27 CDC20 PTTG2 PTTG1"| __truncated__,..: 7 9 4 11 3 6 5 1 10 13 ...
#>   ..$ D(M)          : num [1:13] 0.191 0.166 0.254 0.191 0.153 ...
#>   ..$ Pvalue        : num [1:13] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..$ FDR           : num [1:13] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ spwmatrix: num [1:725, 1:77] -0.146 -0.507 -0.503 -0.502 -0.158 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:725] "00010_1" "00010_2" "00010_5" "00010_6" ...
#>   .. ..$ : chr [1:77] "StageI" "StageI" "StageI" "StageI" ...
 head(DCSAresult$DCSA)
#>          SubpahwayID                                  PahwayName
#> 00220_2      00220_2                       Arginine biosynthesis
#> 00520_2      00520_2 Amino sugar and nucleotide sugar metabolism
#> 00982_2      00982_2           Drug metabolism - cytochrome P450
#> 03013_2      03013_2                               RNA transport
#> 04066_10    04066_10                     HIF-1 signaling pathway
#> 04110_6      04110_6                                  Cell cycle
#>                                                                                                                                                                                                                                                   SubpathwayGene
#> 00220_2                                                                                                                                                            GLS2 GLS GLUL CPS1 OTC ASS1 ARG1 ARG2 NAGS ACY1 NOS1 NOS2 NOS3 GLUD1 GLUD2 GOT1 GOT2 GPT GPT2
#> 00520_2                                                                                                                                 NAGK PGM3 UAP1 UAP1L1 GNE GNPDA1 GNPDA2 AMDHD2 GFPT1 GFPT2 GNPNAT1 NANS RENBP HEXA HEXB CHIT1 CHIA NPL HK1 HK2 HK3 HKDC1
#> 00982_2                                                                                                                                                                                                            FMO1 FMO2 FMO3 FMO4 FMO5 CYP3A4 CYP2D6 CYP2C9
#> 03013_2                                                                                                                                                                                   RAN XPO1 NCBP2 NCBP1 PHAX NXF1 NXF5 NXF3 NXF2 NXF2B NUP214 NUP88 NUP42
#> 04066_10                                                                                                 EGLN2 EGLN3 EGLN1 PIK3CA PIK3CB PIK3CD PIK3R1 PIK3R2 PIK3R3 AKT3 AKT1 AKT2 MTOR PLCG1 PLCG2 PRKCA PRKCB PRKCG EIF4EBP1 RPS6KB1 RPS6KB2 RPS6 MAPK1 MAPK3
#> 04110_6  FZR1 ANAPC10 CDC26 ANAPC13 ANAPC2 ANAPC4 ANAPC5 ANAPC7 ANAPC11 ANAPC1 CDC23 CDC16 CDC27 CDK1 PLK1 CDC14B CDC14A YWHAQ YWHAB YWHAE YWHAG YWHAH YWHAZ SFN CHEK1 CHEK2 PKMYT1 WEE2 WEE1 GADD45G GADD45A GADD45B TP53 CDC20 CDC25B CDC25C CCNB3 CCNB1 CCNB2
#>               D(M) Pvalue FDR
#> 00220_2  0.1909344      0   0
#> 00520_2  0.1663632      0   0
#> 00982_2  0.2541684      0   0
#> 03013_2  0.1913032      0   0
#> 04066_10 0.1529787      0   0
#> 04110_6  0.1782991      0   0

Visualize

We provide a set of visual analysis functions including plotSubSEScurve,plotSpwACmap,plotSpwNetmap and plotheatmap.

The plotSubSEScurve function can plot the subtype set sample enrichment curve graph of the subpathway. The follows commands can plot the subtype set sample enrichment curve graph and we can see the enrichment distribution of the four subtypes of breast cancer samples in the subpathway 00120_9 activity. We can observe that the sample of subtype Basal is specifically enriched to the low expression region of subpathway 00120_9.

# plot enrichment score curve of the subpathway 00120_9 in all breast cancer subtypes
 plotSubSEScurve(Subspwresult,
                 spwid="00120_9", # The subpathway id which subpahtway is ploted
                 phenotype="all")  # Which phenotypic phenotype set enrichment curve is drawn


# plot enrichment score curve of the subpathway 00120_9 in the Basal breast cancer subtype.
 plotSubSEScurve(Subspwresult,spwid="00120_9",phenotype="Basal")

The plotSpwACmap function can plot subpathway activity change map (includes subpathway active change box plot and subpathway active change heat map). We can observe the changes in subpathway activity values in breast cancer subtypes or stages and the distribution of each subtype samples. The commands are as follows:

# plot the subpathway 00120_9 in the SubSEA function result.
 plotSpwACmap(Subspwresult,spwid="00120_9")

# plot the subpathway 00982_2 in the DCSA function result.
 plotSpwACmap(DCspwresult,spwid="00982_2")

The plotheatmap function presents a heat map of the subpathway matrix according to the user’s set conditions. The following commands plot heat map of significant up-regulation or down-regulation subpathways. We can observe the obvious block area from the heat map in each subtype.

# load depend package.
 require(pheatmap)
#> Loading required package: pheatmap
# plot significant up-regulation subpathway heat map specific for each breast cancer subtype.
 plotheatmap(Subspwresult,
             fdr.th=0.01,      # Cutoff value for fdr
             plotSubSEA=TRUE,  # Indicate that the input data is the result of the SubSEA function
             SES="positive",   # Obtain a subpathway with a positive SES value
             phenotype="all")

# plot significant down-regulation subpathway heat map specific for each breast cancer subtype.
 plotheatmap(Subspwresult,plotSubSEA=TRUE,fdr.th=0.01,SES="negative",phenotype="all")

We can also draw a single subtype-specific significant pathway heat map. We can see that there are distinct specific regions under the Basal subtype sample. The commands and results are as follows:

# plot Basal subtype specific significant subpathway heat map.
 plotheatmap(Subspwresult,plotSubSEA=TRUE,fdr.th=0.01,SES="all",phenotype="Basal")

Since the DCSA function is used to mine the subpathways that change dynamically with the phenotype, the subpathway heat map is scattered. As follows:

# plot a heat map of the subpathway that is significantly associated with breast cancer stages.
 plotheatmap(DCspwresult,plotSubSEA=FALSE,fdr.th=0.01)

The plotSpwPSheatmap function presents a heat map of the T-test P-value of the activity of the subpathway between the phenotypes. The lower the number in the cells in the heat map, the greater the difference in the activity of the subpathways between the two phenotypes. The following commands plot the heat map. The activity of subpathway 00120_9 is significantly different from other subtypes in the breast cancer basal subtype.

# get the Subspwresult which is the result of SubSEA method.
 Subspwresult<-get("Subspwresult")
 # plot significant heat map between the activity of the subpathway in each subtype of breast cancer.
 plotSpwPSheatmap(Subspwresult,spwid="00120_9")

The plotSpwPSheatmap function can also be used for the results of the DCSA method.

# get the DCspwresult which is the result of DCSA method. 
 DCspwresult<-get("DCspwresult")
##' # plot significant heat map between the activity of the subpathway in each stage of breast cancer.
 plotSpwPSheatmap(DCspwresult,spwid="00982_2")

The function plotSpwNetmap for visualization of a subpathway network map. The commands are as follows:

# load depend package.
 library(igraph)
# plot network graph of the subpathway 00982_2
plotSpwNetmap("00982_2")