An Introduction to the package M2SMF

This vignette presents the package M2SMF, which implements a framework named multi-modality similarity matrix factorization (M2SMF) for the integrative analysis of multiple modality data in R. The objective is to provide an implementation of the proposed method, which is designed to handle high-dimensional multiple modality data in bioinformatics. This is achieved by first constructing a similarity matrix for each modality to reduce the dimension, and then jointly factorizing these matrices into a shared sub-matrix with a group sparsity constraint and several modality private sub-matrices. The group sparsity constraint on the shared coefficient sub-matrix acts on the samples in the same group and allows each modality to exploit only a subset of the dimensions of the global latent space, so that latent dimensions can be shared across any subset of the views rather than across all views only.

Installation

The latest stable version of the package can be installed from any CRAN repository mirror:

#Install
install.packages('M2SMF')
#Load
library(M2SMF)

The latest development version is available from https://cran.r-project.org/package=M2SMF and may be downloaded from there and installed manually:

install.packages('/path/to/file/M2SMF.tar.gz',repos=NULL,type="source")

Support: Users interested in this package are encouraged to email Xiaoyao Yin (yinxy1992@sina.com) with enquiries, bug reports, feature requests, suggestions or M2SMF-related discussions.

Usage

We give an example of how to use this package below.

Simulation data generation

We generate simulated data with two modalities and five clusters using the function simu_data_gen. Each modality contains 100 samples; each sample has 100 features in the first modality and 50 features in the second. Samples are assigned equally to 5 groups, i.e. 20 samples per group. The data are drawn with the rnorm function in R, with means varying from 10 to 50 and variance 1 for the first modality, and means varying from 5 to 30 and variance 1 for the second. The data can be generated by running:

data_list = simu_data_gen()
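To make the structure of the simulated data concrete, the following base-R sketch builds a comparable two-modality data set by hand. It is not the source of simu_data_gen: the helper name make_modality, the evenly spaced group means, the row-wise sample layout and the seed are all assumptions made for illustration.

```r
# Hypothetical helper (not part of M2SMF): one modality with equally sized
# groups, each drawn from a normal distribution with its own mean and variance 1.
make_modality <- function(n_per_group, n_features, group_means) {
  blocks <- lapply(group_means, function(m) {
    matrix(rnorm(n_per_group * n_features, mean = m, sd = 1),
           nrow = n_per_group, ncol = n_features)
  })
  do.call(rbind, blocks)  # samples in rows, features in columns
}

set.seed(1)  # arbitrary seed, for reproducibility only
sim_list <- list(
  make_modality(20, 100, seq(10, 50, length.out = 5)),  # first modality
  make_modality(20, 50, seq(5, 30, length.out = 5))     # second modality
)
sapply(sim_list, dim)  # one column of dimensions per modality
```

Because the groups are stacked in order, rows 1 to 20 belong to the first group, rows 21 to 40 to the second, and so on.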

Simulation data groundtruth assignment and permutation

Label assignment: According to the data generation process, we assign the ground-truth labels to the generated data as:

truelabel = rep(c(1:5),each=20)

These labels will be used to test the clustering ability afterwards.

Data permutation: Since the generated data structure is too easy for the clustering task, we permute some of the samples in one modality to test the clustering ability of the proposed method.

#Assign the number of samples to permute
pert_num = 10
#Randomly sample *pert_num* samples from all the samples
index1  =  sample(c(1:100),size=pert_num)
#Permute the samples by index
index2  =  gtools::permute(index1)
#Reassign them to the first modality data
temp_data = data_list[[1]]
sub_data  =  temp_data[index1,]
temp_data[index2,]  =  sub_data
data_list[[1]] = temp_data
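The effect of this reassignment is easier to see on a small matrix. The sketch below uses base R only, with sample() standing in for gtools::permute; the toy matrix, the variable names and the seed are illustrative assumptions.

```r
set.seed(42)                    # arbitrary seed, for reproducibility only
toy <- matrix(1:12, nrow = 6)   # 6 samples, 2 features; row r is (r, r + 6)

idx_from <- sample(seq_len(nrow(toy)), size = 3)  # rows selected for shuffling
idx_to   <- sample(idx_from)    # the same row indices in random order

shuffled <- toy
shuffled[idx_to, ] <- toy[idx_from, ]  # move the selected rows to new positions

# The shuffled matrix contains the same rows as before; only the selected
# rows may have traded places, so those samples can sit under a wrong label.
shuffled
```

Rows outside the selected indices are untouched, which is why only a fraction of the samples becomes inconsistent across modalities.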

Now we can cluster the samples with the proposed method and evaluate its performance by computing the normalized mutual information between truelabel and the predicted label with the function cal_NMI.

You should start from here if you are using your own data.

Data normalization: Features are measured in different units and on different scales, which biases the similarity matrix toward features with large values. To remove this effect, we normalize the original data matrix of each modality by column, so that each feature has mean 0 and variance 1. This is required only for continuous data; integer and binary data do not need normalization.

for (i in 1:length(data_list))
{
    data_list[[i]] = Standard_Normalization(data_list[[i]])
}
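As a point of comparison, column-wise standardization of this kind can be reproduced with base R's scale(). This is only a sketch under the assumption that samples are in rows and features in columns; how Standard_Normalization treats edge cases such as constant columns is not covered here.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
# Toy modality: 10 samples, 2 features on a large scale
X <- matrix(rnorm(20, mean = 50, sd = 10), nrow = 10, ncol = 2)

# Center each column to mean 0 and rescale it to standard deviation 1
X_std <- scale(X, center = TRUE, scale = TRUE)

colMeans(X_std)       # numerically zero for every feature
apply(X_std, 2, sd)   # one for every feature
```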

Distance calculation: We denote by ρ(x_i,x_j) the distance between any pair of samples x_i and x_j: the Euclidean distance for real values (function dist2eu), the chi-squared distance for integer values (function dist2chi), and an agreement-based measure for binary values (function dist2bin).

for (i in 1:length(data_list))
{
    data_list[[i]] = dist2eu(data_list[[i]],data_list[[i]])
}
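For intuition about the real-valued case, the pairwise Euclidean distance matrix of a small data set can be computed with base R's dist(). This is a sketch for illustration, not the implementation of dist2eu; in particular, whether dist2eu returns plain or squared distances should be checked in its documentation.

```r
# Three samples with two features each (rows are samples)
X <- rbind(c(0, 0), c(3, 4), c(6, 8))

# Full symmetric matrix of pairwise Euclidean distances
D <- as.matrix(dist(X, method = "euclidean"))
D
```

The result is a symmetric n x n matrix with zeros on the diagonal, so the dimension of each modality after this step depends only on the number of samples, not on the number of features.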

Similarity matrix construction: We use the scaled exponential similarity kernel to determine the weight between x_i and x_j, and then apply Laplacian normalization to bring the raw similarity matrix into a relatively tight interval. This prevents the scale of individual modalities from causing divergence. Both steps are implemented in the function affinityMatrix.

for (i in 1:length(data_list))
{
    data_list[[i]] = affinityMatrix(data_list[[i]])
}
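The two steps can be sketched in base R as follows. This is a hypothetical illustration, not the source of affinityMatrix: it assumes the scaled exponential kernel in the form W_ij = exp(-ρ(x_i,x_j)^2 / (mu * eps_ij)), with eps_ij the average of the mean K-nearest-neighbour distances of x_i and x_j and of ρ(x_i,x_j), followed by the symmetric normalization D^(-1/2) W D^(-1/2). The function name and the parameters K and mu are assumptions.

```r
# Hypothetical sketch. D: symmetric matrix of pairwise distances;
# K: number of neighbours; mu: kernel scale. Assumes distinct samples.
scaled_exp_similarity <- function(D, K = 3, mu = 0.5) {
  n <- nrow(D)
  # Mean distance from each sample to its K nearest neighbours (self excluded)
  knn_mean <- sapply(seq_len(n), function(i) mean(sort(D[i, -i])[seq_len(K)]))
  # Local scale eps_ij = (knn_mean_i + knn_mean_j + d_ij) / 3
  eps <- (outer(knn_mean, knn_mean, "+") + D) / 3
  W <- exp(-D^2 / (mu * eps))
  # Symmetric Laplacian-style normalization: deg^(-1/2) * W * deg^(-1/2)
  d_inv_sqrt <- 1 / sqrt(rowSums(W))
  W * outer(d_inv_sqrt, d_inv_sqrt)
}

set.seed(1)  # arbitrary seed, for reproducibility only
# Two well-separated groups of five samples with two features each
X <- rbind(matrix(rnorm(10, mean = 0), 5), matrix(rnorm(10, mean = 5), 5))
W <- scaled_exp_similarity(as.matrix(dist(X)))
round(W[1:3, 1:3], 3)
```

With this form of normalization every entry lies in (0, 1], which is the kind of tight interval the text refers to.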

M2SMF: Jointly factorize the matrices into a shared embedding matrix and several modality private basis matrices.

#Assign the parameters
lambda = 0.25
theta = 10^-4
k = 5
res = M2SMF(data_list,lambda,theta,k)

The clustering result is now stored in res.

Evaluating k: Select the most suitable number of clusters k by the normalized average modularity, computed with the function new_modularity.

#Assign the interval of k according to your data
k_min = 2
k_max = 30
#Initialize the variable
modularity_data = vector("numeric",(k_max-k_min+1))
#Test all the k
for (i in k_min:k_max)
{
    res = M2SMF(data_list,lambda,theta,i)
    modularity_data[i-k_min+1] = new_modularity(res,data_list)
}
#The best k is the one with maximum modularity
best_k = which(modularity_data==max(modularity_data),TRUE)+k_min-1

You can omit the following steps if you have no true labels as ground truth; we use them here to evaluate our method.

Robustness test: We test the robustness of our method by calculating the normalized mutual information of the true label and our predicted label. We can compare the performance of our method with others by this score, which is in the interval [0,1]. The larger the score, the more robust the method. We show the comparison of our method with SNF as an example.

#Calculate the NMI of our method *M2SMF*
#Use the k assigned above (after the loop over k, i equals k_max)
M2SMF_res = M2SMF(data_list,lambda,theta,k)
M2SMF_cluster = M2SMF_res$clusters
M2SMF_NMI = cal_NMI(truelabel,M2SMF_cluster)
#Calculate the NMI of *SNF*
SNF_res = SNF(data_list,20,10)
SNF_cluster = SNF_res$clusters
SNF_NMI = cal_NMI(truelabel,SNF_cluster)
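To show what the score measures, here is a minimal base-R sketch of normalized mutual information. It is not the source of cal_NMI: the helper name nmi_sketch is hypothetical, and the normalization by the geometric mean of the two entropies is one common convention among several, so values may differ from those of cal_NMI.

```r
# Hypothetical helper: NMI between two label vectors of equal length.
# Returns NaN if either labeling has a single cluster (zero entropy).
nmi_sketch <- function(a, b) {
  joint <- table(a, b) / length(a)   # joint distribution of the two labelings
  pa <- rowSums(joint)               # marginal distribution of a
  pb <- colSums(joint)               # marginal distribution of b
  entropy <- function(p) -sum(p[p > 0] * log(p[p > 0]))
  nz <- joint > 0                    # skip empty cells to avoid log(0)
  mi <- sum(joint[nz] * log(joint[nz] / outer(pa, pb)[nz]))
  mi / sqrt(entropy(pa) * entropy(pb))
}

labels <- rep(1:5, each = 20)
nmi_identical <- nmi_sketch(labels, labels)              # same partition
nmi_relabeled <- nmi_sketch(labels, rev(labels))         # same partition, names swapped
nmi_unrelated <- nmi_sketch(labels, rep(1:5, times = 20))  # independent partition
c(nmi_identical, nmi_relabeled, nmi_unrelated)
```

The score depends only on how the samples are partitioned, not on the cluster names, so a predicted labeling does not need to use the same numbering as the ground truth.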