This vignette presents the **M2SMF** package, which implements a framework named multi-modality similarity matrix factorization (M2SMF) for integrative analysis of multi-modality data in **R**. The objective is to provide an implementation of the proposed method, which is designed to handle high-dimensional multi-modality data in bioinformatics. The method first constructs a similarity matrix for each modality to reduce the dimension, and then jointly factorizes these matrices into a shared sub-matrix with a group sparsity constraint and several modality-private sub-matrices. The group sparsity constraint on the shared coefficient sub-matrix lets each modality use only a subset of the dimensions of the global latent space, so that latent dimensions can be shared across any subset of the modalities rather than only across all of them.

The latest stable version of the package can be installed from any CRAN repository mirror:

```
#Install
install.packages('M2SMF')
#Load
library(M2SMF)
```

The latest development version is available from https://cran.r-project.org/package=M2SMF; it can be downloaded from there and installed manually:

`install.packages('/path/to/file/M2SMF.tar.gz',repos=NULL,type="source")`

**Support**: Users interested in this package are encouraged to email Xiaoyao Yin (yinxy1992@sina.com) with enquiries, bug reports, feature requests, suggestions or M2SMF-related discussions.

The rest of this vignette gives an example of how to use this package.

We generate simulated data with two modalities and five clusters with the function *simu_data_gen*. Each modality consists of 100 samples; the first modality has 100 features per sample, while the second has 50. Samples are assigned equally to 5 groups, i.e. 20 samples in each group. The data are generated with the rnorm function in R, with mean varying from 10 to 50 and variance 1 for the first modality, and mean varying from 5 to 30 and variance 1 for the second. The data can be generated by running:

`data_list = simu_data_gen()`
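To make the structure of the simulated data concrete, here is a minimal base R sketch of data with the shape described above. It is not the source of *simu_data_gen*: the helper `simulate_modality`, the placement of samples in rows, and the equal spacing of the group means are all assumptions made for illustration.

```r
# Hypothetical re-creation of the simulated data described above;
# simulate_modality() is an illustrative helper, not part of M2SMF.
simulate_modality <- function(group_means, n_per_group, n_features) {
  # One row per sample; each group's values are drawn around its own mean
  blocks <- lapply(group_means, function(m) {
    matrix(rnorm(n_per_group * n_features, mean = m, sd = 1),
           nrow = n_per_group, ncol = n_features)
  })
  do.call(rbind, blocks)
}

set.seed(1)
sim_list <- list(
  # Assumption: group means equally spaced between 10 and 50
  simulate_modality(seq(10, 50, length.out = 5), 20, 100),
  # Assumption: group means equally spaced between 5 and 30
  simulate_modality(seq(5, 30, length.out = 5), 20, 50)
)

# Dimensions of each modality (rows = samples, columns = features)
sapply(sim_list, dim)
```

With this layout, rows 1 to 20 belong to the first group, rows 21 to 40 to the second, and so on, which matches the label vector used below.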

**Label assignment**: According to the data generation process, we assign the ground-truth labels to the generated data as:

`truelabel = rep(c(1:5),each=20)`

These labels will be used to evaluate the clustering performance afterwards.

**Data permutation**: Since the data structure is much too easy for the clustering task, we permute some of the samples in one modality to test the clustering ability of the proposed method.

```
#Assign the number of samples to permute
pert_num = 10
#Randomly sample *pert_num* samples from all the samples
#(the argument of sample() is named `size`, not `n`)
index1 = sample(c(1:100),size=pert_num)
#Permute the samples by index
index2 = gtools::permute(index1)
#Reassign them to the first modality data
temp_data = data_list[[1]]
sub_data = temp_data[index1,]
temp_data[index2,] = sub_data
data_list[[1]] = temp_data
```

Now we can cluster the samples with the proposed method and compare its performance by calculating the normalized mutual information with the function *cal_NMI* by inputting the truelabel and the predicted label.

**Data normalization**: We conduct this step because the features are measured with different metrics and on different scales, which easily biases the similarity matrix towards features with large values. To remove this negative effect, we normalize the original data matrix of each modality by column, such that each feature has mean 0 and variance 1. *This is required only for continuous data; integer and binary values do not need the normalization process.*

```
for (i in 1:length(data_list))
{
data_list[[i]] = Standard_Normalization(data_list[[i]])
}
```
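As a rough illustration of what column-wise normalization does, the following base R sketch centres and scales each feature. The helper `standardize_columns` is hypothetical and is not the implementation of *Standard_Normalization*; in particular, it uses the sample standard deviation (denominator n - 1), which may differ from the package.

```r
# Illustrative column standardization; standardize_columns() is a
# hypothetical helper, not the package's Standard_Normalization().
standardize_columns <- function(x) {
  x <- as.matrix(x)
  centered <- sweep(x, 2, colMeans(x), "-")
  sds <- apply(x, 2, sd)
  sds[sds == 0] <- 1  # constant columns stay at zero instead of NaN
  sweep(centered, 2, sds, "/")
}

set.seed(1)
# Two features on very different scales
toy <- cbind(rnorm(30, mean = 100, sd = 20), rnorm(30, mean = 0, sd = 0.1))
z <- standardize_columns(toy)

colMeans(z)      # numerically zero for both features
apply(z, 2, sd)  # one for both features
```

After this step both features contribute on a comparable scale to any distance computed between samples.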

**Distance calculation**: We denote by ρ(x_i,x_j) the distance between any pair of samples x_i and x_j. It is the Euclidean distance for real values (function *dist2eu*), the chi-squared distance for integer values (function *dist2chi*) and an agreement-based measure for binary values (function *dist2bin*).

```
for (i in 1:length(data_list))
{
data_list[[i]] = dist2eu(data_list[[i]],data_list[[i]])
}
```
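For readers who want to see the computation behind the real-valued case, here is a small base R sketch of pairwise Euclidean distances between the rows of two matrices. The helper `pairwise_euclidean` is hypothetical; whether *dist2eu* returns the distance or the squared distance is not stated here, so this sketch should not be read as its exact output.

```r
# Illustrative pairwise Euclidean distance between rows of x and rows of y;
# pairwise_euclidean() is a hypothetical helper, not the package's dist2eu().
pairwise_euclidean <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b, for all row pairs at once
  sq <- outer(rowSums(x^2), rowSums(y^2), "+") - 2 * tcrossprod(x, y)
  sqrt(pmax(sq, 0))  # clamp tiny negative values caused by rounding
}

toy <- rbind(c(0, 0), c(3, 4), c(6, 8))
d <- pairwise_euclidean(toy, toy)
d
```

The result agrees with `as.matrix(dist(toy))` from base R, which can serve as a quick sanity check on small inputs.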

**Similarity matrix construction**: We use the scaled exponential similarity kernel to determine the weight between x_i and x_j, and then apply Laplacian normalization to bring the rough similarity matrix into a relatively tight interval. This avoids the negative effect of modalities whose values are on a very different scale, which might lead to divergence. All of the above is implemented in the function *affinityMatrix*.

```
for (i in 1:length(data_list))
{
data_list[[i]] = affinityMatrix(data_list[[i]])
}
```
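The two steps named above can be sketched in base R as follows. This is a sketch under stated assumptions, not the code of *affinityMatrix*: the kernel follows the scaled exponential form commonly used in similarity network fusion, exp(-ρ²(x_i,x_j) / (μ ε_ij)), and the Laplacian normalization is taken to be the symmetric form D^(-1/2) W D^(-1/2). The function name and the parameters `n_neighbors` and `mu` are hypothetical.

```r
# Illustrative scaled exponential kernel followed by symmetric (Laplacian-style)
# normalization; scaled_exp_similarity(), n_neighbors and mu are hypothetical.
scaled_exp_similarity <- function(d, n_neighbors = 3, mu = 0.5) {
  d <- unname(as.matrix(d))
  # Mean distance from each sample to its nearest neighbours (self excluded)
  local_scale <- apply(d, 1, function(row) mean(sort(row)[2:(n_neighbors + 1)]))
  # eps[i, j] averages both local scales and the pairwise distance itself
  eps <- (outer(local_scale, local_scale, "+") + d) / 3
  w <- exp(-d^2 / (mu * eps))
  # Symmetric normalization D^(-1/2) W D^(-1/2), with D the diagonal of row sums
  inv_sqrt_deg <- 1 / sqrt(rowSums(w))
  w * outer(inv_sqrt_deg, inv_sqrt_deg)
}

set.seed(1)
# Two well-separated groups of five samples each
toy <- rbind(matrix(rnorm(10, mean = 0, sd = 0.1), nrow = 5),
             matrix(rnorm(10, mean = 5, sd = 0.1), nrow = 5))
s <- scaled_exp_similarity(as.matrix(dist(toy)))

s[1, 2]  # similarity within a group
s[1, 6]  # similarity across groups, much smaller
```

The local scaling makes the kernel width adapt to the density around each sample, and the normalization keeps all entries between 0 and 1 regardless of the scale of the input distances.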

**M2SMF**: Jointly factorize the matrices into a shared embedding matrix and several modality private basis matrices.

```
#Assign the parameters
lambda = 0.25
theta = 10^-4
k = 5
res = M2SMF(data_list,lambda,theta,k)
```

The clustering result is now available in `res`; the predicted cluster labels are stored in `res$clusters`.

**Evaluating k**: Select the most suitable number of clusters k by the normalized average modularity, computed with the function *new_modularity*.

```
#Assign the interval of k according to your data
k_min = 2
k_max = 30
#Initialize the variable
modularity_data = vector("numeric",(k_max-k_min+1))
#Test all the k
for (i in k_min:k_max)
{
res = M2SMF(data_list,lambda,theta,i)
modularity_data[i-k_min+1] = new_modularity(res,data_list)
}
#The most suitable k is the one with maximum modularity
#(which.max returns the first maximum if there are ties)
best_k = which.max(modularity_data)+k_min-1
```
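To give some intuition for why modularity can guide the choice of k, the sketch below computes the classical weighted modularity of a partition on a similarity matrix. It is not the normalized average modularity of *new_modularity*, whose exact definition is not reproduced here; the helper `weighted_modularity` is hypothetical.

```r
# Illustrative classical (Newman-style) weighted modularity of a partition;
# weighted_modularity() is a hypothetical helper, not new_modularity().
weighted_modularity <- function(w, clusters) {
  w <- as.matrix(w)
  total <- sum(w)        # total edge weight, counted in both directions
  degree <- rowSums(w)   # weighted degree of each sample
  same <- outer(clusters, clusters, "==")
  # Observed minus expected weight, summed over pairs in the same cluster
  sum((w - outer(degree, degree) / total) * same) / total
}

# Toy similarity matrix with two clear blocks of three samples
w <- matrix(0.05, nrow = 6, ncol = 6)
w[1:3, 1:3] <- 1
w[4:6, 4:6] <- 1

q_true <- weighted_modularity(w, c(1, 1, 1, 2, 2, 2))
q_wrong <- weighted_modularity(w, c(1, 2, 1, 2, 1, 2))
c(q_true, q_wrong)
```

A partition that follows the block structure scores higher than one that cuts across it, which is the property the loop above relies on when comparing different values of k.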

**Robustness test**: We test the robustness of our method by calculating the normalized mutual information (NMI) between the true labels and the predicted labels. This score lies in the interval [0,1] and can be used to compare our method with others: the larger the score, the more robust the method. We show the comparison of our method with *SNF* as an example.

```
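#Calculate the NMI of our method *M2SMF*, using the k assigned above
M2SMF_res = M2SMF(data_list,lambda,theta,k)
M2SMF_cluster = M2SMF_res$clusters
M2SMF_NMI = cal_NMI(truelabel,M2SMF_cluster)
#Calculate the NMI of *SNF* (assumed here to come from the SNFtool package):
#SNF() returns a fused similarity matrix, which is then clustered
SNF_res = SNFtool::SNF(data_list,20,10)
SNF_cluster = SNFtool::spectralClustering(SNF_res,k)
SNF_NMI = cal_NMI(truelabel,SNF_cluster)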
```
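For reference, normalized mutual information can be sketched in a few lines of base R. The helper `nmi_sketch` is hypothetical and normalizes by the geometric mean of the two entropies, which is one common convention; *cal_NMI* may use a different normalization, so the values need not match exactly.

```r
# Illustrative NMI between two label vectors; nmi_sketch() is a hypothetical
# helper and may normalize differently from the package's cal_NMI().
nmi_sketch <- function(a, b) {
  joint <- unclass(table(a, b)) / length(a)  # joint label distribution
  pa <- rowSums(joint)
  pb <- colSums(joint)
  entropy <- function(p) {
    p <- p[p > 0]
    -sum(p * log(p))
  }
  nz <- joint > 0
  mi <- sum(joint[nz] * log(joint[nz] / outer(pa, pb)[nz]))
  denom <- sqrt(entropy(pa) * entropy(pb))
  if (denom == 0) return(0)  # one labelling has a single cluster
  mi / denom
}

truth <- rep(1:5, each = 20)
# Same partition with the cluster names shuffled
relabelled <- rep(c(3, 1, 5, 2, 4), each = 20)
nmi_same <- nmi_sketch(truth, relabelled)

# Two labellings that carry no information about each other
nmi_indep <- nmi_sketch(rep(1:2, each = 4), rep(1:2, times = 4))
c(nmi_same, nmi_indep)
```

Because the score depends only on how samples are grouped, not on the names of the clusters, it can be used directly to compare predicted labels from different methods against the same ground truth.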