kFoldMaskTensor

In this vignette, we introduce a cross-validation approach using a mask matrix/tensor (Fujita 2018; Owen 2009) to estimate the optimal rank parameter for matrix/tensor decomposition. A mask matrix/tensor has the same size as the data matrix/tensor and contains only 0 and 1 elements, where 0 means an unobserved value and 1 means an observed value. In this approach, some of the non-zero, non-NA elements are randomly set to 0 and then estimated from the remaining elements via the reconstruction produced by the matrix/tensor decomposition. This can be considered cross-validation because the elements specified as 1 form the training dataset and the elements specified as 0 form the test dataset.
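To make the mask idea concrete, the following base-R sketch builds k mask matrices whose zero entries partition the observed (non-zero, non-NA) elements. This is an illustration only, not nnTensor's implementation; k_fold_masks is a hypothetical helper name.

```r
# Illustrative sketch, not nnTensor's implementation: build k mask matrices
# whose 0 entries partition the observed (non-zero, non-NA) elements.
k_fold_masks <- function(X, k = 3) {
  idx <- which(X != 0 & !is.na(X))              # indices of observed entries
  folds <- split(sample(idx), rep(1:k, length.out = length(idx)))
  lapply(folds, function(test_idx) {
    M <- array(1, dim = dim(X))                 # 1 = training (kept)
    M[test_idx] <- 0                            # 0 = held-out test entries
    M
  })
}

X <- matrix(c(1, 0, 3, 2, NA, 5), nrow = 2)
masks <- k_fold_masks(X, k = 2)
# Every observed entry is held out in exactly one of the k masks:
held_out <- Reduce(`+`, lapply(masks, function(M) M == 0))
all(held_out[which(X != 0 & !is.na(X))] == 1)   # TRUE
```

Note that zero and NA entries are never held out, so the test error is driven only by genuinely observed values.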
Here, we use these packages.
library("nnTensor")
library("ggplot2")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Here, we use three types of demo data as follows:
data_matrix <- toyModel("NMF")
data_matrices <- toyModel("siNMF_Easy")
data_tensor <- toyModel("CP")
To set the mask matrix/tensor, here we use kFoldMaskTensor, which divides only the elements other than NA and 0 into k mask matrices/tensors. In NMF, the mask matrix can be specified as M, and the rank parameter (J) with the smallest test error is considered the optimal rank. Here, three mask matrices are prepared for each rank, and the average is used to estimate the optimal rank.
out_NMF <- expand.grid(replicate=1:3, rank=factor(1:10), value=0)
count <- 1
for(i in 1:10){
  masks_NMF <- kFoldMaskTensor(data_matrix, k=3)
  for(j in 1:3){
    out_NMF[count, 3] <- rev(
      NMF(data_matrix,
          M = masks_NMF[[j]],
          J = i)$TestRecError)[1]
    count <- count + 1
  }
}
Looking at the average test error for each rank, the optimal rank appears to be around 8 to 10 (with some variation depending on random seeds).
ggplot(out_NMF, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error")
(group_by(out_NMF, rank) |>
  summarize(Avg = mean(value)) -> avg_test_error_NMF)
## # A tibble: 10 × 2
## rank Avg
## <fct> <dbl>
## 1 1 4193.
## 2 2 3329.
## 3 3 1516.
## 4 4 1511.
## 5 5 1328.
## 6 6 1212.
## 7 7 1307.
## 8 8 1145.
## 9 9 699.
## 10 10 793.
avg_test_error_NMF[which(avg_test_error_NMF$Avg == min(avg_test_error_NMF$Avg))[1], ]
## # A tibble: 1 × 2
## rank Avg
## <fct> <dbl>
## 1 9 699.
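The test error selected above is, in essence, the reconstruction error restricted to the held-out entries. A minimal sketch of that idea follows; this is an assumed definition for illustration, and nnTensor's TestRecError may differ in details such as scaling or per-iteration bookkeeping.

```r
# Sketch: evaluate reconstruction quality only where the mask M is 0,
# i.e. on the held-out test entries (assumed definition, for illustration).
test_rec_error <- function(X, X_hat, M) {
  sqrt(sum(((1 - M) * (X - X_hat))^2, na.rm = TRUE))
}

X     <- matrix(c(4, 2, 1, 3), nrow = 2)    # data
X_hat <- matrix(c(4, 2, 2, 3), nrow = 2)    # reconstruction from the factors
M     <- matrix(c(1, 1, 0, 1), nrow = 2)    # one entry held out (M == 0)
test_rec_error(X, X_hat, M)                 # 1: only the held-out entry counts
```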
As with NMF, the mask matrix M can be specified in NMTF, and the rank parameter (rank) with the smallest test error is considered the optimal rank. The following code is time-consuming and should be executed in your own environment.
out_NMTF <- expand.grid(replicate=1:3, rank2=1:10, rank1=1:10, value=0)
rank_NMTF <- paste0(out_NMTF$rank1, "-", out_NMTF$rank2)
out_NMTF <- cbind(out_NMTF, rank=factor(rank_NMTF, levels=unique(rank_NMTF)))
count <- 1
for(i in 1:10){
  for(j in 1:10){
    masks_NMTF <- kFoldMaskTensor(data_matrix, k=3)
    for(k in 1:3){
      out_NMTF[count, 4] <- rev(
        NMTF(data_matrix,
             M = masks_NMTF[[k]],
             rank = c(i, j))$TestRecError)[1]
      count <- count + 1
    }
  }
}
ggplot(out_NMTF, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
(out_NMTF |>
  group_by(rank1, rank2) |>
  summarize(Avg = mean(value)) -> avg_test_error_NMTF)
avg_test_error_NMTF[which(avg_test_error_NMTF$Avg == min(avg_test_error_NMTF$Avg))[1], ]
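For a two-dimensional rank search like this, it can also help to lay the averaged errors out as a rank1 × rank2 grid. The following self-contained sketch uses random placeholder values in place of the real test errors (base R only; df and err_grid are hypothetical names):

```r
# Toy stand-in for out_NMTF: three replicates per (rank1, rank2) pair,
# with random placeholder values instead of real test errors.
df <- expand.grid(replicate = 1:3, rank2 = 1:10, rank1 = 1:10, value = 0)
df$value <- runif(nrow(df))

# Average over replicates and arrange as a 10 x 10 grid for inspection.
avg <- aggregate(value ~ rank1 + rank2, data = df, FUN = mean)
err_grid <- matrix(NA_real_, nrow = 10, ncol = 10)
err_grid[cbind(avg$rank1, avg$rank2)] <- avg$value

# Row and column of the smallest entry give the candidate (rank1, rank2).
which(err_grid == min(err_grid), arr.ind = TRUE)
```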
As with NMF, mask matrices M can be specified in siNMF, and the rank parameter (J) with the smallest test error is considered the optimal rank. The following code is time-consuming and should be executed in your own environment.
out_siNMF <- expand.grid(replicate=1:3, rank=factor(1:10), value=0)
count <- 1
for(i in 1:10){
  # Split each data matrix once so the three folds stay consistent
  # (each call to kFoldMaskTensor draws a new random split)
  masks1 <- kFoldMaskTensor(data_matrices[[1]], k=3)
  masks2 <- kFoldMaskTensor(data_matrices[[2]], k=3)
  masks3 <- kFoldMaskTensor(data_matrices[[3]], k=3)
  masks_siNMF <- lapply(1:3, function(x){
    list(masks1[[x]], masks2[[x]], masks3[[x]])
  })
  for(j in 1:3){
    out_siNMF[count, 3] <- rev(
      siNMF(data_matrices,
            M = masks_siNMF[[j]],
            J = i)$TestRecError)[1]
    count <- count + 1
  }
}
ggplot(out_siNMF, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error")
(out_siNMF |>
  group_by(rank) |>
  summarize(Avg = mean(value)) -> avg_test_error_siNMF)
avg_test_error_siNMF[which(avg_test_error_siNMF$Avg == min(avg_test_error_siNMF$Avg))[1], ]
As with NMF, mask matrices M can be specified in jNMF, and the rank parameter (J) with the smallest test error is considered the optimal rank. The following code is time-consuming and should be executed in your own environment.
out_jNMF <- expand.grid(replicate=1:3, rank=factor(1:10), value=0)
count <- 1
for(i in 1:10){
  # Split each data matrix once so the three folds stay consistent
  # (each call to kFoldMaskTensor draws a new random split)
  masks1 <- kFoldMaskTensor(data_matrices[[1]], k=3)
  masks2 <- kFoldMaskTensor(data_matrices[[2]], k=3)
  masks3 <- kFoldMaskTensor(data_matrices[[3]], k=3)
  masks_jNMF <- lapply(1:3, function(x){
    list(masks1[[x]], masks2[[x]], masks3[[x]])
  })
  for(j in 1:3){
    out_jNMF[count, 3] <- rev(
      jNMF(data_matrices,
           M = masks_jNMF[[j]],
           J = i)$TestRecError)[1]
    count <- count + 1
  }
}
ggplot(out_jNMF, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error")
(out_jNMF |>
  group_by(rank) |>
  summarize(Avg = mean(value)) -> avg_test_error_jNMF)
avg_test_error_jNMF[which(avg_test_error_jNMF$Avg == min(avg_test_error_jNMF$Avg))[1], ]
As with NMF, the mask tensor M can be specified in NTF, and the rank parameter (rank) with the smallest test error is considered the optimal rank. The following code is time-consuming and should be executed in your own environment.
out_NTF <- expand.grid(replicate=1:3, rank=factor(1:10), value=0)
count <- 1
for(i in 1:10){
  masks_NTF <- kFoldMaskTensor(data_tensor, k=3)
  for(j in 1:3){
    out_NTF[count, 3] <- rev(
      NTF(data_tensor,
          M = masks_NTF[[j]],
          rank = i)$TestRecError)[1]
    count <- count + 1
  }
}
ggplot(out_NTF, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error")
(group_by(out_NTF, rank) |>
  summarize(Avg = mean(value)) -> avg_test_error_NTF)
avg_test_error_NTF[which(avg_test_error_NTF$Avg == min(avg_test_error_NTF$Avg))[1], ]
As with NMF, the mask tensor M can be specified in NTD, and the rank parameter (rank) with the smallest test error is considered the optimal rank. The following code is time-consuming and should be executed in your own environment.
out_NTD <- expand.grid(replicate=1:3, rank3=1:5, rank2=1:5, rank1=1:5, value=0)
rank_NTD <- paste0(out_NTD$rank1, "-", out_NTD$rank2, "-", out_NTD$rank3)
out_NTD <- cbind(out_NTD, rank=factor(rank_NTD, levels=unique(rank_NTD)))
count <- 1
for(i in 1:5){
  for(j in 1:5){
    for(k in 1:5){
      masks_NTD <- kFoldMaskTensor(data_tensor, k=3)
      for(l in 1:3){  # use l here so the rank index k is not shadowed
        out_NTD[count, 5] <- rev(
          NTD(data_tensor,
              M = masks_NTD[[l]],
              rank = c(i, j, k))$TestRecError)[1]
        count <- count + 1
      }
    }
  }
}
ggplot(out_NTD, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
(out_NTD |>
  group_by(rank1, rank2, rank3) |>
  summarize(Avg = mean(value)) -> avg_test_error_NTD)
avg_test_error_NTD[which(avg_test_error_NTD$Avg == min(avg_test_error_NTD$Avg))[1], ]
sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.1.2 ggplot2_3.4.2 rTensor_1.4.8 nnTensor_1.2.0
##
## loaded via a namespace (and not attached):
## [1] viridis_0.6.3 sass_0.4.6 utf8_1.2.3 generics_0.1.3
## [5] tcltk_4.3.0 digest_0.6.31 magrittr_2.0.3 evaluate_0.21
## [9] grid_4.3.0 RColorBrewer_1.1-3 fastmap_1.1.1 maps_3.4.1
## [13] jsonlite_1.8.5 misc3d_0.9-1 gridExtra_2.3 fansi_1.0.4
## [17] spam_2.9-1 viridisLite_0.4.2 scales_1.2.1 jquerylib_0.1.4
## [21] cli_3.6.1 rlang_1.1.1 munsell_0.5.0 withr_2.5.0
## [25] cachem_1.0.8 yaml_2.3.7 tools_4.3.0 colorspace_2.1-0
## [29] vctrs_0.6.2 R6_2.5.1 lifecycle_1.0.3 plot3D_1.4
## [33] MASS_7.3-60 pkgconfig_2.0.3 pillar_1.9.0 bslib_0.5.0
## [37] gtable_0.3.3 glue_1.6.2 Rcpp_1.0.10 fields_14.1
## [41] xfun_0.39 tibble_3.2.1 tidyselect_1.2.0 highr_0.10
## [45] knitr_1.43 farver_2.1.1 htmltools_0.5.5 rmarkdown_2.22
## [49] labeling_0.4.2 tagcloud_0.6 dotCall64_1.0-2 compiler_4.3.0