Breast cancer classification with AdaSampling

Pengyi Yang (original version by Dinuka Perera)

2019-05-21

Here we will examine how AdaSampling works on the Wisconsin Breast Cancer dataset, brca, from the UCI Machine Learning Repository, which is included as part of this package. For more information about the variables, try ?brca. This dataset contains nine features, with a tenth column containing the class labels, malignant or benign.

head(brca)
#>   clt ucs uch mad ecs nuc chr ncl mit       cla
#> 1   8  10  10   8   7  10   9   7   1 malignant
#> 2   5   3   3   3   2   3   4   4   1 malignant
#> 3   8   7   5  10   7   9   5   5   4 malignant
#> 4   7   4   6   4   6   1   4   3   1 malignant
#> 5  10   7   7   6   4  10   4   1   2 malignant
#> 6   7   3   2  10   5  10   5   4   4 malignant

First, clean up the dataset and transform it into the required format: a numeric feature matrix and a 0/1 class label vector.

brca.mat <- apply(X = brca[,-10], MARGIN = 2, FUN = as.numeric)
brca.cls <- sapply(X = brca$cla, FUN = function(x) {ifelse(x == "malignant", 1, 0)})
rownames(brca.mat) <- paste("p", 1:nrow(brca.mat), sep="_")

Examining this dataset shows that the classes are imbalanced: 444 benign (0) samples and 239 malignant (1) samples.

table(brca.cls)
#> brca.cls
#>   0   1 
#> 444 239
brca.cls
#>   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
#> [246] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [281] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [316] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [351] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [386] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [421] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [456] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [491] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [526] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [561] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [596] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [631] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [666] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

To demonstrate how AdaSampling handles noisy class labels, we need to introduce some noise into this dataset by randomly flipping a proportion of the class labels. More noise is added to the positive observations: 40% of positives are relabelled as negative, while 30% of negatives are relabelled as positive.

set.seed(1)
pos <- which(brca.cls == 1)
neg <- which(brca.cls == 0)
brca.cls.noisy <- brca.cls
brca.cls.noisy[sample(pos, floor(length(pos) * 0.4))] <- 0
brca.cls.noisy[sample(neg, floor(length(neg) * 0.3))] <- 1
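As a quick sanity check, the expected class counts after flipping can be computed directly from the counts above (239 positives, 444 negatives), without touching the data:

```r
# Expected class counts after label flipping (counts taken from table(brca.cls))
n.pos <- 239; n.neg <- 444
flip.pos <- floor(n.pos * 0.4)  # 95 positives relabelled as 0
flip.neg <- floor(n.neg * 0.3)  # 133 negatives relabelled as 1
c("0" = n.neg - flip.neg + flip.pos, "1" = n.pos - flip.pos + flip.neg)
#>   0   1 
#> 406 277
```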

Examining the noisy class labels reveals noise has been added:

table(brca.cls.noisy)
#> brca.cls.noisy
#>   0   1 
#> 406 277
brca.cls.noisy
#>   [1] 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1 0 0 0 0 0 0 1 1 1 0 1 0 1 0 0 1
#>  [36] 1 0 1 0 0 1 0 0 0 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0
#>  [71] 1 1 0 0 0 1 1 1 0 1 1 1 0 0 0 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
#> [106] 0 1 0 1 0 0 1 1 1 0 1 1 0 0 1 0 0 1 1 1 0 1 1 0 0 1 1 1 0 1 1 1 1 1 1
#> [141] 0 1 1 1 0 0 1 0 0 0 1 1 1 1 1 0 0 1 1 0 1 0 0 1 0 0 0 1 1 1 0 0 1 0 0
#> [176] 1 1 1 0 1 1 0 1 1 1 0 0 1 1 1 1 0 1 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 0 0
#> [211] 0 1 0 1 0 0 0 1 1 1 1 0 1 0 0 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 0 0 0 0
#> [246] 0 0 0 0 0 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 1 1
#> [281] 0 1 0 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 1 1 0 1 0
#> [316] 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 1 0 1 0 0 1 1
#> [351] 0 1 0 0 1 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 0
#> [386] 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 1
#> [421] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0
#> [456] 0 1 1 0 1 0 0 1 1 0 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 1 1 0 1 0 0
#> [491] 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 1 1
#> [526] 0 0 0 1 0 0 1 1 1 1 1 0 1 1 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0
#> [561] 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 1
#> [596] 0 0 1 0 1 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0
#> [631] 0 1 1 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
#> [666] 0 0 0 0 0 0 0 0 1 1 0 0 1 1 0 0 1 0

We can now run AdaSampling on this data. For more information, see ?adaSample.

Ps <- rownames(brca.mat)[which(brca.cls.noisy == 1)]
Ns <- rownames(brca.mat)[which(brca.cls.noisy == 0)]

brca.preds <- adaSample(Ps, Ns, train.mat=brca.mat, test.mat=brca.mat,
                  classifier = "knn", C= 1, sampleFactor = 1)
head(brca.preds)
#>             P         N
#> p_1 1.0000000 0.0000000
#> p_2 0.6666667 0.3333333
#> p_3 1.0000000 0.0000000
#> p_4 0.6000000 0.4000000
#> p_5 0.8000000 0.2000000
#> p_6 1.0000000 0.0000000

The accuracy of the noisy labels is simply the proportion that still agree with the ground truth:

accuracy <- sum(brca.cls.noisy == brca.cls) / length(brca.cls)
accuracy
#> [1] 0.6661786
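This value follows directly from the flipping step: 95 + 133 = 228 of the 683 labels were flipped, so the fraction of labels left unflipped is (683 āˆ’ 228) / 683.

```r
# Accuracy of the noisy labels equals the fraction of labels left unflipped
n.flipped <- floor(239 * 0.4) + floor(444 * 0.3)  # 95 + 133 = 228
(683 - n.flipped) / 683
#> [1] 0.6661786
```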

Thresholding the AdaSampling prediction probability for the positive class at 0.5 and comparing against the ground truth shows a substantial improvement:

accuracyWithAdaSample <- sum(ifelse(brca.preds[,"P"] > 0.5, 1, 0) == brca.cls) / length(brca.cls)
accuracyWithAdaSample
#> [1] 0.9502196

The table gives the prediction probability for both the positive ("P") and negative ("N") class label for each row of the test set.

To see how effective adaSample() is at removing noise, we will use the adaSvmBenchmark() function to compare its performance to a regular classification process.

This procedure compares classification performance under four conditions: first, the original dataset (with correct label information); second, the noisy dataset without AdaSampling; third, the noisy dataset with AdaSampling; and fourth, AdaSampling applied multiple times in the form of an ensemble.
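The ensemble idea can be illustrated with a small base-R sketch. Note this is illustrative only: the probability vectors below are dummy values standing in for the "P" column from several independent AdaSampling runs, whose predictions are averaged before thresholding to smooth out the variability of any single run.

```r
# Illustrative only: three dummy "P" probability vectors stand in for
# three independent AdaSampling runs on the same test set
run1 <- c(0.9, 0.4, 0.6)
run2 <- c(0.8, 0.3, 0.4)
run3 <- c(1.0, 0.6, 0.7)
p.avg <- rowMeans(cbind(run1, run2, run3))  # average across the ensemble
ifelse(p.avg > 0.5, 1, 0)                   # threshold the averaged probability
#> [1] 1 0 1
```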

adaSvmBenchmark(data.mat = brca.mat, data.cls = brca.cls.noisy, data.cls.truth = brca.cls, cvSeed=1)
#>                Se    Sp    F1
#> Original    0.971 0.971 0.959
#> Baseline    0.748 0.975 0.831
#> AdaSingle   0.987 0.943 0.944
#> AdaEnsemble 0.987 0.957 0.956
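As a rough consistency check (an illustration, assuming the true class sizes of 239 positives and 444 negatives), F1 can be approximately reconstructed from Se (sensitivity, i.e. recall) and Sp (specificity). For the "Original" row:

```r
# Reconstruct F1 from Se and Sp using the true class sizes (239 pos, 444 neg)
se <- 0.971; sp <- 0.971
tp <- se * 239                # expected true positives
fp <- (1 - sp) * 444          # expected false positives
precision <- tp / (tp + fp)
round(2 * precision * se / (precision + se), 3)
#> [1] 0.959
```

The other rows deviate by a few thousandths, presumably because the benchmark averages metrics over cross-validation folds rather than pooling predictions.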