How to train your model

Legana Fingerhut

2021-06-29

By default ampir predicts the probability of a protein to be an antimicrobial peptide (AMP) or not based on a trained SVM model with as input known AMP sequences corresponding to a wide diversity of organisms. However, within the predict_amps function there is a model argument that allows users to pass their own trained model object. Using a different trained model might be useful when users wish to e.g. use a taxonomic specific model to predict AMPs in a restricted group of taxa.

This vignette will go through a mock example of how you can train your own model using the caret package. For more information on how to use caret and the functions used within this example, please see the extensive documentation made by the author, Max Kuhn.

Step 1: Obtain input data

First, a positive and negative dataset have to be obtained. In this example, we want to predict AMPs in bats and decide to train a model using protein sequences found in bats. The positive dataset are AMPs and the negative dataset are random sequences. Both datasets were obtained from UniProt:

For the positive dataset:

library(ampir)
bat_pos <- read_faa(system.file("extdata/bat_positive.fasta.gz", package = "ampir"))
bat_pos$Label <- "Positive"
bat_pos <- remove_nonstandard_aa(bat_pos)

For the negative dataset:

bat_neg <- read_faa(system.file("extdata/bat_negative.fasta.gz", package = "ampir"))
bat_neg$Label <- "Negative"
bat_neg <- remove_nonstandard_aa(bat_neg)
bat_neg <- bat_neg[!bat_neg$seq_aa %in% bat_pos$seq_aa,]
bat_neg <- bat_neg[sample(nrow(bat_neg),78),]

Combine the positive and negative dataset

bats <- rbind(bat_pos, bat_neg)

Calculate features on the combined positive and negative dataset and add the label column

bats_features <- calculate_features(bats)
bats_features$Label <- as.factor(bats$Label)
rownames(bats_features) <- NULL
library(caret)

Split feature set data and create train and test set with caret

trainIndex <-createDataPartition(y=bats_features$Label, p=.7, list = FALSE)
bats_featuresTrain <-bats_features[trainIndex,]
bats_featuresTest <-bats_features[-trainIndex,]

Resample method using repeated cross validation and adding in a probability calculation with caret

trctrl_prob <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                            classProbs = TRUE)

Train model using a support vector machine with radial kernel with caret. Note: Other classification models are supported too. For example, to use a random forest model in caret, method could be changed from “svmRadial” to “ranger”.

my_bat_svm_model <- train(Label~.,
                       data = bats_featuresTrain[,-1], # excluding seq_name column
                       method="svmRadial",
                       trControl = trctrl_prob,
                       preProcess = c("center", "scale"))

Test model to get an indication of how well the model performs on test data with caret

my_bat_pred <- predict(my_bat_svm_model, bats_featuresTest)
cm <- confusionMatrix(my_bat_pred, bats_featuresTest$Label, positive = "Positive")

Subset from cm$byClass

Balanced Accuracy Precision Recall F1
0.98 0.96 1 0.98

Convert the bat feature test data to the original FASTA type format containing just the sequence name and sequence as this is the required input data for ampir

bat_test_set <- bats[bats$seq_name %in% bats_featuresTest$seq_name,][,-3]
rownames(bat_test_set) <- NULL

Use the trained bat model in ampir’s predict_amps function on the bat test set

my_bat_AMPs <- predict_amps(bat_test_set, min_len = 5, model = my_bat_svm_model)

my_bat_AMPs sample

seq_name seq_aa prob_AMP
1 G1PHN9… MKIYYYLLHF… 0.996
2 G1P1T7… MKALLTLGLL… 0.984
3 G1QAP4… MKALLTLGLL… 0.981
44 Q95727… MTNIRKTHPL… 0.018
45 P14391… VHLSGEEKAA… 0.025
46 Q330H3… MNPYIFFIIM… 0.024