How to train your model

By default, ampir predicts the probability that a protein is an antimicrobial peptide (AMP) using an SVM model trained on known AMP sequences from a wide diversity of organisms. However, the predict_amps function has a model argument that allows users to pass their own trained model object. A different trained model can be useful when, for example, users wish to use a taxon-specific model to predict AMPs in a restricted group of taxa.

This vignette walks through a mock example of how you can train your own model using the caret package. For more information on caret and the functions used in this example, please see the extensive documentation written by its author, Max Kuhn.

Step 1: Obtain input data

First, a positive and a negative dataset have to be obtained. In this example, we want to predict AMPs in bats, so we train a model using protein sequences found in bats. The positive dataset consists of AMPs and the negative dataset consists of random sequences. Both datasets were obtained from UniProt:

positive by using the search term keyword:antimicrobial taxonomy:“Chiroptera [9397]”
negative by using the search term taxonomy:“Chiroptera [9397]”

For the positive dataset:

read data
add “positive” label
remove non standard amino acids

library(ampir)

bat_pos <- read_faa(system.file("extdata/bat_positive.fasta.gz", package = "ampir"))
bat_pos$Label <- "Positive"
bat_pos <- remove_nonstandard_aa(bat_pos)
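The filtering step above can be pictured with a small base R sketch. It assumes that "non standard" means any letter outside the 20 standard amino acids and that a sequence containing such a letter is dropped as a whole. The toy sequences and the regular expression are illustrative only and are not ampir's own implementation.

```r
# Toy data frame in the two-column layout used in this vignette
# (seq_name, seq_aa); the sequences are made up for illustration.
toy <- data.frame(
  seq_name = c("seq_1", "seq_2", "seq_3"),
  seq_aa   = c("MKALLTLGLL", "MKXLLTBGLL", "VHLSGEEKAA"),
  stringsAsFactors = FALSE
)

# TRUE when a sequence consists only of the 20 standard amino acid letters
is_standard <- grepl("^[ACDEFGHIKLMNPQRSTVWY]+$", toy$seq_aa)

# Keep the clean sequences and label them, mirroring the steps above
toy_clean <- toy[is_standard, ]
toy_clean$Label <- "Positive"
```

Here seq_2 is removed because it contains X and B, which are not standard amino acid letters.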

For the negative dataset:

read data
add “negative” label
remove non standard amino acids
remove sequences (if any) that are also present in the positive dataset
randomly select the same number of sequences that are in the positive dataset

bat_neg <- read_faa(system.file("extdata/bat_negative.fasta.gz", package = "ampir"))
bat_neg$Label <- "Negative"
bat_neg <- remove_nonstandard_aa(bat_neg)
bat_neg <- bat_neg[!bat_neg$seq_aa %in% bat_pos$seq_aa,]
bat_neg <- bat_neg[sample(nrow(bat_neg), nrow(bat_pos)),] # same size as the positive set (78 sequences here)
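Because the negative set is drawn at random, the selected rows change between runs. A minimal sketch of how the draw could be made repeatable is shown here on made-up data. The data frames pos and neg, their contents, and the seed value are all hypothetical and only stand in for bat_pos and bat_neg.

```r
# Made-up stand-ins for the positive and negative data frames
pos <- data.frame(seq_name = paste0("pos_", 1:3),
                  seq_aa   = c("AAAK", "CCCK", "DDDK"),
                  stringsAsFactors = FALSE)
neg <- data.frame(seq_name = paste0("neg_", 1:6),
                  seq_aa   = c("AAAK", "EEEK", "FFFK", "GGGK", "HHHK", "IIIK"),
                  stringsAsFactors = FALSE)

# Drop negatives whose sequence also occurs in the positive set
neg <- neg[!neg$seq_aa %in% pos$seq_aa, ]

# Fix the seed so the same rows are drawn on every run (the value is arbitrary)
set.seed(42)

# Draw as many negatives as there are positives, without replacement
neg_balanced <- neg[sample(nrow(neg), nrow(pos)), ]
```

Deriving the sample size from nrow(pos) keeps the two classes balanced even if the positive set changes.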

Combine the positive and negative dataset

bats <- rbind(bat_pos, bat_neg)

Calculate features on the combined positive and negative dataset and add the label column

bats_features <- calculate_features(bats)
bats_features$Label <- as.factor(bats$Label)
rownames(bats_features) <- NULL

library(caret)

Split feature set data and create train and test set with caret

trainIndex <- createDataPartition(y = bats_features$Label, p = 0.7, list = FALSE)
bats_featuresTrain <- bats_features[trainIndex,]
bats_featuresTest <- bats_features[-trainIndex,]
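When the outcome is a factor, createDataPartition samples within each class so that the class balance is preserved in both splits. The base R sketch below illustrates that idea on a made-up label vector. It is not caret's code, and the seed and the 10-per-class toy labels are assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable

# Made-up labels: 10 sequences per class
labels <- factor(rep(c("Positive", "Negative"), each = 10))

# Within each class, sample 70% of that class's row indices
train_idx <- unlist(
  lapply(split(seq_along(labels), labels), function(idx) {
    idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
  }),
  use.names = FALSE
)

train_labels <- labels[train_idx]
test_labels  <- labels[-train_idx]

# Class counts in each split
table(train_labels)
table(test_labels)
```

Each class contributes 7 rows to the training split and 3 to the test split, so neither split is dominated by one label.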

Define the resampling method with caret: repeated cross-validation (10 folds, repeated 3 times) with class probability calculation enabled

trctrl_prob <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                            classProbs = TRUE)

Train model using a support vector machine with radial kernel with caret. Note: Other classification models are supported too. For example, to use a random forest model in caret, method could be changed from “svmRadial” to “ranger”.

my_bat_svm_model <- train(Label ~ .,
                          data = bats_featuresTrain[,-1], # excluding seq_name column
                          method = "svmRadial",
                          trControl = trctrl_prob,
                          preProcess = c("center", "scale"))

Test model to get an indication of how well the model performs on test data with caret

my_bat_pred <- predict(my_bat_svm_model, bats_featuresTest)
cm <- confusionMatrix(my_bat_pred, bats_featuresTest$Label, positive = "Positive")

Subset from cm$byClass

Balanced Accuracy	Precision	Recall	F1
0.98	0.96	1	0.98
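To make the four metrics concrete, the sketch below computes them from raw confusion counts using their standard definitions. The counts are hypothetical and are not taken from the vignette's run, whose confusion matrix is not shown here. They were simply chosen so that the results round to the same values as the table.

```r
# Hypothetical confusion counts, with "Positive" as the positive class
tp <- 23  # AMPs predicted as AMPs
fn <- 0   # AMPs predicted as non-AMPs
fp <- 1   # non-AMPs predicted as AMPs
tn <- 22  # non-AMPs predicted as non-AMPs

precision   <- tp / (tp + fp)
recall      <- tp / (tp + fn)   # also called sensitivity
specificity <- tn / (tn + fp)
f1          <- 2 * precision * recall / (precision + recall)
balanced_accuracy <- (recall + specificity) / 2

metrics <- round(c("Balanced Accuracy" = balanced_accuracy,
                   "Precision" = precision,
                   "Recall" = recall,
                   "F1" = f1), 2)
metrics
```

With these counts, a single false positive is enough to lower precision, F1 and balanced accuracy below 1 while recall stays at 1.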

Convert the bat feature test data to the original FASTA type format containing just the sequence name and sequence as this is the required input data for ampir

bat_test_set <- bats[bats$seq_name %in% bats_featuresTest$seq_name,][,-3]
rownames(bat_test_set) <- NULL

Use the trained bat model in ampir’s predict_amps function on the bat test set

my_bat_AMPs <- predict_amps(bat_test_set, min_len = 5, model = my_bat_svm_model)

my_bat_AMPs sample

	seq_name	seq_aa	prob_AMP
1	G1PHN9…	MKIYYYLLHF…	0.996
2	G1P1T7…	MKALLTLGLL…	0.984
3	G1QAP4…	MKALLTLGLL…	0.981
44	Q95727…	MTNIRKTHPL…	0.018
45	P14391…	VHLSGEEKAA…	0.025
46	Q330H3…	MNPYIFFIIM…	0.024
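The prob_AMP column can be turned into class calls by applying a cutoff. The sketch below uses the probabilities from the sample above with placeholder sequence names. The 0.5 threshold and the predicted column are assumptions for illustration, not something predict_amps adds itself.

```r
# Toy prediction table: probabilities copied from the sample above,
# sequence names are placeholders
preds <- data.frame(
  seq_name = paste0("seq_", 1:6),
  prob_AMP = c(0.996, 0.984, 0.981, 0.018, 0.025, 0.024),
  stringsAsFactors = FALSE
)

# Assumed cutoff; raise it for fewer false positives, lower it for fewer misses
threshold <- 0.5
preds$predicted <- ifelse(preds$prob_AMP >= threshold, "Positive", "Negative")

# Rank candidates from most to least likely AMP
preds <- preds[order(preds$prob_AMP, decreasing = TRUE), ]
```

In this sample the probabilities are clearly separated, so any cutoff between 0.025 and 0.981 gives the same three positive calls.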

Legana Fingerhut

2021-06-29
