The sbo
package provides utilities for building and evaluating next-word prediction functions based on Stupid Back-off N-gram models in R. In this vignette, I illustrate the main features of sbo
, including in particular:
sbo
In this and the next section we will employ the sbo::twitter_train
example dataset1, each entry of which consists in a single tweet in English, e.g.:
head(sbo::twitter_train, 3)
#> [1] "FOX NEWS TODAY: \"Energizer Bunny Arrested, Charged With Battery!\""
#> [2] "Today is a good day! Excluding today, I only have 9 days of class left. Awww yeaaahh"
#> [3] "I will be at broadway bar tonight"
Given the training corpus, the typical workflow for building a text-predictor consists of the following steps:
Also, implicit in the previous steps is the choice of a model dictionary, which can be done a priori, or during the training process.
All these steps (including building a dictionary) can be performed in sbo
as follows:
p <- sbo_predictor(object = sbo::twitter_train, # preloaded example dataset
N = 3, # Train a 3-gram model
dict = target ~ 0.75, # cover 75% of training corpus
.preprocess = sbo::preprocess, # Preprocessing transformation
EOS = ".?!:;", # End-Of-Sentence tokens
lambda = 0.4, # Back-off penalization in SBO algorithm
L = 3L, # Number of predictions for input
filtered = "<UNK>" # Exclude the <UNK> token from predictions
)
This creates an object p
of class sbo_predictor
, which can be used to generate text predictions as follows:
Let us comment the various arguments in the previous call to sbo_predictor()
:
object
. The training corpus used to train the text predictor.N
. The order \(N\) of the \(N\)-gram model.dict
. This argument specifies the model dictionary. In this case, we build our dictionary directly from the training corpus, using the most frequent words which cover a fraction target = 0.75
of the corpus. In alternative, one can prune the corpus word set to a fixed vocabulary size, or use a predefined dictionary..preprocess
. The function used in corpus preprocessing. Here we leverage on the minimal sbo::preprocess
, but this can be in principle any function taking a character input and returning a character output.EOS
. This argument lists, in a single string, all End-Of-Sentence characters, employed for sentence tokenization (moreover, text belonging to different entries of the preprocessed input vector are understood to belong to different sentences).lambda
. The penalization \(\lambda\) employed in the Stupid Back-Off algorithm. Here \(\lambda = 0.4\) is the benchmark value given in the original work by Brants et al..L
. Number of predictions to return for any given input. This needs to be specified a priori, because the object p
does not store the full \(k\)-gram frequency tables, but only a fixed number (i.e. \(L\)) of predictions for any \(k\)-gram prefix (\(k\leq N -1\)) observed in the training corpus. This is commented more at length below.filtered
. Words to exclude from next-word predictions. Here the reserved string <UNK>
excludes the Unknown-Word token (similarly, one can use the reserved string <EOS>
to exclude End-Of-Sentence tokens).Before proceeding, let us take a little break and introduce the most important function of sbo
:
The example in the previous Section illustrates how to use a text predictor in interactive mode. If the training process is computationally expensive, one may want to save the text predictor object (i.e. p
in the example above) out of physical memory (e.g. through save()
). For this purpose2, sbo
provides the class sbo_predtable
(“Stupid Back-Off prediction tables”).
These objects are a “raw” equivalent of a text predictor, and can be created with sbo_predtable()
, which has the same user interface of sbo_predictor()
. For example, the definition of p
above would be replaced by:
t <- sbo_predtable(object = sbo::twitter_train, # preloaded example dataset
N = 3, # Train a 3-gram model
dict = target ~ 0.75, # cover 75% of training corpus
.preprocess = sbo::preprocess, # Preprocessing transformation
EOS = ".?!:;", # End-Of-Sentence tokens
lambda = 0.4, # Back-off penalization in SBO algorithm
L = 3L, # Number of predictions for input
filtered = "<UNK>" # Exclude the <UNK> token from predictions
)
From t
, one can rapidly recover the corrisponding text predictor, using sbo_predictor()
3:
Objects of class sbo_predtable
can be safely stored out of memory and loaded in other R sessions:
sbo_predictor
and sbo_predtable
class objectssbo_predictor
and sbo_predtable
objects directly store next-word predictions for each \(k\)-gram prefix (\(k=1,\,2,\dots,\,N-1\)) observed in the training corpus, allowing for memory compression and fast query.
Both objects store, through attributes, information about the training process. This can be conveniently obtained through the corresponding summary()
methods, e.g.
summary(p)
#> Next-word text predictor from Stupid Back-off N-gram model
#>
#> Order (N): 3
#> Dictionary size: 1008 words
#> Back-off penalization (lambda): 0.4
#> Maximum number of predictions (L): 3
#>
#> See ?predict.sbo_predictor for usage help.
sbo_predictor
and sbo_predtable
objectsHere are some details on the current (still under development) implementation of sbo_predictor
and sbo_predtable
objects. For clarity, I will refer to the sbo_predtable
instance t
created above:
head(t[[3]])
#> w1 w2 prediction1 prediction2 prediction3
#> [1,] 4 749 1 68 33
#> [2,] 472 62 1 4 28
#> [3,] 98 288 1 4 5
#> [4,] 428 262 1 13 4
#> [5,] 193 17 1 415 1009
#> [6,] 333 738 1 1009 16
The first two columns correspond to the word codes4 of \(2\)-gram prefixes observed in the training corpus, and the other columns code the top \(L=3\) predictions for these \(2\)-grams. When a \(2\)-gram \(w_1 w_2\) is given as input for text prediction, it is first looked for in the prefix columns of t[[3]]
. If not found, \(w_2\) is looked for in the prefix column of t[[2]]
. If this also fails, the prediction is performed without any prefix, that is, we simply predict the L
most frequent words, stored in:
This Section leverages, for convenience, on:
Once we have built our next-word predictor, we may want to directly test its predictions on an independent corpus. For this purpose, sbo
offers the function eval_sbo_predictor()
, which samples \(N\)-grams from a test corpus and compares the predictions from the \((N-1)\)-gram prefixes with the true completions.
More in detail, given a character vector test
, where each entry of test
represents a single document, eval_sbo_predictor()
performs the following steps:
test
, i.e. from each document.For a reasonable estimate of prediction accuracy, the various entries of test
should be independent documents, e.g. single tweets as in the sbo::twitter_test
example dataset, which we use below to test the previously trained predictor p
.
set.seed(840)
(evaluation <- eval_sbo_predictor(p, test = sbo::twitter_test))
#> # A tibble: 10,000 x 4
#> input true preds[,1] [,2] [,3] correct
#> <chr> <chr> <chr> <chr> <chr> <lgl>
#> 1 "come back" to to <EOS> and TRUE
#> 2 "amazing should" have post be have TRUE
#> 3 "u at" <EOS> <EOS> the next TRUE
#> 4 "they'll be" respected a with in FALSE
#> 5 " everyone" follow is should loves FALSE
#> 6 "luck this" weekend weekend season is TRUE
#> 7 "atlanta weather" is <EOS> in to FALSE
#> 8 "iphone 4s" just <EOS> the me FALSE
#> 9 "is it" working <EOS> that just FALSE
#> 10 "did you" listen know get see FALSE
#> # … with 9,990 more rows
As it is seen, eval_sbo_predictor()
returns a tibble containing the input \((N-1)\)-grams, the true completions, the predicted completions and a column indicating whether one of the predictions were correct or not.
We can estimate predictive accuracy as follows (the uncertainty in the estimate is approximated by the binomial formula \(\sigma = \sqrt{\frac{p(1-p)}{M}}\), where \(M\) is the number of trials):
evaluation %>% summarise(accuracy = sum(correct)/n(),
uncertainty = sqrt(accuracy * (1 - accuracy) / n())
)
#> # A tibble: 1 x 2
#> accuracy uncertainty
#> <dbl> <dbl>
#> 1 0.316 0.00465
We may want to exclude from the test \(N\)-grams ending by the End-Of-Sentence token (here represented by "<EOS>"
):
evaluation %>% # Accuracy for in-sentence predictions
filter(true != "<EOS>") %>%
summarise(accuracy = sum(correct) / n(),
uncertainty = sqrt(accuracy * (1 - accuracy) / n())
)
#> # A tibble: 1 x 2
#> accuracy uncertainty
#> <dbl> <dbl>
#> 1 0.184 0.00427
In trying to reduce the size (in physical memory) of your text-predictor, it might be useful to prune the model dictionary. The following command plots an histogram of the distribution of correct predictions in our test.
if (require(ggplot2)) {
evaluation %>%
filter(correct, true != "<EOS>") %>%
select(true) %>%
transmute(rank = match(true, table = attr(p, "dict"))) %>%
ggplot(aes(x = rank)) + geom_histogram(binwidth = 25)
}
#> Loading required package: ggplot2
Apparently, the large majority of correct predictions come from the first ~ 300 words of the dictionary, so that if we prune the dictionary excluding words with rank greater than, e.g., 500 we can reduce the size of our model without seriously affecting its prediction accuracy.
In this Section, I briefly survey other functionalities provided by sbo
. See the corresponding help pages for more details.
Dictionaries can be directly built using sbo_dictionary()
. For example, the command:
dict <- sbo_dictionary(corpus = sbo::twitter_train,
max_size = 100,
target = 0.5,
.preprocess = sbo::preprocess,
EOS = ".?!:;")
constructs a dictionary applying the most restrictive of the two constraint max_size = 100
or target = 0.5
, where target
denotes the coverage fraction of corpus
. The arguments .preprocess
and EOS
work as described above.
The output is an object of class sbo_dictionary
, which stores, along with a vector of words (sorted by decreasing frequency), also the original values of .preprocess
and EOS
.
The word coverage fraction of a dictionary can be computed through the generic function word_coverage()
. This accepts as argument any object containing a dictionary, along with a preprocessing function and a list of End-Of-Sentence characters. This includes all sbo
main classes: sbo_dictionary
, sbo_kgram_freqs
, sbo_predtable
and sbo_predictor
. For instance:
(c <- word_coverage(p, sbo::twitter_train))
#> A 'word_coverage' object.
#>
#> See summary() for more details.
Computes the coverage fraction of the dictionary used by the predictor p
, on the original training corpus.
This can be conveniently summarized with:
summary(c)
#> Word coverage fraction
#>
#> Dictionary length: 1008
#> Coverage fraction (w/ EOS): 78.2 %
#> Coverage fraction (w/o EOS): 75 %
or visualized with:
\(k\)-gram frequency tables form the building blocks of any \(N\)-gram based language model. The function sbo::kgram_freqs()
extracts frequency tables from a training corpus and stores them into a class sbo_kgram_freq
object. For example:
f <- kgram_freqs(corpus = sbo::twitter_train,
N = 3,
dict = target ~ 0.75,
.preprocess = sbo::preprocess,
EOS = ".?!:;"
)
stores \(k\)-gram frequency tables into f
. This object can itself be used for text prediction:
predict(f, "i love")
#> # A tibble: 1,009 x 2
#> completion probability
#> <chr> <dbl>
#> 1 you 0.222
#> 2 it 0.0573
#> 3 my 0.0549
#> 4 the 0.0524
#> 5 your 0.0312
#> 6 this 0.0312
#> 7 that 0.0287
#> 8 how 0.0249
#> 9 u 0.0175
#> 10 when 0.0150
#> # … with 999 more rows
The output contains the full language model information, i.e. the probabilities5 for each possible word completion. Compare this with:
The extra information contained in f
comes at a price. In fact, the advantage provided by sbo_predictor
/sbo_predtable
objects for simple text prediction is two-fold:
size_in_MB <- function(x) format(utils::object.size(x), units = "MB")
sapply(list(sbo_predtable = t, kgram_freqs = f), size_in_MB)
#> sbo_predtable kgram_freqs
#> "1.4 Mb" "4.7 Mb"
Usually text corpora require preprocessing before word and \(k\)-gram tokenization can take place. The .preprocess
argument of kgram_freqs()
allows for an user specified preprocessing function. The default is the minimal sbo::preprocess()
, and the optimized kgram_freqs_fast(erase = x, EOS = y)
is equivalent to kgram_freqs(.preprocess = sbo::preprocess(erase = x, EOS = y))
(but more efficient).
Tokenization at the sentence level is required to obtain terminal \(k\)-grams (i.e. \(k\)-grams containing Begin-Of-Sentence or End-Of-Sentence tokens). In the training process, sentence tokenization takes place after text preprocessing.
End-Of-Sentence (single character) tokens are specified by the EOS
argument of kgram_freqs()
and kgram_freqs_fast()
; empty sentences are always skipped. Also, if the input vector text
has length(text) > 1
, the various elements of text
belong to different entries.
The process of sentence tokenization can also be performed directly, through sbo::tokenize_sentences()
, but this is not required for use with kgram_freqs*()
.
This is a small samples of \(5·10^4\) entries from the “Tweets” Swiftkey dataset fully available here.↩︎
At the present stage of development, this cannot be done directly for the sbo_predictor
object created above. Technically, this is because sbo_predictor
objects are external pointers to a convenient C++ class, optimized for fast predict()
ions. Such a class instance exists only during a single R session.↩︎
As this example makes clear, both sbo_predictor()
and sbo_predtable()
are S3 generics. The corresponding available methods are listed under ?sbo_predictions
.↩︎
Coded with respect to the rank sorted dictionary dict
(the codes 0
, length(dict) + 1
and length(dict) + 2
correspond to the Begin-Of-Sentence, End-Of-Sentence and Unknown-Word tokens, respectively).↩︎
More precisely, these are the normalized (to unity) scores resulting from the Stupid Back-Off smoothing method.↩︎