Model building

CRFsuite

This R package wraps the CRFsuite C/C++ library (https://github.com/chokkan/crfsuite), allowing the following:

  • Fit a Conditional Random Field model (1st-order linear-chain Markov)
  • Use the model to get predictions on new data
  • The focus of the implementation is on Natural Language Processing: this R package allows you to easily build and apply models for named entity recognition, text chunking, part-of-speech tagging, intent recognition or classification of any category you have in mind.

If you are unfamiliar with Conditional Random Field (CRF) models, this excellent tutorial is a good starting point: https://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf

Data format

In order to build a CRF model, you need to have

  1. sequences of labels (the hidden state Y) and
  2. attributes of the observations corresponding to the labels (X).

Regarding the label sequence:

Generally the labels follow the IOB type of scheme, which looks something like: B-ORG, I-ORG, B-YOUROWNLABEL, I-YOUROWNLABEL or O, indicating the beginning of a certain category (B-), the continuation of a certain category (I-) or being outside any category (O).

  • Hence the text I went to the New York City District on holidays would e.g. be labelled as O, O, O, O, B-LOCATION, I-LOCATION, I-LOCATION, I-LOCATION, O, O
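
For illustration, the example sentence above expressed as a labelled sequence in plain R (the columns token and label are only for display purposes):

tokens <- c("I", "went", "to", "the", "New", "York", "City", "District", "on", "holidays")
labels <- c("O", "O", "O", "O", "B-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION", "O", "O")
data.frame(token = tokens, label = labels)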

Regarding the attributes of the observations:

The attributes of the observations are mostly something like the term itself, the neighbouring terms, the part of speech, the neighbouring parts of speech or any specific feature you can extract that is relevant to your business domain (e.g. the number of digits in the token, how far it is from the start or the end of the document, whether the token is capitalised, whether it contains an ampersand, …).
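
Such handcrafted attributes can be computed with base R. A minimal sketch (the attribute names are illustrative, not something the package requires):

tokens <- c("I", "went", "to", "the", "New", "York")
data.frame(token          = tokens,
           is_capitalised = grepl("^[A-Z]", tokens),
           n_digits       = nchar(gsub("[^0-9]", "", tokens)),
           has_ampersand  = grepl("&", tokens, fixed = TRUE))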

Example data

As an example, let’s get some data in Dutch for doing Named Entity Recognition, which was distributed as part of the CoNLL-2002 shared task challenge. This dataset contains 1 row per term and provides entity labels as well as the part-of-speech tag for each term.

library(crfsuite)
x <- ner_download_modeldata("conll2002-nl")
subset(x, doc_id == 100)
          data doc_id sentence_id         token  pos  label
  1: ned.train    100        8882            EK Pron B-MISC
  2: ned.train    100        8882      Magazine    N I-MISC
  3: ned.train    100        8882        Canvas    N  B-ORG
  4: ned.train    100        8882         23.45  Num      O
  5: ned.train    100        8883  Tourjournaal    N B-MISC
 ---                                                       
343: ned.train    100        8916 gepresenteerd    V      O
344: ned.train    100        8916          door Prep      O
345: ned.train    100        8916          Stef    N  B-PER
346: ned.train    100        8916      Wijnants    N  I-PER
347: ned.train    100        8916             . Punc      O

Attributes

As basic feature enrichment, we add the part-of-speech tag of the preceding and the next term, and do the same for the token itself; we will use these later when building the model. The R package data.table has a convenient shift function for this.

library(data.table)
x <- as.data.table(x)
x <- x[, pos_previous   := shift(pos, n = 1, type = "lag"),    by = list(doc_id)]
x <- x[, pos_next       := shift(pos, n = 1, type = "lead"),   by = list(doc_id)]
x <- x[, token_previous := shift(token, n = 1, type = "lag"),  by = list(doc_id)]
x <- x[, token_next     := shift(token, n = 1, type = "lead"), by = list(doc_id)]

Note that CRFsuite handles all attributes equivalently. In order to distinguish between the columns, we need to prepend the column name logic to each attribute, similar to what is shown at http://www.chokkan.org/software/crfsuite/tutorial.html. This is done using the custom txt_sprintf function, which is similar to sprintf but handles NA values gracefully.
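
As a small illustration of that NA handling (output as expected from the <NA> entries in the table below: an NA input stays NA instead of becoming the literal string "NA"):

txt_sprintf("pos[w-1]=%s", c("Pron", NA))
[1] "pos[w-1]=Pron" NA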

x <- x[, pos_previous   := txt_sprintf("pos[w-1]=%s", pos_previous),     by = list(doc_id)]
x <- x[, pos_next       := txt_sprintf("pos[w+1]=%s", pos_next),         by = list(doc_id)]
x <- x[, token_previous := txt_sprintf("token[w-1]=%s", token_previous), by = list(doc_id)]
x <- x[, token_next     := txt_sprintf("token[w+1]=%s", token_next),     by = list(doc_id)]
subset(x, doc_id == 100, select = c("doc_id", "token", "token_previous", "token_next"))
     doc_id         token           token_previous              token_next
  1:    100            EK                     <NA>     token[w+1]=Magazine
  2:    100      Magazine            token[w-1]=EK       token[w+1]=Canvas
  3:    100        Canvas      token[w-1]=Magazine        token[w+1]=23.45
  4:    100         23.45        token[w-1]=Canvas token[w+1]=Tourjournaal
  5:    100  Tourjournaal         token[w-1]=23.45       token[w+1]=Canvas
 ---                                                                      
343:    100 gepresenteerd             token[w-1]=,         token[w+1]=door
344:    100          door token[w-1]=gepresenteerd         token[w+1]=Stef
345:    100          Stef          token[w-1]=door     token[w+1]=Wijnants
346:    100      Wijnants          token[w-1]=Stef            token[w+1]=.
347:    100             .      token[w-1]=Wijnants                    <NA>
x <- as.data.frame(x)

Model

Train your own CRF model

Once you have data which are tagged with your own categories, you can build a CRF model. Using the data from above, we split it into a training and a test dataset.

crf_train <- subset(x, data == "ned.train")
crf_test  <- subset(x, data == "testa")

And start building your model.

  • By default, the CRF model is trained using L-BFGS with L1/L2 regularization, but other training methods are also available, namely: SGD with L2-regularization, Averaged Perceptron, Passive Aggressive or Adaptive Regularization of Weights (AROW).
  • In the below example we use the default parameters and decrease the iterations a bit to have a model ready within 30 seconds.
  • Provide the labels with the categories (y) and the attributes of the observations (x), and indicate the sequence group (in this case we take the document identifier).
  • The model will be saved to disk in the file tagger.crfsuite.
model <- crf(y = crf_train$label, 
             x = crf_train[, c("pos", "pos_previous", "pos_next", 
                               "token", "token_previous", "token_next")], 
             group = crf_train$doc_id, 
             method = "lbfgs", file = "tagger.crfsuite",
             options = list(max_iterations = 25, feature.minfreq = 5, c1 = 0, c2 = 1)) 
model
Conditional Random Field saved at C:\Users\jwijf\AppData\Local\Temp\RtmpqOrzeH\Rbuild355c11b5666e\crfsuite\vignettes\tagger.crfsuite
  size of the model in Mb: 0.79
  number of categories: 9
  category labels: O, B-ORG, B-MISC, B-PER, I-PER, B-LOC, I-MISC, I-ORG, I-LOC
stats <- summary(model)
Summary statistics of last iteration: 
Loss: 37014.993192
Feature norm: 30.527576
Error norm: 2230.388754
Active features: 11822
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.229

Dumping summary of the model to file C:\Users\jwijf\AppData\Local\Temp\RtmpIbhgnl\crfsuite_5a7c455c3212.txt
plot(stats$iterations$loss, pch = 20, type = "b", 
     main = "Loss evolution", xlab = "Iteration", ylab = "Loss")

  • You can inspect the coefficients (label to label transitions and attribute/label weights) of the model
weights     <- coefficients(model, encoding = "UTF-8")
head(weights$transitions)
    from     to   weight
1  B-PER  I-PER 7.865476
2  B-ORG  I-ORG 6.420383
3 B-MISC I-MISC 6.356296
4      O B-MISC 6.016069
5      O      O 5.983449
6 I-MISC I-MISC 5.877581
head(subset(weights$states, label %in% "B-LOC"), n = 10)
         attribute label   weight
11         Brussel B-LOC 2.297898
19   token[w-1]=in B-LOC 1.820405
21    token[w-1]=) B-LOC 1.756937
24          België B-LOC 1.705850
32             Fra B-LOC 1.492787
37               N B-LOC 1.409908
43   pos[w-1]=Punc B-LOC 1.312599
54             Ita B-LOC 1.104971
62       Antwerpen B-LOC 1.078814
65 token[w-1]=naar B-LOC 1.069607

Use the model

You can use the model to get predictions of the named entities / chunks / categories you have trained. Below this is done on the holdout data. Provide the model, your data with the attributes and indicate the group the attributes belong to.

scores <- predict(model, 
                  newdata = crf_test[, c("pos", "pos_previous", "pos_next", 
                                         "token", "token_previous", "token_next")], 
                  group = crf_test$doc_id)
crf_test$entity <- scores$label
table(crf_test$entity, crf_test$label)
        
         B-LOC B-MISC B-ORG B-PER I-LOC I-MISC I-ORG I-PER     O
  B-LOC    187      6    46    26     0      0     3     0    95
  B-MISC     1     44    13     4     0      0     0     0    29
  B-ORG      2     17   109     6     0      5     1     0    20
  B-PER     18     49    83   289     4      7    27    13   111
  I-LOC      4      1     2     1    14      2     7     2     8
  I-MISC     1      4     7     5     0     49    22    11    50
  I-ORG      3      8    19    19     3     30   130    23    56
  I-PER      5     10    16    42    19     28   107   319    87
  O        258    609   391   311    24     94    99    55 33517

Create training data

In order to facilitate creating training data on your own data, with your own categories, a Shiny app is included in this R package. In short, this app allows you to:

  • Upload a data.frame with the text you want to manually label in chunks. This data.frame should contain the fields doc_id and text and should be uploaded in .rds file format (a sketch of preparing such a file is shown after this list).
  • Indicate the categories you want to use in your model
  • Start annotating by selecting the chunk of text which belongs to the categories you defined
  • The annotated data is saved to disk in .rds format and can be merged with a tokenised dataset. See the example in ?merge.chunks
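
A minimal sketch of preparing such an upload file (the file name to_annotate.rds and the example texts are merely illustrative):

texts <- data.frame(doc_id = c(1, 2),
                    text   = c("Fijn, ruim appartement.",
                               "Het appartement ligt midden in de trendy modewijk."),
                    stringsAsFactors = FALSE)
saveRDS(texts, file = "to_annotate.rds")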

To start the app, make sure you have the following packages installed.

install.packages("shiny")
install.packages("flexdashboard")
install.packages("DT")
install.packages("writexl")

And run the app with

rmarkdown::run(file = system.file(package = "crfsuite", "app", "annotation.Rmd"))

The app was developed with shiny 1.4.0.2, flexdashboard 0.5.1.1, DT 0.13, writexl 1.2 and rmarkdown 1.6

CRFSuite Shiny App

Improve the model

When building the model, you need to

  • tune the parameters of the training algorithm (L-BFGS, SGD, Averaged Perceptron, Passive Aggressive, AROW)
  • provide good observation attributes which are specific to your domain

Model goodness of fit

To see which parameters each training algorithm accepts, look e.g. at

crf_options("lbfgs")
crf_options("l2sgd")

If you train the model with different algorithm parameters, you are probably interested in the Precision / Recall / F1 statistics to compare the models alongside their hyperparameters. You can easily get these with the caret R package.

library(caret)
overview <- confusionMatrix(crf_test$entity, crf_test$label, mode = "prec_recall")
overview$overall
overview$byClass[, c("Precision", "Recall", "F1")]

Example with feature engineering

To obtain better models, you need to do feature engineering specific to your business domain.

  • In CRF models, the labels (y) obey the Markov property with respect to the graph, conditional on the CRF attributes (x). In order to construct attributes (x) of the observations, one looks at neighbouring elements of the tokens (for example 2 words before/after the token). The following example includes features of the neighbouring tokens (namely the neighbouring terms and the neighbouring parts of speech tags).
  • Please visit the udpipe R package (https://CRAN.R-project.org/package=udpipe) for more information on how to extract e.g. parts of speech tags or other language features of tokens if you want to get richer NLP features of the word neighbours.

The example below starts from scratch, assuming that you have plain text and have annotated some chunks using the app in this package. The manually annotated dataset is shown below.

library(crfsuite)
library(udpipe)
library(data.table)
data(airbnb_chunks, package = "crfsuite")
str(airbnb_chunks)
Classes 'chunkrange' and 'data.frame':  1091 obs. of  8 variables:
 $ annotation_time: POSIXct, format: "2018-09-02 22:47:53" "2018-09-02 22:47:59" "2018-09-02 22:48:07" "2018-09-02 22:48:15" ...
 $ doc_id         : int  26261897 26261897 26261897 11412934 11412934 11412934 11412934 19782360 19782360 19782360 ...
 $ text           : chr  "Fijn, ruim appartement. Locatie vlakbij Manneken Pis en de Grote Markt, dus uitstekend gelegen.\nBadkamer was s"| __truncated__ "Fijn, ruim appartement. Locatie vlakbij Manneken Pis en de Grote Markt, dus uitstekend gelegen.\nBadkamer was s"| __truncated__ "Fijn, ruim appartement. Locatie vlakbij Manneken Pis en de Grote Markt, dus uitstekend gelegen.\nBadkamer was s"| __truncated__ "Het appartement van Salvatore ligt midden in de trendy modewijk. Het is een ideale locatie om Brussel te verken"| __truncated__ ...
 $ start          : num  41 59 156 21 95 192 424 100 220 292 ...
 $ end            : num  52 70 160 29 101 200 437 104 224 296 ...
 $ chunk_id       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ chunk_entity   : chr  "LOCATION" "LOCATION" "PERSON" "PERSON" ...
 $ chunk          : chr  "Manneken Pis" " Grote Markt" "Aline" "Salvatore" ...

We want to build a classifier for the following categories:

table(airbnb_chunks$chunk_entity)

DISTANCE LOCATION   PERSON 
     111      419      464 

In order to build the training dataset, we need to have data at the token level. In the example below, this is done using the udpipe R package (https://CRAN.R-project.org/package=udpipe).

## Annotate text data with udpipe (version >= 0.7)
udmodel <- udpipe_download_model("dutch")
udmodel <- udpipe_load_model(udmodel$file_model)
airbnb_tokens <- unique(airbnb_chunks[, c("doc_id", "text")])
airbnb_tokens <- udpipe(x = airbnb_tokens, object = udmodel)
str(airbnb_tokens)
'data.frame':   29092 obs. of  17 variables:
 $ doc_id       : chr  "26261897" "26261897" "26261897" "26261897" ...
 $ paragraph_id : int  1 1 1 1 1 1 1 1 1 1 ...
 $ sentence_id  : int  1 1 1 1 1 2 2 2 2 2 ...
 $ sentence     : chr  "Fijn, ruim appartement." "Fijn, ruim appartement." "Fijn, ruim appartement." "Fijn, ruim appartement." ...
 $ start        : int  1 5 7 12 23 25 33 41 50 54 ...
 $ end          : int  4 5 10 22 23 31 39 48 52 55 ...
 $ term_id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ token_id     : chr  "1" "2" "3" "4" ...
 $ token        : chr  "Fijn" "," "ruim" "appartement" ...
 $ lemma        : chr  "fijn" "," "ruim" "appartement" ...
 $ upos         : chr  "PROPN" "PUNCT" "ADJ" "VERB" ...
 $ xpos         : chr  "N|eigen|ev|basis|zijd|stan" "LET" "ADJ|vrij|basis|zonder" "WW|vd|vrij|zonder" ...
 $ feats        : chr  "Gender=Com|Number=Sing" NA "Degree=Pos" "VerbForm=Part" ...
 $ head_token_id: chr  "0" "4" "4" "1" ...
 $ dep_rel      : chr  "root" "punct" "advmod" "parataxis" ...
 $ deps         : chr  NA NA NA NA ...
 $ misc         : chr  "SpaceAfter=No" NA NA "SpaceAfter=No" ...

Once you have the data with 1 row per doc_id/token, you can enrich it with the chunk entity. Next, by using the function crf_cbind_attributes, we enrich the training data by adding relevant attributes of the words in the neighbourhood of each word. We also add a basic column indicating whether the term is at the beginning or the end of the sequence (bos/eos). Based on that dataset, a model can be built.

x <- merge(airbnb_chunks, airbnb_tokens)
table(x$chunk_entity)

B-DISTANCE B-LOCATION   B-PERSON I-DISTANCE I-LOCATION   I-PERSON          O 
       110        411        451        367        214         39      27500 
## Indicate beginning of sequence and end of sequence and sequence position
x <- as.data.table(x)
x <- x[, bos := sprintf("BOS+%s", (1:.N)-1), by = list(doc_id)]
x <- x[, eos := sprintf("EOS-%s", (.N:1)-1), by = list(doc_id)]
x <- as.data.frame(x)
## Add preceding and next tokens and parts of speech tags
x <- crf_cbind_attributes(x, terms = c("lemma", "upos"), 
                          by = c("doc_id", "sentence_id"), ngram_max = 3, sep = "|")

attributes <- c("bos", "eos", grep("lemma|upos", colnames(x), value=TRUE))
model <- crf(y = x$chunk_entity, x = x[, attributes], 
             group = x$doc_id, 
             method = "lbfgs") 
scores <- predict(model, newdata = x[, attributes], group = x$doc_id)
barplot(table(scores$label[scores$label != "O"]), col = "royalblue", cex.names = 0.75)

Now it is up to you.

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be