1 Introduction

soundcorrs is a small package whose purpose is to help linguists analyse sound correspondences between languages. It does not attempt to draw any conclusions on its own; this responsibility is placed entirely on the user. soundcorrs merely automates and facilitates certain tasks, such as preparing the material part of the paper, looking for examples of specific correspondences, or applying series of sound changes, and, by making various functions available, suggests possible paths of analysis which may not be immediately obvious to the more traditional linguist.

There are two ways to access soundcorrs: the command-line interface (CLI), and the graphical user interface (GUI). Whether you use the R console, RStudio, or some other IDE, what you are working with is the CLI. It is by far the more flexible and powerful of the two, but also the more demanding. To switch to the GUI, simply run soundcorrsGUI(). As of the current version, the GUI provides access to the most important soundcorrs functions, but not to all of them. Enhancements are planned for future releases, but it is rather unlikely that the GUI will ever fully match the CLI.

In this vignette, references to the CLI are formatted like this, whereas references to the GUI are given in quotes ‘like so’. Except for the last one, each chapter begins with a general introduction, which is followed by a detailed description of how the given functionality can be accessed via the CLI and via the GUI.

The CLI parts assume that the reader has at least a passing familiarity with R and a basic understanding of statistics. Most problems can probably be read up on as they appear in the text, but it is nevertheless recommended to start by briefly acquainting oneself with R, e.g. by reading the first pages of Quick-R, R Tutorial, or another R primer. In particular, it is assumed that the reader knows how to access and understand the built-in documentation (run ?nameOfTheFunction), as not all of the arguments are discussed here.

Another topic that is highly recommended, as without it soundcorrs cannot be used to its full potential, is regular expressions. This is equally true for both CLI and GUI users. An accessible introduction can be found in R.D. Peng’s R Programming for Data Science, as well as in many other places on the Internet, and a handy cheat sheet has been made available by RStudio. Besides the regular expressions well known to programmers, soundcorrs also makes available regular expressions and metacharacters that originate from linguistics. Two flavours are available: traditional wild-cards, such as ‹V› for ‘any vowel’, and the more modern binary notation, such as ‹[+vowel]› for the same. More on those can be found in subsection Transcription.

Though a little dated by now, a less technical introduction to soundcorrs is also available in Stachowski K. 2020. Tools for Semi-Automatic Analysis of Sound Correspondences: The soundcorrs Package for R. Glottometrics 49. 66–86. If you use soundcorrs in your research, please cite this paper.

soundcorrs functions operate on pairs/triples/… of words which come from different languages. The discussion below will use L1 to refer to the first language in the dataset, L2 to the second, etc.

Naturally, all the examples in this vignette assume that soundcorrs is installed and loaded:

install.packages ("soundcorrs")
library (soundcorrs)
#> To start the graphic interface, run "soundcorrsGUI()"
#> (requires packages "shiny" and "shinyjqui").

If you have found a bug, have a remark about soundcorrs, or a wish for its future releases, please write to kamil.stachowski@gmail.com. If you use soundcorrs in your research, please cite it as Stachowski K. 2020. Tools for Semi-Automatic Analysis of Sound Correspondences: The soundcorrs Package for R. Glottometrics 49. 66–86.

2 Data preparation

soundcorrs uses three kinds of data: transcription, linguistic datasets, and sound changes. The first two are stored in .tsv files, i.e. as tab-separated tables in text files, which can be edited with a spreadsheet editor such as LibreOffice Calc or Google Sheets, or with a programming editor. Sound changes are functions, so they are stored as R code in text files.

Several sample datasets are available. In the GUI, they are loaded automatically when the GUI is started. From the CLI, they can be loaded using the loadSampleDataset() function, and the raw files can be accessed via the system.file() function; both are explained below in more detail.

Under BSD, Linux, and macOS, the recommended encoding is UTF-8. Unfortunately, it has been found to cause problems under Windows, so Windows users are advised not to use characters outside of the ASCII standard. Some issues can be fixed by converting from UTF-8 to UTF-8 (sic!) with iconv(), but others resist this and other treatments. It is hoped that future versions of soundcorrs will include a solution to this problem.
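
In code, the workaround mentioned above amounts to a single call. This is only a sketch; words stands for a hypothetical character vector read from one of the data files:

# re-encode from UTF-8 to UTF-8 (sic!);
#    'words' is a hypothetical character vector
words <- iconv (words, from="UTF-8", to="UTF-8")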

2.0.1 Transcription

Transcription is not strictly necessary for the very functioning of soundcorrs, but without it linguistic regular expressions (“wildcards”) could not be defined, and involving phonetics in the analysis would be more difficult. Transcription is stored in .tsv files with two or three columns:

  • GRAPHEME, which contains the graphemes. Characters used by R as metacharacters in regular expressions, i.e. . + * ^ $ ? | ( ) [ ] { }, are not allowed. Multigraphs are also advised against, as they could potentially lead to unexpected and incorrect results, especially in the case of metacharacters (“wildcards”).

  • VALUE, which contains a comma-separated list of the features of the given grapheme (without spaces). These are intended to be phonetic but do not necessarily have to be so.

  • META, which contains a regular expression covering all the graphemes the given grapheme is meant to represent. For regular graphemes, this is simply the grapheme itself. For a metacharacter (a “wildcard”), such as ‹C› for ‘any consonant’, this needs to be a listing of all the consonantal graphemes in the transcription file, formatted as a regular expression. If this column is missing, soundcorrs will generate it automatically based on the VALUE column. Beware, however, that in this process any grapheme whose value is a subset of the value of another grapheme will become a metacharacter for that other grapheme.

A small transcription file may look as follows:

GRAPHEME VALUE META
b cons,stop,lab,vd b
p cons,stop,lab,vl,mark1 p
f cons,fric,lab,vl,mark1 f
B cons,stop,lab [pb]
P mark1 [pf]
- NULL -

In the last row, ‹-› is what is called in this vignette a linguistic zero. This character does not represent any sound. It doesn’t merely not represent any sound, though; it actively represents no sound. Its use will be made clear in subsection Linguistic datasets below. Every transcription must include a linguistic zero.

The combination of the columns VALUE and META is what allows the use of traditional linguistic regular expressions. soundcorrs supports two types of such expressions: the traditional notation, and the more modern binary notation. The former are typically single characters, often upper case, that denote an entire class of sounds, such as ‹C› for ‘any consonant’. They can be defined either by listing all the characters they stand for in the META column (as a regular expression) or, since that column is optional, by giving them a VALUE that is a subset of the values of other graphemes. ‹B› and ‹P› in the sample transcription above are defined using both methods simultaneously. The binary notation, such as ‹[+cons,-stop]› for ‘any non-stop consonant’, does not need to be defined. It is only required that it be enclosed in square brackets, that the features be comma-separated without spaces, that each feature be prepended with a ‹+› or a ‹-›, and that the features be the same as those used in the transcription.
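
What a given wild-card or binary expression translates to can be checked with the expandMeta() function (it reappears in subsection Sound changes below). A minimal sketch, assuming the sample transcription above has been saved as ‘my-trans.tsv’ and loaded with read.transcription(), which is described in section Loading data below:

# translate custom notations into regular expressions R can understand
trans <- read.transcription ("my-trans.tsv")
expandMeta (trans, "P")               # the traditional wild-card, e.g. "[pf]"
expandMeta (trans, "[+lab,-stop]")    # binary notation built on the same features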

In the GUI, on the ‘Data’ page, click the ‘View’ button below the ‘Select the transcription…’ field on the right side of the top row to view one of the sample transcriptions included in soundcorrs.

2.0.2 Linguistic datasets

Like the transcription, linguistic datasets are also stored in .tsv files. Two formats are theoretically possible: the “long format” in which every word is given its own row, and the “wide format” in which one row holds a pair/triple/… of words. The former is certainly more convenient for manual alignment (below), but it is the latter that is used internally and required by soundcorrs. Functions long2wide() and wide2long() can be used to convert between the two formats.
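
A minimal sketch of the conversion, on a toy data frame (the column values are made up; the real “abc” example appears in section Loading data below):

# convert a toy “long format” data frame to the “wide format”;
#    'skip' names the columns shared by the whole pair, here ID
long <- data.frame (
    ID       = c(1,1),
    LANGUAGE = c("L1","L2"),
    ALIGNED  = c("a|b|c","a|b|-"))
wide <- long2wide (long, skip=c("ID"))
# 'wide' should now hold the columns ID, ALIGNED.L1, and ALIGNED.L2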

With the notable exception of sound changes, words need to be segmented for most tasks in soundcorrs, and all words in a pair/triple/… must have the same number of segments. The default segment separator is "|". If the words are not segmented, the function addSeparators() can be used to facilitate the process of manual segmentation and alignment (a sketch follows below). Tools for automatic alignment also exist (e.g. alineR, LingPy, PyAline), but it is recommended that their results be thoroughly checked by a human, if for no other reason than to allow the researcher to acquaint themselves with the material and its specifics. Alignment is what necessitates the inclusion of a linguistic zero in the transcription; the examples below should make this clear. Apart from the segmented and aligned form, each word must be assigned a language.

Hence, the two obligatory columns in the “long format” are:

  • ALIGNED which holds the segmented and aligned word, and

  • LANGUAGE which holds the name of the language.

In the “wide format”, similarly, a minimum of two columns is necessary, each holding words from a different language. The information about which column holds which language can then be encoded simply as column names (e.g. ‘LATIN’), or in the form of a suffix attached to the names (e.g. ‘ALIGNED.Latin’).
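
As for addSeparators(), mentioned above, a hedged sketch is all that is needed: it merely interleaves the separator between characters, and the alignment proper remains the user’s job:

# prepare raw words for manual segmentation and alignment
addSeparators (c("mūsica","music"), "|")
# expected: something like "m|ū|s|i|c|a" "m|u|s|i|c"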

A sample dataset in the “long format” might look as follows:

LANGUAGE WORD ALIGNED
Latin mūsica m|-|ū|s|i|k|a
English music m|j|ū|z|i|k|-
German Musik m|-|u|z|ī|k|-
Polish muzyka m|-|u|z|y|k|a
Latin prōvincia p|r|ō|v|i|n|s|i|a
English province p|r|ɒ|v|i|n|s|-|-
German Provinz p|r|o|v|i|n|c|-|-
Polish prowincja p|r|o|v|i|n|c|j|a

which might correspond to a dataset in the “wide format” that looks as follows:

WORD.LAT ALIGNED.LAT WORD.ENG ALIGNED.ENG WORD.GER ALIGNED.GER WORD.POL ALIGNED.POL
mūsica m|-|ū|s|i|k|a music m|j|ū|z|i|k|- Musik m|-|u|z|ī|k|- muzyka m|-|u|z|y|k|a
prōvincia p|r|ō|v|i|n|s|i|a province p|r|ɒ|v|i|n|s|-|- Provinz p|r|o|v|i|n|c|-|- prowincja p|r|o|v|i|n|c|j|a

In these examples, the segmentation is on a phoneme-to-phoneme basis but any other criterion could be used, depending on the needs of the analysis. It is perfectly acceptable to have more than one character in a segment, and it is likewise acceptable to have in one file multiple segmented columns, one designated as the ALIGNED column in one soundcorrs object, and another in the second. It is also possible, though such practice is not encouraged, to store the data from different languages in separate files, and encoded using different transcriptions. As of the current version, however, this is only achievable through the CLI.

It may sometimes be the case that one of the words in a pair/triple/… is missing but it is still desirable to include such an incomplete case in the dataset. The missing word should be written as NA, which is R’s symbol for ‘missing value’.

In the GUI, on the ‘Data’ page, click the ‘View’ button below the ‘Loaded datasets’ field on the right side of the middle row, to view one of the sample datasets included in soundcorrs.

2.0.3 Sound changes

Unlike transcriptions and linguistic datasets, sound changes in soundcorrs are functions, bits of code. (The distinction between data and code is in fact less than sharp in R, but this is a separate topic.) This gives the user much greater control, while remaining convenient, as simple sound changes can be translated into functions automatically.

A sound change function must take two arguments: x and meta, where x is a character string to which the change is to be applied, and meta is a piece of metadata that might need to be taken into account. The return value of a sound change function must be a vector of character strings, possibly of length 1, i.e. a single character string. The ability to output multiple strings, however, is important because it allows for the implementation of regressive changes (e.g. to account for both possible sources of a vowel that merged with another vowel).

Such functions may look as follows:

# change all a’s to e’s
function (x,meta)
    gsub ("a", "e", x)

# change a’s followed by j, to e’s – version 1
function (x, meta)
    gsub ("aj", "ej", x)

# change a’s followed by j, to e’s – version 2
function (x, meta)
    gsub ("a(?=j)", "e", x, perl=T)

# change a’s and ä’s to e’s
function (x, meta)
    gsub ("[aä]", "e", x)

# change a’s to e’s only in the northern dialect
function (x, meta)
    if (meta!="northern") x else gsub("a","e",x)

# change a in the last syllable to e
function (x, meta) {

    # find all the vowels
    #  this assumes that 'trans' is a transcription object,
    #    in which "V" is defined as a wild-card for 'all vowels'
    #  alternatively, a square bracket notation could be used here,
    #    similarly to the example above
    syllables <- gregexpr (expandMeta(trans,"V"), x)

    # find how many syllables x has
    last <- length (syllables[[1]])

    # replace a in the last one
    if (regmatches(x,syllables)[[1]][last] == "a")
        regmatches(x,syllables)[[1]][last] <- "e"

    # return the changed string
    return (x)

}
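
None of the examples above returns more than one string. A regressive change, by contrast, might look like the following sketch, in which e is assumed to continue both an earlier a and an earlier ä, so both candidate proto-forms are returned:

# change e’s back to a’s or ä’s – a regressive change
#    (simplified: all e’s in a word are assumed to share one source)
function (x, meta)
    unique (c (gsub("e","a",x), gsub("e","ä",x)))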

Simpler changes, i.e. ones that can be written in the form of a single regular expression like the first four examples above, need not be explicitly defined as functions. A sound change function is generally not expected to be used directly by the user; it is intended to be wrapped in a soundchange object using the soundchange() constructor function, which accepts both functions and character strings; the latter are converted into functions automatically. Such a string must contain exactly one ">" or "<", possibly surrounded by spaces.

Warning! When the META column is missing from the transcription and generated automatically, the alternatives are listed using the round- rather than the square-bracket notation, i.e. as (x|y) rather than [xy]. This is necessary because some graphemes may be longer than one character, but a side effect of this is that when a user-defined metacharacter (“wild-card”) is used in the ‘find’ part of the sound change string (the part before ">"), it adds a set of round brackets to that part of the string. In turn, this means that the backreferences in the ‘replace’ part of the sound change string must either be shifted accordingly, or the round brackets around metacharacters in the ‘find’ part must be omitted. For example, if intervocalic s is to be replaced with r, the sound change string should be "VsV > \\1r\\2" rather than *"(V)s(V) > \\1r\\2", because "V" itself is already translated to "(a|ä|e|…".
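
To make the caveat more tangible, a hedged sketch, assuming the sample ‘common’ transcription has been loaded as trans.com (as in section Loading data below) and that it covers the graphemes involved:

# intervocalic s > r; "V" expands to "(a|ä|e|…", and so it already
#    counts as the first pair of round brackets
sc.s2r <- soundchange ("VsV > \\1r\\2", "s>r", trans.com, "Intervocalic s becomes r.")
sc.s2r$fun ("casa", NULL)
# expected: "cara"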

The one instance when a sound change function is intended to be run directly by the user is during testing. It is highly recommended that each function be tested on a variety of examples, with the intent of finding the one case which makes it fail to produce the desired result.

Apart from the function itself, a soundchange object holds the name and a brief description of the change, as well as the transcription that was used to encode it. The transcription is necessary to translate the user’s custom regular expressions (such as "V" in the last example above; see subsection Transcription above for more details) into expressions that R can understand. A more detailed explanation of the soundchange() constructor can be found in subsection Sound changes below.

In the GUI, on the ‘Data’ page, click the ‘View’ button below the ‘Loaded changes’ field on the right side of the bottom row, to view one of the sample changes included in soundcorrs.

3 Loading data

Once the data are ready, they can be loaded either with the help of the CLI or the GUI. The former is more customizable and offers more possibilities, but the latter is less cumbersome.

3.0.1 Transcription

In the GUI, transcriptions can be loaded simply by clicking the ‘Browse’ button on the left side of the top row of the ‘Data’ page. Accepted are files in the .tsv format; they need to contain the columns GRAPHEME and VALUE, and may additionally contain the column META, as discussed in subsection Transcription above.

In the CLI, transcriptions are loaded using the read.transcription() function. It is a simple wrapper around the constructor function for the transcription class, which reads a data table and passes it to the constructor. It is, however, recommended to use read.transcription() rather than the constructor directly, because read.transcription() adds to the object the attribute file with the path to the original file. This path can later be used to make sure that a linguistic dataset or a sound change uses the correct transcription.

soundcorrs contains two sample transcription files: 1. a fragment of the traditional, ‘continental’ transcription: ‘trans-common.tsv’ and 2. a fragment of the IPA: ‘trans-ipa.tsv’. Both only cover the basics and are intended more as an illustration than anything else. In the GUI, they are loaded automatically, and can be previewed using the ‘View’ button. In the CLI, the paths to the raw files can be established through system.file("extdata", "nameOfTheFile", package="soundcorrs"), and the entire transcription objects can be loaded using loadSampleDataset() with either "trans-common" or "trans-ipa" as argument.

# establish the paths of the samples included in soundcorrs
path.trans.com <- system.file ("extdata", "trans-common.tsv", package="soundcorrs")
path.trans.ipa <- system.file ("extdata", "trans-ipa.tsv", package="soundcorrs")

# and load them
trans.com <- read.transcription (path.trans.com)
trans.ipa <- read.transcription (path.trans.ipa)
#> Warning in transcription(data, col.grapheme, col.meta, col.value): Missing the
#> metacharacters column. The "META" column was generated.

# transcription needs to be an object of class ‘transcription’
class (trans.com)
#> [1] "transcription"

# a basic summary
trans.com
#> A "transcription" object.
#>   File: /tmp/RtmpzbH71G/Rinst65ed1bddbb28/soundcorrs/extdata/trans-common.tsv.
#>   Graphemes: 75.

# ‘data’ is the original data frame
# ‘cols’ is a guide to column names in ‘data’
# ‘meta’ is a vector of characters which act as metacharacters
# ‘values’ is a named list of the values of graphemes, exploded into vectors
# ‘zero’ are the characters denoting the linguistic zero
str (trans.com, max.level=1)
#> List of 5
#>  $ data  :'data.frame':  75 obs. of  3 variables:
#>  $ cols  :List of 3
#>  $ meta  : Named chr [1:7] "(ᴍ|m|p|ʙ|b|φ|β|f|v|n|ɴ|t|ᴅ|d|c|ʒ|s|ᴢ|z|θ|δ|l|ʟ|č|ǯ|š|ž|ś|ź|ć|r|ʀ|ŋ|k|ɢ|g|χ|γ|h|ɦ)" "(ᴍ|m|n|ɴ|ŋ)" "(p|ʙ|b|t|ᴅ|d|k|ɢ|g)" "(s|ᴢ|z|š|ž|ś|ź)" ...
#>   ..- attr(*, "names")= chr [1:7] "C" "N" "P" "S" ...
#>  $ values:List of 75
#>  $ zero  : chr "-"
#>  - attr(*, "class")= chr "transcription"
#>  - attr(*, "file")= chr "/tmp/RtmpzbH71G/Rinst65ed1bddbb28/soundcorrs/extdata/trans-common.tsv"

3.0.2 Linguistic datasets

Loading datasets is a little more complex, an unfortunate side effect of flexibility. This is considerably more pronounced in the CLI than in the GUI.

In the GUI, a file containing the dataset is selected through the ‘Browse’ button on the left of the middle row of the ‘Data’ page. Once a file containing a valid dataset in the “wide format” is selected (cf. subsection Linguistic datasets above), a series of checkboxes appears in the centre of the middle row. It contains the names of the columns in the file, and the user is asked to select those columns which contain the aligned and segmented pairs/triples/… of words. As was mentioned above, alignment and segmentation are two time-consuming steps that can be skipped if soundcorrs is only used for sound changes and nothing else. When ready, click the ‘Load’ button below. Note that the data from all the languages will be assigned the transcription that is currently selected on the right side of the top row.

In the CLI, datasets are loaded using the read.soundcorrs() function, which is a wrapper around the constructor function for the soundcorrs class. Whenever possible, it is recommended to use the wrapper rather than the constructor function directly. The arguments for the wrapper are: the path to the file (in the “wide format”), the name of the language, the name of the column with the segmented, aligned data, the path to the transcription file, and, optionally, the character used as the segment separator. This amounts to a rather lengthy call, and would amount to an extremely long one if multiple languages were to be loaded simultaneously from different files and with different transcriptions, which is why both read.soundcorrs() and the soundcorrs() constructor function only load data from a single language at a time. The resulting single-language soundcorrs objects can then be merged using the merge() function. For details about the two functions, see their respective documentation by running ?read.soundcorrs and ?soundcorrs.

soundcorrs has three sample datasets: 1. the entirely made-up ‘data-abc.tsv’; 2. ‘data-capitals.tsv’ which contains the names of EU capitals in German, Polish and Spanish – from the linguistic point of view, this of course makes no sense; it is merely an example that will hopefully not be seen as too exotic regardless of which language or languages the user specializes in (my gratitude is due to José Andrés Alonso de la Fuente, PhD (Cracow, Poland) for help with Spanish data); and 3. ‘data-ie.tsv’ with a dozen examples of Grimm’s and Verner’s laws (adapted from Campbell L. 2013. Historical Linguistics. An Introduction. Edinburgh University Press. Pp. 136f). The ‘abc’ dataset is in the “long format” and requires conversion before loading; the ‘capitals’ and ‘ie’ datasets are in the “wide format”. In the GUI, they are converted and loaded automatically, and can be previewed using the ‘View’ button. In the CLI, the paths of the data files can be established with system.file("extdata", "nameOfTheFile", package="soundcorrs"), or the entire soundcorrs objects can be loaded using loadSampleDataset() with "data-abc", "data-capitals", or "data-ie" as argument.

# establish the paths of the three datasets
path.abc <- system.file ("extdata", "data-abc.tsv", package="soundcorrs")
path.cap <- system.file ("extdata", "data-capitals.tsv", package="soundcorrs")
path.ie <- system.file ("extdata", "data-ie.tsv", package="soundcorrs")

# read “capitals”
d.cap.ger <- read.soundcorrs (path.cap, "German", "ALIGNED.German", path.trans.com)
#> Warning in soundcorrs(data, name, col.aligned,
#> read.transcription(transcription), : The following segments are not covered by
#> the transcription: ŋk, jus.
d.cap.pol <- read.soundcorrs (path.cap, "Polish", "ALIGNED.Polish", path.trans.com)
#> Warning in soundcorrs(data, name, col.aligned,
#> read.transcription(transcription), : The following segments are not covered by
#> the transcription: ń, ẃ.
d.cap.spa <- read.soundcorrs (path.cap, "Spanish", "ALIGNED.Spanish", path.trans.com)
#> Warning in soundcorrs(data, name, col.aligned,
#> read.transcription(transcription), : The following segments are not covered by
#> the transcription: ð, ŋk, ja.
d.cap <- merge (d.cap.ger, d.cap.pol, d.cap.spa)

# read “ie”
d.ie.lat <- read.soundcorrs (path.ie, "Lat", "LATIN", path.trans.com)
d.ie.eng <- read.soundcorrs (path.ie, "Eng", "ENGLISH", path.trans.ipa)
#> Warning in transcription(data, col.grapheme, col.meta, col.value): Missing the
#> metacharacters column. The "META" column was generated.
#> Warning in soundcorrs(data, name, col.aligned,
#> read.transcription(transcription), : The following segments are not covered by
#> the transcription: eɪ, ɪ, aʊ, uː, ɑː, ʊ, iː.
d.ie <- merge (d.ie.lat, d.ie.eng)

# read “abc”
tmp <- long2wide (read.table(path.abc,header=T), skip=c("ID"))
d.abc.l1 <- soundcorrs (tmp, "L1", "ALIGNED.L1", trans.com)
d.abc.l2 <- soundcorrs (tmp, "L2", "ALIGNED.L2", trans.com)
d.abc <- merge (d.abc.l1, d.abc.l2)

# some basic summary
d.abc.l1
#> A "soundcorrs" object.
#>   Languages (1): L1.
#>   Entries: 6.
#>   Columns (7): ID, DIALECT.L1, ALIGNED.L1, ORTHOGRAPHY.L1, DIALECT.L2, ALIGNED.L2, ORTHOGRAPHY.L2.
d.abc
#> A "soundcorrs" object.
#>   Languages (2): L1, L2.
#>   Entries: 6.
#>   Columns (7): ID, DIALECT.L1, ALIGNED.L1, ORTHOGRAPHY.L1, DIALECT.L2, ALIGNED.L2, ORTHOGRAPHY.L2.

# ‘cols’ are the names of important columns
# ‘data’ is the original data frame
# ‘names’ are the names of the languages
# ‘segms’ are words exploded into segments; ‘$z’ is a variant with linguistic zeros; ‘$nz’ without them
# ‘segpos’ is a lookup list to check which character belongs to which segment; ‘$z’ is a variant with linguistic zeros; ‘$nz’ without them
# ‘separators’ are the strings used as segment separators
# ‘trans’ are ‘transcription’ objects
# ‘words’ are words obtained by removing separators from the ‘col.aligned’ column; ‘$z’ is a variant with linguistic zeros; ‘$nz’ without them
str (d.abc, max.level=1)
#> List of 8
#>  $ cols      :List of 2
#>  $ data      :'data.frame':  6 obs. of  7 variables:
#>  $ names     : chr [1:2] "L1" "L2"
#>  $ segms     :List of 2
#>  $ segpos    :List of 2
#>  $ separators: chr [1:2] "\\|" "\\|"
#>  $ trans     :List of 2
#>  $ words     :List of 2
#>  - attr(*, "class")= chr "soundcorrs"

3.0.3 Sound changes

In the GUI, a file containing one or more sound changes is selected through the ‘Browse’ button on the left side of the bottom row of the ‘Data’ page. Accepted are .R files, i.e. files with R code in them, in which one or more soundchange objects are created and assigned to variables. This means that, at least as of the current version, there is no escape from the CLI. Definitions of sound changes contain references to a transcription; when loading through the GUI, a special variable ..selectedTranscription.. (the dots are part of the name of the variable) is made available; as its name suggests, it points to the currently selected transcription. The same ‘Browse’ button can be used to load a modified set of sound changes saved from the ‘Sound changes’ page (see subsection Practice in section Sound changes).

In the CLI, sound changes are loaded using the constructor function for the soundchange class. Its main argument can be either a function, or a character string which will be converted to a function. As was mentioned in subsection Sound changes above, a sound change function must take two arguments: x, a string to apply the change to; and meta, additional metadata. The return value must be a character string. If a string is given instead of a function, it must be in the format “x > y” (or “y < x”; spaces optional). Both parts of the string may contain regular expressions: R’s standard ones as well as the user’s custom ones (see subsection Transcription above). Lookarounds and other such features are disabled by default; they can be turned on by setting the perl argument to TRUE, but be aware that this also brings about other changes in behaviour; see this post for a detailed comparison. Also, make sure to see the caveat about round brackets and custom metacharacters in subsection Sound changes above.

Other arguments taken by soundchange() are: name, transcription, and optionally description and perl. See the documentation (run ?soundchange) for a more detailed description.

soundcorrs contains three sample change files: 1. a regressive implementation of the *dl > *l simplification in Slavic: ‘change-dl2l.R’; 2. a progressive implementation of the first palatalization in Slavic: ‘change-palatalization.R’; and 3. a progressive implementation of rhotacism in Latin: ‘change-rhotacism.R’. In the GUI, they are loaded automatically, and can be previewed using the ‘View’ button. In the CLI, the paths of the data files can be established with system.file("extdata", "nameOfTheFile", package="soundcorrs"), as was done with the transcription and soundcorrs objects above, or the entire soundchange objects can be loaded using loadSampleDataset() with "change-dl2l", "change-palatalization", or "change-rhotacism" as the argument. Note that these sample changes are stored as pieces of R code, so they must be loaded using source() rather than read.table() or similar.
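
For instance, the rhotacism sample can be loaded in either of two ways (the name of the variable defined inside the sourced file is not guessed at here; loadSampleDataset() returns the object directly):

# via the raw file – it is R code, hence source() …
path.rhot <- system.file ("extdata", "change-rhotacism.R", package="soundcorrs")
source (path.rhot)

# … or, more conveniently
sc.rhot <- loadSampleDataset ("change-rhotacism")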

# a simple sound change
sc.V2a <- soundchange ("V > a", "V>a", trans.com, "All vowels change into a.")

# basic summary
sc.V2a
#> A "soundchange" object.
#>   Name: V>a.
#>   Description: All vowels change into a.
#>   Transcription: /tmp/RtmpzbH71G/Rinst65ed1bddbb28/soundcorrs/extdata/trans-common.tsv.

# ‘name’ is the name of the sound change
# ‘desc’ is a brief description
# ‘fun’ is the sound change function
# ‘trans’ is the transcription used in the change function
str (sc.V2a, max.level=1)
#> List of 4
#>  $ name : chr "V>a"
#>  $ desc : chr "All vowels change into a."
#>  $ fun  :function (x, meta)  
#>  $ trans:List of 5
#>   ..- attr(*, "class")= chr "transcription"
#>   ..- attr(*, "file")= chr "/tmp/RtmpzbH71G/Rinst65ed1bddbb28/soundcorrs/extdata/trans-common.tsv"
#>  - attr(*, "class")= chr "soundchange"

# if need be, functions inside ‘soundchange’ objects can be applied directly
sc.V2a$fun ("ouroboros", NULL)
#> [1] "aarabaras"

# a slightly more complex change
sc.VV2a <- soundchange ("V{2,} > a", "VV>a", trans.com, "Only diphthongs change into a.")
sc.VV2a$fun ("ouroboros", NULL)
#> [1] "aroboros"

# a slightly more complex change
sc.CV2Ca <- soundchange ("CV > \\1a", "CV>Ca", trans.com, "Only postconsonantal vowels change into a.")
sc.CV2Ca$fun ("ouroboros", NULL)
#> [1] "ourabaras"

# a more complex sound change
sc.2ndV2a.fun <- function (x, meta) {
    tmp <- gregexpr (expandMeta(trans.com,"V+"), x)
    regmatches(x,tmp)[[1]][2] <- "a"
    return (x)
}
sc.2ndV2a <- soundchange (sc.2ndV2a.fun, "2ndV>a", trans.com,
    "Only the vowel in the second syllable changes into a.")
sc.2ndV2a$fun ("ouroboros", NULL)
#> [1] "ouraboros"

4 Extraction of examples

One of the uses for segmentation and alignment (see subsection Linguistic datasets in section Data preparation) is the extraction of examples. Using regular search tools, it is usually easy to find examples which contain a specific sound or sequence of sounds, e.g. all the words in L1 that contain a, and all the words in L2 that contain e, but it is virtually impossible to only find such pairs where L1 a is in the same position inside the word as L2 e, i.e. where L1 a corresponds to L2 e. This is in fact the ‘founding problem’ of soundcorrs, the one it was originally written to solve.

4.1 Theory

The main arguments are naturally the queries. Technically, as many queries need to be given as there are languages in the dataset. However, each query can be an empty string, in which case everything will be considered a match; if all the queries are empty strings, soundcorrs will simply return the entire dataset. Queries can be regular expressions in any of the four available flavours: R standard (the TRE engine), R Perl-like (the PCRE engine, which needs to be explicitly turned on), user’s custom defined through the transcription, and user’s custom in the binary notation. Regarding the first two, see this post for further details; regarding the latter two, see subsection Transcription. The first two differ slightly in behaviour, so only one of them can be chosen at a time. Whichever is chosen, however, can be freely mixed with the latter two, both between queries and inside a single query.

The second important set of parameters are the distances. As was said above, the crux of extracting examples with soundcorrs is that it looks for sequences of sounds which occupy the same segment in all the words in a pair/triple/…. Sometimes, however, it may be desirable to also look for examples which occupy almost the same segment. For example, English words with the -er suffix, such as computer or farmer, have been borrowed into multiple languages in which the r is fully pronounced, unlike in the English original. The correspondence is therefore Engl. ə : non-Engl. er, and in segment terms, Engl. ə|- : non-Engl. e|r. Such examples would not be found by simply searching for Engl. ə corresponding to non-Engl. er, because the former only occupies one segment while the latter spans two. The distance parameters allow soundcorrs to include examples in which the matches have been found in places that are n segments apart. In the case of -er, the start distance (distance.start) can be 0, but the end distance (distance.end) will need to be 1 or greater. Setting either distance to -1 means that this distance is not taken into account at all. If both are set to -1, the result is equivalent to searching for one sequence in one column and, entirely independently, for another sequence in another column. It may therefore seem irresponsible to set the default values of both arguments to -1, but in my experience, this very rarely produces false positives. The opposite behaviour (both arguments set to 0), on the other hand, may easily result in false negatives, which are errors that are not only of a much less intuitive kind, but also never give the user a chance to spot the problem, as they are simply not displayed.

Sometimes, one of the words in a pair/triple/… will be missing, which in R is represented by the symbol NA. The user is free to choose whether such cases should be considered a match or not.

As was illustrated in subsection Linguistic datasets in section Data preparation, linguistic zeros are indispensable for alignment. They do, however, make the extraction of examples considerably more difficult and error-prone for the user, which is why they are ignored during the search by default. Should the need to include them arise, however, the user is free to do so.

Lastly, the search is performed on segmented and aligned columns and by default, it is those columns that are displayed as the result. But since they are typically not very convenient to read for a human, the user may choose to display other columns instead.

4.2 Practice

In the GUI, the ‘Examples’ page is divided into three rows. The top one is used for selecting the dataset, and the bottom one contains a cheat sheet with the most common regular expressions. The middle row is further divided into two parts: on the left, on a grey background, are all the parameters discussed in subsection Theory above; on the right, on a white background, is where the results are displayed after the ‘Apply’ button at the top of the grey field is clicked.

In the CLI, three functions are available: findExamples() is the primary function for the extraction of examples; findPairs() is a convenience wrapper around findExamples() for when there are only two languages in the dataset; and allPairs() produces an almost print-ready summary of the dataset, complete with tables and all the examples. Details about the usage of those functions can be found in their documentation (?findExamples, etc.); here, only a general outline and several caveats will be given.

In findExamples(), all the parameters beyond the queries must be named in a call to the function because there is no way for R to know a priori how many languages there are in the dataset, and therefore how many queries there will be. The return value of findExamples() is a list with two fields: ‘data’ which is a data frame with matching examples, and ‘which’, a logical vector showing which examples in the original dataset were a match. The class of the return value is ‘df.findExamples’; this is purely for technical reasons, to allow for a more legible printed output.

# “ab” spans segments 1–2, while “a” only occupies segment 1
findExamples (d.abc, "ab", "a", distance.end=0)
#> No matches found.
findExamples (d.abc, "ab", "a", distance.end=1)
#>   ALIGNED.L1 ALIGNED.L2
#> 1      a|b|c      a|b|c
#> 2    a|b|a|c    a|b|a|c
#> 5    a|b|c|-    a|b|c|ə
#> 6  a|b|a|c|-  a|b|a|c|ə

# linguistic zeros cannot be found if ‘zeros’ is set to ‘FALSE’
findExamples (d.abc, "-", "", zeros=T)
#>   ALIGNED.L1 ALIGNED.L2
#> 5    a|b|c|-    a|b|c|ə
#> 6  a|b|a|c|-  a|b|a|c|ə
findExamples (d.abc, "-", "", zeros=F)
#> No matches found.

# both the usual and custom regular expressions are permissible
findExamples (d.abc, "a", "[ou]")
#>   ALIGNED.L1 ALIGNED.L2
#> 3      a|b|c      o|b|c
#> 4    a|b|a|c    u|w|u|c
findExamples (d.abc, "a", "O")
#>   ALIGNED.L1 ALIGNED.L2
#> 3      a|b|c      o|b|c
#> 4    a|b|a|c    u|w|u|c

# the output is actually a list
str (findExamples(d.abc,"a","a"), max.level=1)
#> List of 2
#>  $ data :'data.frame':   4 obs. of  2 variables:
#>  $ which: logi [1:6] TRUE TRUE FALSE FALSE TRUE TRUE
#>  - attr(*, "class")= chr "df.findExamples"

# ‘data’ is what is displayed on the screen
# ‘which’ is useful for subsetting
subset (d.abc, findExamples(d.abc,"a","O")$which)
#> A "soundcorrs" object.
#>   Languages (2): L1, L2.
#>   Entries: 2.
#>   Columns (7): ID, DIALECT.L1, ALIGNED.L1, ORTHOGRAPHY.L1, DIALECT.L2, ALIGNED.L2, ORTHOGRAPHY.L2.

# ‘which’ can also be used to find examples
#    that exhibit more than one correspondence.
aaa <- findExamples (d.cap, "a", "a", "a", distance.start=0, distance.end=0)$which
bbb <- findExamples (d.cap, "b", "b", "b", distance.start=0, distance.end=0)$which
d.cap$data [aaa & bbb,]
#>        ALIGNED.German ORTHOGRAPHY.German      ALIGNED.Polish ORTHOGRAPHY.Polish
#> 4 b|r|a|t|ī|s|l|a|v|a         Bratislava b|r|a|t|y|s|w|a|v|a         Bratysława
#> 6     b|ū|d|a|p|ä|s|t           Budapest     b|u|d|a|p|e|š|t          Budapeszt
#>       ALIGNED.Spanish ORTHOGRAPHY.Spanish OFFICIAL.LANGUAGE
#> 4 b|r|a|t|i|z|l|a|β|a          Bratislava            Slovak
#> 6     b|u|ð|a|p|e|s|t            Budapest         Hungarian

# the ‘cols’ argument can be used to alter the printed output
findExamples (d.abc, "a", "O", cols=c("ORTHOGRAPHY.L1","ORTHOGRAPHY.L2"))
#>   ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> 3            abc           aobc
#> 4           abac           uwuc

findPairs() is a convenience wrapper around findExamples() which only takes four arguments: the dataset, the two queries, and mode. Mode can be ‘exact’ (1 or TRUE), which corresponds to the strictest settings in findExamples(); ‘inexact’ (0 or FALSE), which corresponds to the most lenient settings; or a middle ground (0.5), which corresponds to perhaps the most useful settings. The output of findPairs() is the same as the output of findExamples().
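
A quick sketch, reusing d.abc from section Loading data; with only two languages, findPairs() makes for considerably shorter calls than findExamples():

# the same search as findExamples (d.abc, "a", "O"), with default settings
findPairs (d.abc, "a", "O")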

allPairs() does not have great analytic value in itself, but it can be useful for preparing the material part of a paper. Its output consists of sections devoted to each segment, which contain a general contingency table of the segment’s various renderings, as well as subsections which list all the pairs that exhibit the given correspondence. soundcorrs provides functions to format such output in HTML, in LaTeX, or not at all. Custom formatters are also not very difficult to write. Regarding the tables, see subsection Practice in section Tables.

# see what result allPairs() gives
allPairs (d.abc, cols=c("ORTHOGRAPHY.L1","ORTHOGRAPHY.L2"))
#> section  [1] "-"
#> table    ə 
#> table    2 
#> subsection   [1] "-" "ə"
#> data.frame     ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame   5            abc           abca
#> data.frame   6           abac          abaca
#> section  [1] "a"
#> table    a o u 
#> table    4 1 1 
#> subsection   [1] "a" "a"
#> data.frame     ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame   1            abc            abc
#> data.frame   2           abac           abac
#> data.frame   5            abc           abca
#> data.frame   6           abac          abaca
#> subsection   [1] "a" "o"
#> data.frame     ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame   3            abc           aobc
#> subsection   [1] "a" "u"
#> data.frame     ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame   4           abac           uwuc
#> section  [1] "b"
#> table    b w 
#> table    5 1 
#> subsection   [1] "b" "b"
#> data.frame     ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame   1            abc            abc
#> data.frame   2           abac           abac
#> data.frame   3            abc           aobc
#> data.frame   5            abc           abca
#> data.frame   6           abac          abaca
#> subsection   [1] "b" "w"
#> data.frame     ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame   4           abac           uwuc
#> section  [1] "c"
#> table    c 
#> table    6 
#> subsection   [1] "c" "c"
#> data.frame     ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame   1            abc            abc
#> data.frame   2           abac           abac
#> data.frame   3            abc           aobc
#> data.frame   4           abac           uwuc
#> data.frame   5            abc           abca
#> data.frame   6           abac          abaca

# a clearer result could be obtained by running
# allPairs (d.cap, cols=c("ORTHOGRAPHY.German","ORTHOGRAPHY.Polish"),
#    file="~/Desktop/d.cap.html", formatter=formatter.html)

As was mentioned, the “capitals” dataset is linguistically absurd, and so it should not matter that all the Polish names of European capitals are listed as borrowed from German. If, however, one wished to fix this problem, and do it not by copying the output to a word processor and replacing ">" with ":" there, but rather inside soundcorrs, this wish can be fulfilled easily enough. First, the existing formatter.html() function needs to be written to a file to serve as a base for the new formatter: dput(formatter.html, "~/Desktop/myFormatter.R"). Then, the beginning of the first line of this file needs to be changed to something like myFormatter <- function…, and finally, the ">" and "<" signs (written in HTML as &gt; and &lt;, respectively) need to be replaced with a colon. All that is then left is to load the new function into R and use it to format the output of allPairs():

# load the new formatter function …
# source ("~/Desktop/myFormatter.R")

# … and use it instead of formatter.html()
# allPairs (d.cap, cols=c("ORTHOGRAPHY.German","ORTHOGRAPHY.Polish"),
#    file="~/Desktop/d.cap.html", formatter=myFormatter)
# note that this time the output will not open in the web browser automatically

5 Sound changes

The manual application of a series of sound changes to a series of words is not only a time-consuming endeavour, it is also prone to errors due to the monotony of the task. soundcorrs provides a way to automate it and, importantly, allows for changes to be implemented both progressively and regressively.

5.1 Theory

Like most tasks in soundcorrs, the application of sound changes requires the dataset to be a soundcorrs object. Unlike most other tasks, however, sound changes do not require the data to be segmented and aligned. In fact, it is generally recommended that they are not, since otherwise sound change functions would be required to deal with segment separators and linguistic zeros. Should this be desirable, e.g. in order to only apply changes to certain segments, perhaps affixes but not roots, it can be implemented in a fashion similar to the syllables example in subsection Sound changes in section Loading data.

The obvious arguments then are the dataset, and a list of changes. Needless to say, the order of the changes in the list is important, as it is in this order that they will be applied.

Through the remaining arguments, the user can indicate between one and three columns in the dataset. The ‘source’ column is the one from which soundcorrs will take the words to apply the sound changes to. Optionally, a ‘target’ column can be selected, with forms that will be compared against the forms resulting from the application of the changes. Lastly, the optional ‘metadata’ column contains data that will be passed to the sound change functions under the meta argument; see subsection Sound changes in section Data preparation.

The last parameter is more technical: it allows the user to enable or disable the highlighting of differences between the post-change forms and the target forms, as well as between subsequent intermediate forms. It has no impact on the forms themselves, but with large datasets and multiple sound changes, the difference in performance may become noticeable.

The output is composed of three parts: a list of the final forms, a list of trees containing all the intermediate forms, and a list recording whether any of the final forms matches the target.

5.2 Practice

In the GUI, the dataset as well as all the columns in it are selected from drop-down menus in the top row. In the bottom row, the highlighted field on the left contains the list of changes and the ‘Apply’ button. Changes can be reordered by dragging them around, and they can also be activated and deactivated. Deactivating a change does not cause it to be unloaded; it remains in the list and can be activated again without having to be reloaded on the ‘Data’ page. To the right of the field highlighted in grey is the area where the results are displayed. The table can be conveniently searched using the field in the top right corner, and it can also be filtered using the fields at the bottom of each column. The ‘View’ buttons in the rightmost column display the intermediate forms. The differences between the final and the target forms, as well as those between subsequent intermediate forms, are all highlighted in red, regardless of whether it is a missing character, an additional character, or a different character. Missing characters are indicated by underscores ‹_›. Modifications made in the GUI to the set of sound changes (reorderings and deactivations) can be saved using the ‘Save’ button below the changes. The saved .rda file can be loaded in the same way as R files (see subsection Sound changes in section Loading data).

In the CLI, only one function is available, applyChanges(). Its arguments are: the dataset; the list of changes; the names of the source, target, and metadata columns; and highlight. The last three are optional, and are NULL by default. The highlight argument can be set to "console" or "HTML" to highlight the differences between the post-change and target forms, either for output to the console or for a web browser.

The return value of applyChanges() is a list, technically of class list.applyChanges, which contains three elements: $end with the final forms (this is the only element that is printed by default); $tree which saves the path by which the final forms in $end have been arrived at; and $match which is a named list of the results of comparison to the target forms. The values in $match can be: 0 if none of the results matches the target, 0.5 if at least one but not all of the results match the target, and 1 if all the results match the target.

# prepare a list of changes, in the order of application
#    for the definitions, see subsection Sound changes in section Loading data
sc.list <- list (sc.VV2a, sc.2ndV2a, sc.CV2Ca)

# prepare the data and the expected results
#    (warnings can be safely ignored in this case)
tmp.l1 <- soundcorrs(data.frame(SOURCE=c("ouroboros","jormungandr")), "L1", "SOURCE", trans.com)
tmp.l2 <- soundcorrs(data.frame(TARGET=c("arabaras","jarmangandr")), "L2", "TARGET", trans.com)
dataset <- merge (tmp.l1, tmp.l2)

# and apply the changes to our data
res <- applyChanges (dataset, sc.list, "SOURCE", "TARGET", NULL)
res
#> $`ouroboros`
#> "arabaras"
#> 
#> $`jormungandr`
#> "jormangandr"

# see if they match the expectations
res$match
#> $ouroboros
#> [1] 1
#> 
#> $jormungandr
#> [1] 0

# see which change did not work as expected
#    it was CV > Ca because our changes use the sample "common" transcription,
#    and j does not count in it as a consonant (it's a semivowel)
res$tree
#> 1 ouroboros [VV>a]
#> 2 .. a_roboros [2ndV>a]
#> 3 .. .. araboros [CV>Ca]
#> 4 .. .. .. arabaras
#> 1 jormungandr [VV>a]
#> 2 .. jormungandr [2ndV>a]
#> 3 .. .. jormangandr [CV>Ca]
#> 4 .. .. .. jormangandr

6 Contingency tables

Besides the extraction of examples (see section Extraction of examples above), the main gain from segmentation and alignment is contingency tables. soundcorrs can produce three different kinds, which may serve a variety of purposes.

6.1 Theory

The three kinds of contingency tables produced by soundcorrs are: segment-to-segment, correspondence-to-correspondence, and correspondence-to-metadata. Let us briefly introduce each in this order.

Segment-to-segment is the simplest kind. It shows how often segments from one language correspond to segments from another language. For example, in Pol. kark ‘neck’ : Cz. krk id., Pol. k corresponds twice to Cz. k, Pol. r once to Cz. r, and Pol. a once to Cz. ∅. Uses for such a table can be varied, e.g. to detect regularities, as well as irregularities, between two sets, or to determine whether establishing regularities is even possible.

A correspondence-to-correspondence table, on the other hand, shows how often a given correspondence co-occurs in the same word with another correspondence. In the example above, the Pol. k : Cz. k correspondence co-occurs in one word with three correspondences: 1. Pol. r : Cz. r, 2. Pol. a : Cz. ∅, and 3. Pol. k : Cz. k. Such a table can be useful e.g. for separating correspondences which always hold from those that depend on the phonetic surrounding or other factors.

Lastly, a correspondence-to-metadata table shows how often a given correspondence co-occurs with a piece of metadata. The ‘piece of metadata’ can be almost anything, depending on the purpose of the analysis. Typically, categorical data will be used, such as the name of the consultant who provided the given pronunciation, the year of the recording, the village where the recording was made, the age of the consultant, the dialect he or she spoke, etc.
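
Rendered in code, the kark : krk example might look like the following sketch (assuming trans.com from section Loading data covers the graphemes involved; summary() itself is discussed in subsection Practice below):

# a one-pair dataset: Pol. kark ‘neck’ : Cz. krk id.
kark <- data.frame (ALIGNED.Pol="k|a|r|k", ALIGNED.Cze="k|-|r|k")
d.pol <- soundcorrs (kark, "Pol", "ALIGNED.Pol", trans.com)
d.cze <- soundcorrs (kark, "Cze", "ALIGNED.Cze", trans.com)
summary (merge(d.pol,d.cze))
# expected counts: k : k twice, r : r once, a : - (the linguistic zero) once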

All three kinds of tables can be calculated in four different ways, determined by two parameters.

The first is named ‘count’ in soundcorrs; it determines whether the results are given in absolute or in relative numbers. Note that the latter are not relative to the entire table, as that would be hardly informative, but rather to all the correspondences of the given segment. In the case of segment-to-segment tables, this means that e.g. all the counterparts of L1 a add up to 1; in the case of correspondence tables, all the counterparts of L1 a_ add up to 1, and so do all the counterparts of L1 _a (the underscore being the character used to connect correspondences, so that e.g. Pol. o : Cz. e is notated "o_e").

The second parameter is called ‘unit’. It determines whether soundcorrs should count the total number of occurrences of the given correspondence, or only the number of words in which it is attested. In the kark example above, Pol. k corresponds to Cz. k twice, but only in one word; and similarly, the Pol. k : Cz. k correspondence co-occurs twice with Pol. r : Cz. r, but only in one word.

6.2 Practice

In the GUI, the dataset and, in the case of correspondence-to-metadata tables, the metadata column are selected in the top row. The bottom row is divided into two. On the left, in the field with the grey background, the parameters ‘Count’ and ‘Unit’ can be set. On the right are three tabs which correspond to the three kinds of tables soundcorrs produces. Results appear after the ‘Apply’ button is clicked. Tables can be searched using the field in the top right corner and, perhaps more conveniently, filtered using the fields below the columns.

In the CLI, three functions are available, but they do not correspond directly to the three kinds of tables. Segment-to-segment tables are produced by summary(), the two types of correspondence tables are produced by coocc(), and there is also allCooccs() which outputs multiple tables at once.

summary() only takes three arguments: the dataset, ‘count’, and ‘unit’.

# a general overview of the dataset as a whole
summary (d.abc)
#>    L2
#> L1  a b c o u w ə
#>   - 0 0 0 0 0 0 2
#>   a 4 0 0 1 1 0 0
#>   b 0 5 0 0 0 1 0
#>   c 0 0 6 0 0 0 0

# words are the default ‘unit’; here, occurrences are counted instead
summary (d.abc, unit="o")
#>    L2
#> L1  a b c o u w ə
#>   - 0 0 0 0 0 0 2
#>   a 6 0 0 1 2 0 0
#>   b 0 5 0 0 0 1 0
#>   c 0 0 6 0 0 0 0

# in relative values …
rels <- summary (d.abc, count="r")
round (rels, 2)
#>    L2
#> L1     a    b    c    o    u    w    ə
#>   - 0.00 0.00 0.00 0.00 0.00 0.00 1.00
#>   a 0.67 0.00 0.00 0.17 0.17 0.00 0.00
#>   b 0.00 0.83 0.00 0.00 0.00 0.17 0.00
#>   c 0.00 0.00 1.00 0.00 0.00 0.00 0.00

# … relative to entire rows
apply (rels, 1, sum)
#> - a b c 
#> 1 1 1 1

coocc() has the same arguments as summary(), plus one more. The additional one is column; when set to NULL, coocc() will produce a correspondence-to-correspondence table; when set to the name of a column in the dataset, coocc() will cross-tabulate correspondences with the data from that column.

# a general look in the internal mode
coocc (d.abc)
#>      L1_L2
#> L1_L2 -_ə a_a a_o a_u b_b b_w c_c
#>   -_ə   0   2   0   0   2   0   2
#>   a_a   2   2   0   0   4   0   4
#>   a_o   0   0   0   0   1   0   1
#>   a_u   0   0   0   1   0   1   1
#>   b_b   2   4   1   0   0   0   5
#>   b_w   0   0   0   1   0   0   1
#>   c_c   2   4   1   1   5   1   0

# now with metadata
coocc (d.abc, "DIALECT.L2")
#>      DIALECT.L2
#> L1_L2 north south std
#>   -_ə     0     2   0
#>   a_a     0     2   2
#>   a_o     1     0   0
#>   a_u     1     0   0
#>   b_b     1     2   2
#>   b_w     1     0   0
#>   c_c     2     2   2

# in the internal mode,
#    the relative values are with regard to segment-to-segment blocks
tab <- coocc (d.abc, count="r")
rows.a <- which (rownames(tab) %hasPrefix% "a")
cols.b <- which (colnames(tab) %hasPrefix% "b")
sum (tab [rows.a, cols.b])
#> [1] 1

# there are four different segments in L1, so the whole table
#    should sum to 16, but some cells are NaN (see below)
sum (tab)
#> [1] NaN

# if two correspondences never co-occur, the relative value is 0/0
#    which R represents as ‘NaN’, and prints as empty space
coocc (d.abc, count="r")
#>      L1_L2
#> L1_L2       -_ə       a_a       a_o       a_u       b_b       b_w       c_c
#>   -_ə           1.0000000 0.0000000 0.0000000 1.0000000 0.0000000 1.0000000
#>   a_a 1.0000000 0.6666667 0.0000000 0.0000000 0.6666667 0.0000000 0.6666667
#>   a_o 0.0000000 0.0000000 0.0000000 0.0000000 0.1666667 0.0000000 0.1666667
#>   a_u 0.0000000 0.0000000 0.0000000 0.3333333 0.0000000 0.1666667 0.1666667
#>   b_b 1.0000000 0.6666667 0.1666667 0.0000000                     0.8333333
#>   b_w 0.0000000 0.0000000 0.0000000 0.1666667                     0.1666667
#>   c_c 1.0000000 0.6666667 0.1666667 0.1666667 0.8333333 0.1666667

# in the external mode,
#    the relative values are with regard to blocks of rows, and all columns
tab <- coocc (d.abc, "DIALECT.L2", count="r")
rows.a <- which (rownames(tab) %hasPrefix% "a")
sum (tab [rows.a, ])
#> [1] 1

Lastly, allCooccs() splits a table produced by coocc() into blocks, each containing the correspondences of one segment. Its primary purpose is to facilitate the application of tests of independence, for which see lapplyTest() in subsection Practice in section Fitting models.

allCooccs() takes all the same arguments as coocc(), and in addition, the argument bin which determines whether the table should be just cut up, or whether all the resulting slices should also be binned. On binning, see subsection binTable().

The return value of allCooccs() is a list which holds all the resulting tables, under names composed from the correspondences connected with underscores. If column is NULL and bin = F, the names will be "a", "b", etc.; if bin = T, they will be of the form "a_b_c_d", meaning L1 a : L2 b cross-tabulated with L1 c : L2 d, and so on. If column is not NULL, the names will be of the form "a_b_northern", meaning L1 a : L2 b tabulated against the ‘northern’ dialect, and so forth.
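
A hedged sketch of getting at the individual blocks; their exact names depend on the data and on the bin setting, so they are inspected rather than assumed:

# pull one block out of the result, e.g. to feed it to lapplyTest()
tabs <- allCooccs (d.abc)
names (tabs)    # see what the blocks are called
tabs [[1]]      # the first block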

# for a small dataset, the result is going to be small
str (allCooccs(d.abc), max.level=0)
#> List of 34

# but it can grow quite quickly with a larger dataset
str (allCooccs(d.cap), max.level=0)
#> List of 5614

# the naming scheme
names (allCooccs(d.abc))
#>  [1] "-_ə_a_a" "-_ə_a_o" "-_ə_a_u" "-_ə_b_b" "-_ə_b_w" "-_ə_c_c" "a_a_-_ə"
#>  [8] "a_a_b_b" "a_a_b_w" "a_a_c_c" "a_o_-_ə" "a_o_b_b" "a_o_b_w" "a_o_c_c"
#> [15] "a_u_-_ə" "a_u_b_b" "a_u_b_w" "a_u_c_c" "b_b_-_ə" "b_b_a_a" "b_b_a_o"
#> [22] "b_b_a_u" "b_b_c_c" "b_w_-_ə" "b_w_a_a" "b_w_a_o" "b_w_a_u" "b_w_c_c"
#> [29] "c_c_-_ə" "c_c_a_a" "c_c_a_o" "c_c_a_u" "c_c_b_b" "c_c_b_w"

# and with ‘column’ not ‘NULL’
names (allCooccs(d.abc,column="DIALECT.L2"))
#>  [1] "-_ə_north" "-_ə_south" "-_ə_std"   "a_a_north" "a_a_south" "a_a_std"  
#>  [7] "a_o_north" "a_o_south" "a_o_std"   "a_u_north" "a_u_south" "a_u_std"  
#> [13] "b_b_north" "b_b_south" "b_b_std"   "b_w_north" "b_w_south" "b_w_std"  
#> [19] "c_c_north" "c_c_south" "c_c_std"

7 Fitting models

Once tables, such as the ones discussed in section Contingency tables, are complete, the user may wish to fit various models to them. soundcorrs offers two functions for such occasions; their use is not limited to soundcorrs contingency tables.

7.1 Theory

A common problem with fitting models in quantitative linguistics is the number of tries one has to perform manually in order to establish the right starting estimates for the coefficients – or indeed the correct model and its starting estimates. The process is not easy to automate due to the warnings and errors that often accompany failed attempts.

soundcorrs has a function which fits multiple models to a single dataset, and an extension of it which fits multiple models to multiple datasets. The two have only one argument in common: models. Each model is represented as a list with two elements: $formula and $start. The latter must be a list, each of whose elements is itself a list of starting estimates for all the coefficients. Models prepared in this way are wrapped in a named list and passed to the functions as a single argument. See subsection Practice below for an example.

The remaining arguments differ considerably between the two functions, and it will be more convenient to discuss them separately while introducing the functions themselves.

7.2 Practice

As of the current version, the two functions cannot be accessed from the GUI.

In the CLI, the function that fits multiple models to a single dataset is called multiFit(), and the one that fits multiple models to multiple datasets is called fitTable(). Below is only a brief introduction; further details can be found in the documentation (run ?multiFit and ?fitTable).

multiFit() takes as arguments the dataset and a list of models. The user can also specify the fitting function, as well as pass additional arguments to it. The return value of multiFit() is a list which contains the outputs of the fitting function. Warnings and errors, which multiFit() suppresses, are attached to the individual elements of the output as attributes. Technically, the result is of class list.multiFit so that it can be passed to summary() to produce a table for easier comparison of the fits. The available metrics are aic, bic, rss (the default), and sigma. In addition, the output of fitTable() has an attribute depth; it is intended for summary(), and should not be edited by the user.

# prepare some random data
set.seed (27)
dataset <- data.frame (X=1:10, Y=1:10 + runif(10,-1,1))

# prepare models to be tested
models <- list (
    "model A" = list( formula="Y~a+X", start=list(list(a=1)) ),
    "model B" = list( formula="Y~a^X", start=list(list(a=-1),list(a=1)) ))
# normally, (-1)^X would produce an error with ‘nls()’

# fit the models to the dataset
fit <- multiFit (models, dataset)

# inspect the results
summary (fit)
#>      model A  model B
#> rss 4.059485 11.51618
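
Any of the other metrics listed above can be requested in the same way, for instance:

# the same comparison, this time by the Akaike information criterion
summary (fit, metric="aic")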

fitTable() applies multiFit() over a table, such as the ones produced by coocc() or summary(). The arguments are: the models, the dataset, margin (as in apply(): 1 for rows, 2 for columns), the converter function, and additional arguments passed to multiFit() (including the fitting function). The converter is a function that turns individual rows or columns of the table into data frames to which models can be fitted. soundcorrs provides three simple functions: vec2df.id() (the default one), vec2df.hist(), and vec2df.rank(). The first one only attaches a column of X values; the second extracts the midpoints and counts from a histogram; and the third ranks the data. Any function can be used, so long as it takes a numeric vector as the only argument, and returns a data frame. The names of columns in the data frames returned by these three functions are X and Y, something to be borne in mind when defining the formulae of the models.
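
If none of the three converters suits the data at hand, a custom one is easy to write. Below is a minimal sketch (the name vec2df.plain is invented for this example); note the X and Y column names which the formulae of the models refer to:

# a custom converter must take a numeric vector as its only argument,
#    and return a data frame
vec2df.plain <- function (vec)
    data.frame (X=seq_along(vec), Y=as.numeric(vec))

# it can then be plugged into fitTable() like the built-in converters:
#    fitTable (models, dataset, 1, vec2df.plain)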

As with multiFit(), the return value of fitTable() is a list of the outputs of the fitting function; in the case of fitTable(), however, it is nested. It, too, can be passed to summary() to produce a convenient table.

# prepare the data
dataset <- coocc (d.abc)

# prepare the models to be tested
models <- list (
    "model A" = list( formula="Y~a*(X+b)^2", start=list(list(a=1,b=1)) ),
    "model B" = list( formula="Y~a*(X-b)^2", start=list(list(a=1,b=1)) ))
# vanilla nls() often requires fairly accurate starting estimates

# fit the models to the dataset
fit <- fitTable (models, dataset, 1, vec2df.hist)

# inspect the results
summary (fit, metric="sigma")
#>               -_ə      a_a       a_o     a_u       b_b       b_w       c_c
#> model A        NA 1.272453        NA      NA        NA        NA        NA
#> model B 0.4291194  1.03122 0.9342932 0.72328 0.5919122 0.9342932 0.5919122

8 Helper functions

In addition to analytic functions, soundcorrs also exports several helpers which serve diverse purposes and do not form a thematically coherent whole. As of the current version, they are only accessible through the CLI. Below, they are discussed in alphabetical order. Further details can be found in the documentation of individual functions (run ?nameOfTheFunction).

8.1 addSeparators()

As was mentioned in subsection Linguistic datasets in section Data preparation, automatic segmentation and alignment require careful supervision, and in effect may prove easier in the end to do by hand. addSeparators() can facilitate the first half of this task by interspersing a vector of character strings with a separator.

# using the default ‘|’ …
addSeparators (d.abc$data$ORTHOGRAPHY.L1)
#> [1] "a|b|c"   "a|b|a|c" "a|b|c"   "a|b|a|c" "a|b|c"   "a|b|a|c"

# … or a full stop
addSeparators (d.abc$data$ORTHOGRAPHY.L1, ".")
#> [1] "a.b.c"   "a.b.a.c" "a.b.c"   "a.b.a.c" "a.b.c"   "a.b.a.c"

8.2 binTable()

It may sometimes happen that the data are insufficient for a test of independence, or that the contingency table is too diversified to draw concrete conclusions from it. binTable() takes one or more rows and one or more columns as arguments; it leaves those rows and columns unchanged, while summing up all the others.

# build a table for a slightly larger dataset
tab <- coocc (d.cap)

# let us focus on L1 a and o
rows <- which (rownames(tab) %hasPrefix% "a")
cols <- which (colnames(tab) %hasPrefix% "o")
binTable (tab, rows, cols)
#>       o_o_o non-o_o_o
#> a_a_a     0        57
#> a_a_o     0         6
#> a_a_u     0         5
#> other    16      1041

# or on all a-like and o-like vowels
rows <- which (rownames(tab) %hasPrefix% "[aāäǟ]")
cols <- which (colnames(tab) %hasPrefix% "[oōöȫ]")
binTable (tab, rows, cols)
#>       o_o_o ō_o_o ō_y_o other
#> a_a_a     0     1     0    56
#> a_a_o     0     0     0     6
#> a_a_u     0     0     0     5
#> ä_e_e     0     0     0    36
#> ā_-_-     1     0     0     6
#> ā_a_a     0     2     0    47
#> other    15    16     3   931

8.3 expandMeta()

Metacharacters defined in the transcription (“wildcards”) can be used inside sound changes, as well as inside findExamples() and findPairs() queries, but they can also be used with grep() or any other function. The only difference is that sound changes, findExamples(), and findPairs() automatically call expandMeta() to translate the metacharacters into regular expressions that vanilla R can understand, whereas for grep(), or any other function from outside soundcorrs, the user needs to call expandMeta() explicitly.

Beside the metacharacters defined in the transcription (see subsection Transcription in section Data preparation), expandMeta() can also understand ‘binary notation’, i.e. an enumeration of distinctive features such as “[+cons,-stop]”. The conditions are as follows:

  • the enumeration must be enclosed in square brackets,
  • it must contain the same features as are used in the VALUE column in the transcription,
  • each feature must have a “+” or “-” sign in front of it,
  • the features must be separated by commas, and
  • there can be no spaces inside the brackets.

Should any of those rules be broken, the would-be wildcard will be kept in the query string as is, and will surely fail to produce any match in the search.

# let us search a column other than the one specified as ‘aligned’
orth <- d.abc$data [, "ORTHOGRAPHY.L2"]

# look for all VCC sequences
query <- expandMeta (d.abc$trans[[1]], "VCC")
orth [grep(query,orth)]
#> [1] "abc"  "aobc" "abca"

# look for all VCC words
query <- expandMeta (d.abc$trans[[1]], "^VCC$")
orth [grep(query,orth)]
#> [1] "abc"

# the same in the binary notation
query <- expandMeta (d.abc$trans[[1]], "^[+vow][+cons][+cons]$")
orth [grep(query,orth)]
#> [1] "abc"

8.4 %hasPrefix% and %hasSuffix%

These operators check whether a string begins or ends with another string. In soundcorrs, this can be useful for extracting specific rows and columns from a contingency table.
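
A quick illustration of the syntax – both operators return a logical value, and as the examples below show, the pattern may also be a regular expression:

"northern" %hasPrefix% "north"
#> [1] TRUE
"northern" %hasSuffix% "north"
#> [1] FALSE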

# build a table for a slightly larger dataset
tab <- coocc (d.cap)

# it is quite difficult to read as a whole, so let us focus
#    on a-like vowels in L1 and s-like consonants in L2
rows <- which (rownames(tab) %hasPrefix% "[aāäǟ]")
cols <- which (colnames(tab) %hasPrefix% "[sśš]")
tab [rows, cols]
#>                      German_Polish_Spanish
#> German_Polish_Spanish s_s_s s_s_z s_z_z s_š_s š_š_s
#>                 a_a_a     1     1     0     1     1
#>                 a_a_o     0     0     0     0     1
#>                 a_a_u     0     0     0     0     0
#>                 ä_e_e     0     0     0     2     0
#>                 ā_-_-     0     0     1     0     0
#>                 ā_a_a     0     0     0     2     0

# and now let us see what corresponds to a-like vowels in L1
#    and s-like consonants in L2
rows <- which (rownames(tab) %hasSuffix% "[aāäǟ]")
cols <- which (colnames(tab) %hasSuffix% "[sśš]")
tab [rows, cols]
#>                      German_Polish_Spanish
#> German_Polish_Spanish -_-_s s_s_s s_š_s z_s_s z_z_s š_š_s
#>               -_-_a       0     0     0     0     0     0
#>               -_a_a       1     1     0     0     0     0
#>               -_a_ja      0     0     0     0     0     1
#>               -_y_a       1     0     0     0     0     0
#>               a_a_a       1     1     1     1     0     1
#>               jus_o_a     0     0     0     0     0     0
#>               ā_a_a       0     0     2     0     1     0

8.5 lapplyTest()

lapplyTest() is a variant of base::lapply() specifically adjusted for the application of tests of independence. The main difference lies in the handling of warnings and errors.

This function takes a list of contingency tables, such as those generated by allCooccs() (see subsection Practice in section Contingency tables), and applies to each of its elements the function given in fun. By default, this is chisq.test(), but any other test can be used, so long as its output contains an element named p.value. The result is a list of the outputs of fun; if a warning or an error was produced, it is attached to the corresponding element as an attribute. Additional arguments for fun can also be passed in a call to lapplyTest().

Technically, the output is of class list.lapplyTest. It can be passed to summary() to sift through the results and only print the ones with the p-value below the specified threshold (the default is 0.05). Those tests which produced a warning are prefixed with an exclamation mark.

# let us prepare the tables
tabs <- allCooccs (d.abc, bin=F)

# and apply the chi-squared test to them
chisq <- lapplyTest (tabs)
chisq
#> $`-`
#> 
#>  Chi-squared test for given probabilities
#> 
#> data:  tab
#> X-squared = 6, df = 5, p-value = 0.3062
#> 
#> 
#> $a
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  tab
#> X-squared = 7.7467, df = 6, p-value = 0.2573
#> 
#> 
#> $b
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  tab
#> X-squared = 7.1944, df = 4, p-value = 0.126
#> 
#> 
#> $c
#> 
#>  Chi-squared test for given probabilities
#> 
#> data:  tab
#> X-squared = 6.5714, df = 5, p-value = 0.2545
#> 
#> 
#> attr(,"class")
#> [1] "list.lapplyTest"

# this is only an example on a tiny dataset, so let us be more forgiving
summary (chisq, p.value=0.3)
#> Total results: 4; with p-value ≤ 0.3: 3.
#> ! a: p-value = 0.257
#> ! b: p-value = 0.126
#> ! c: p-value = 0.255

# let us see the problems with ‘a’
attr (chisq$a, "error")
#> NULL
attr (chisq$a, "warning")
#> <simpleWarning in fun(tab, ...): Chi-squared approximation may be incorrect>

# this warning often means that the data were insufficient
tabs$a
#>      L1_L2
#> L1_L2 -_ə b_b b_w c_c
#>   a_a   2   4   0   4
#>   a_o   0   1   0   1
#>   a_u   0   0   1   1
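
Since the only requirement on fun is that its output contain an element named p.value, other tests can be substituted directly. For example, with counts as low as the above, Fisher’s exact test might be considered (a sketch only; whether it is appropriate here is for the user to judge):

# fisher.test() also returns an element named ‘p.value’,
#    so it can be passed as ‘fun’ without any wrapping
fisher <- lapplyTest (tabs, fun=fisher.test)
summary (fisher, p.value=0.3)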

8.6 loadSampleDataset()

Due to technical limitations of R and CRAN, primarily to do with encoding, the sample datasets provided by soundcorrs cannot be stored in the preloaded form (non-ASCII characters). They also cannot be automatically loaded when soundcorrs is attached (staged install prevents this kind of usage of system.file()), and they cannot be included in full in the source files, even with Unicode characters escaped, because Windows does not know how to convert those to the native encoding. It seems that the only half-convenient way of making Unicode datasets available is through a separate function that loads them on the user’s request. loadSampleDataset() is such a function.

loadSampleDataset() takes only one argument, x, which can be one of the following values:

  • soundchange objects: ‘change-dl2l’ (the *l, *dl > *l merger in Slavic, regressively), ‘change-palatalization’ (the first palatalization in Slavic, progressively), ‘change-rhotacism’ (rhotacism in Latin, progressively);
  • soundcorrs objects: ‘data-abc’ (entirely made up), ‘data-capitals’ (EU capitals in German, Polish, and Spanish), ‘data-ie’ (a dozen Indo-European examples from Campbell’s Historical Linguistics, see above);
  • transcription objects: ‘trans-common’ (a fragment of the traditional continental transcription), ‘trans-ipa’ (a fragment of the IPA).

# load a transcription
tmp <- loadSampleDataset ("trans-common")

# it's the same one that we've already loaded above
identical (tmp, trans.com)
#> [1] TRUE

8.7 long2wide()

long2wide(), together with wide2long() (below), is used to convert data frames between the “long format” and the “wide format” (see subsection Linguistic datasets in section Data preparation). Of the two, long2wide() is particularly useful because segmentation tends to be easier for humans to perform in the “long format”, which is therefore preferable for storing data, while the “wide format” is used internally and required by soundcorrs.

During the conversion, the number of columns is almost doubled (while the number of rows is halved), but because it is unwise to have duplicate column names, the columns are given suffixes, which are taken from the values in the LANGUAGE column. The name of the column used for this purpose can be changed with the col.lang argument.

Some attributes pertain to only one word in a pair, while others pertain to the pair as a whole. In the “long format” the latter have to be repeated, but in the “wide format” this is not necessary. long2wide() allows certain columns to be excluded from the conversion, using the skip argument.

# the “abc” dataset is in the long format
abc.long <- read.table (path.abc, header=T)

# the simplest conversion unnecessarily doubles the ID column
long2wide (abc.long)
#>   ID.L1 DIALECT.L1 ALIGNED.L1 ORTHOGRAPHY.L1 ID.L2 DIALECT.L2 ALIGNED.L2
#> 1     1        std      a|b|c            abc     1        std      a|b|c
#> 2     2        std    a|b|a|c           abac     2        std    a|b|a|c
#> 3     3        std      a|b|c            abc     3      north      o|b|c
#> 4     4        std    a|b|a|c           abac     4      north    u|w|u|c
#> 5     5        std    a|b|c|-            abc     5      south    a|b|c|ə
#> 6     6        std  a|b|a|c|-           abac     6      south  a|b|a|c|ə
#>   ORTHOGRAPHY.L2
#> 1            abc
#> 2           abac
#> 3           aobc
#> 4           uwuc
#> 5           abca
#> 6          abaca

# but this can be avoided with the ‘skip’ argument
abc.wide <- long2wide (abc.long, skip="ID")
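
The col.lang argument was not needed above because the sample dataset keeps the language names in the default LANGUAGE column. A brief sketch of its use, with the column renamed to the invented name LECT for the sake of the example:

# the same conversion, with the language names in a custom column
tmp <- abc.long
colnames (tmp) [colnames(tmp)=="LANGUAGE"] <- "LECT"
long2wide (tmp, col.lang="LECT", skip="ID")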

8.8 ngrams()

ngrams() turns a vector of words into a list of n-grams, or a table of their frequencies. The first argument is the vector of words; the second is n, the length of the n-grams to extract (defaults to 1); and the last is as.table, which determines whether the output is a list of n-grams or a table of their frequencies (defaults to TRUE).

Two more arguments are available. borders is a vector of two character strings: the first to be prepended to all the words, and the second to be appended to them. This way it is clear which n-grams were in the initial, and which in the final position inside the word. borders defaults to a vector of two empty strings. Lastly, rm is a string of characters that are to be removed from the words before they are cut into n-grams. For instance, to remove all linguistic zeros use rm="-", and to remove zeros and segment separators, use rm="[-\\|]".
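
A sketch combining the two arguments, run on the aligned column of the “abc” dataset (the ‘#’ boundary marker is merely a convention; any string will do):

# count bigrams with word boundaries marked, after removing
#    linguistic zeros and segment separators
ngrams (d.abc$data[,"ALIGNED.L1"], n=2, borders=c("#","#"), rm="[-\\|]")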

# with n==1, ngrams() returns simply the frequencies of segments
ngrams (d.cap$data[,"ORTHOGRAPHY.Spanish"])
#> 
#>  A  B  C  D  E  H  L  M  N  P  R  S  T  V  Z  _  a  b  c  d  e  f  g  h  i  k 
#>  1  5  2  1  1  1  4  1  1  2  2  1  1  4  1  3 30  5  3  7 14  1  5  1 15  1 
#>  l  m  n  o  p  r  s  t  u  v  x  Á  í 
#> 11  5  9 10  2 11 13  7  9  2  1  1  4

# counts can easily be turned into a data frame with ranks
tab <- ngrams (d.cap$data[,"ORTHOGRAPHY.Spanish"], n=2)
mtx <- as.matrix (sort(tab,decreasing=T))
head (data.frame (RANK=1:length(mtx), COUNT=mtx, FREQ=mtx/sum(mtx)))
#>    RANK COUNT       FREQ
#> na    1     4 0.02339181
#> st    2     4 0.02339181
#> ag    3     3 0.01754386
#> ar    4     3 0.01754386
#> da    5     3 0.01754386
#> en    6     3 0.01754386

8.9 subset()

subset() does what its name suggests, i.e. it subsets a dataset using the provided condition. It returns a new soundcorrs object.

# select only examples from L2’s northern dialect
subset (d.abc, DIALECT.L2=="north") $data
#>   ID DIALECT.L1 ALIGNED.L1 ORTHOGRAPHY.L1 DIALECT.L2 ALIGNED.L2 ORTHOGRAPHY.L2
#> 3  3        std      a|b|c            abc      north      o|b|c           aobc
#> 4  4        std    a|b|a|c           abac      north    u|w|u|c           uwuc

# select only capitals of countries where German is an official language
subset (d.cap, grepl("German",d.cap$data$OFFICIAL.LANGUAGE)) $data
#>           ALIGNED.German ORTHOGRAPHY.German        ALIGNED.Polish
#> 3            b|ä|r|l|ī|n             Berlin           b|e|r|l|i|n
#> 5      b|r|ü|-|s|ə|l|-|-            Brüssel     b|r|u|k|s|e|l|a|-
#> 13 l|u|k|s|ə|m|b|u|r|k|-          Luxemburg l|u|k|s|e|m|b|u|r|k|-
#> 26         v|ī|-|-|-|n|-               Wien         ẃ|-|e|d|e|ń|-
#>    ORTHOGRAPHY.Polish       ALIGNED.Spanish  ORTHOGRAPHY.Spanish
#> 3              Berlin           b|e|r|l|i|n               Berlín
#> 5            Bruksela     b|r|u|-|s|e|l|a|s             Bruselas
#> 13         Luksemburg l|u|k|s|e|m|b|u|r|γ|o Ciudad_de_Luxemburgo
#> 26             Wiedeń         b|j|e|-|-|n|a                Viena
#>              OFFICIAL.LANGUAGE
#> 3                       German
#> 5          Dutch,French,German
#> 13 Luxembourgish,French,German
#> 26                      German

# select only pairs in which L1 a : L2 a
subset (d.abc, findPairs(d.abc,"a","a")$which) $data
#>   ID DIALECT.L1 ALIGNED.L1 ORTHOGRAPHY.L1 DIALECT.L2 ALIGNED.L2 ORTHOGRAPHY.L2
#> 1  1        std      a|b|c            abc        std      a|b|c            abc
#> 2  2        std    a|b|a|c           abac        std    a|b|a|c           abac
#> 5  5        std    a|b|c|-            abc      south    a|b|c|ə           abca
#> 6  6        std  a|b|a|c|-           abac      south  a|b|a|c|ə          abaca

8.10 wide2long()

wide2long() is simply the inverse of long2wide() (above). The conversion may not be perfect, as the order of the columns may change.

In long2wide(), suffixes were taken from the values in the LANGUAGE column; this time they must be specified explicitly. They will be stored in a column whose name is defined by the argument col.lang, and defaults to LANGUAGE. However, the string that separates column names from suffixes will not be removed by default; to strip it, the argument strip needs to be set to the length of the separator.

# let us use the converted “abc” dataset
abc.wide
#>   ID DIALECT.L1 ALIGNED.L1 ORTHOGRAPHY.L1 DIALECT.L2 ALIGNED.L2 ORTHOGRAPHY.L2
#> 1  1        std      a|b|c            abc        std      a|b|c            abc
#> 2  2        std    a|b|a|c           abac        std    a|b|a|c           abac
#> 3  3        std      a|b|c            abc      north      o|b|c           aobc
#> 4  4        std    a|b|a|c           abac      north    u|w|u|c           uwuc
#> 5  5        std    a|b|c|-            abc      south    a|b|c|ə           abca
#> 6  6        std  a|b|a|c|-           abac      south  a|b|a|c|ə          abaca

# with the separator preserved
wide2long (abc.wide, c(".L1",".L2"))
#>      ALIGNED DIALECT ORTHOGRAPHY ID LANGUAGE
#> 1      a|b|c     std         abc  1      .L1
#> 2    a|b|a|c     std        abac  2      .L1
#> 3      a|b|c     std         abc  3      .L1
#> 4    a|b|a|c     std        abac  4      .L1
#> 5    a|b|c|-     std         abc  5      .L1
#> 6  a|b|a|c|-     std        abac  6      .L1
#> 7      a|b|c     std         abc  1      .L2
#> 8    a|b|a|c     std        abac  2      .L2
#> 9      o|b|c   north        aobc  3      .L2
#> 10   u|w|u|c   north        uwuc  4      .L2
#> 11   a|b|c|ə   south        abca  5      .L2
#> 12 a|b|a|c|ə   south       abaca  6      .L2

# and with the separator removed
wide2long (abc.wide, c(".L1",".L2"), strip=1)
#>      ALIGNED DIALECT ORTHOGRAPHY ID LANGUAGE
#> 1      a|b|c     std         abc  1       L1
#> 2    a|b|a|c     std        abac  2       L1
#> 3      a|b|c     std         abc  3       L1
#> 4    a|b|a|c     std        abac  4       L1
#> 5    a|b|c|-     std         abc  5       L1
#> 6  a|b|a|c|-     std        abac  6       L1
#> 7      a|b|c     std         abc  1       L2
#> 8    a|b|a|c     std        abac  2       L2
#> 9      o|b|c   north        aobc  3       L2
#> 10   u|w|u|c   north        uwuc  4       L2
#> 11   a|b|c|ə   south        abca  5       L2
#> 12 a|b|a|c|ə   south       abaca  6       L2