
Similarity and Distance Measures in proxyC

Kohei Watanabe

2024-04-07

This vignette explains how proxyC computes the similarity and distance measures.

Notation

$$\vec{x} = [x_1, x_2, \dots, x_n] \\ \vec{y} = [y_1, y_2, \dots, y_n]$$

The length of the vectors is $n = \|\vec{x}\| = \|\vec{y}\|$, while $|\vec{x}|$ is the element-wise absolute value.

Operations on vectors are element-wise:

$$\vec{z} = \vec{x} \vec{y} \\ n = \|\vec{x}\| = \|\vec{y}\| = \|\vec{z}\|$$

Summation of the elements of vectors is written using sigma without specifying the range:

$$\sum{\vec{x}} = \sum_{i=1}^{n} x_i$$

When the elements of a vector are compared with a value inside a pair of square brackets, the summation counts the number of elements that are equal (or unequal) to the value:

$$\sum{[\vec{x} = 1]} = \sum_{i=1}^{n} [x_i = 1]$$

Similarity Measures

Similarity measures are available in proxyC::simil().
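For illustration (not part of the package documentation), the snippets below apply each measure to two short vectors stacked into a two-row sparse matrix; the names x, y and mt are arbitrary, and proxyC compares rows by default (margin = 1):

```r
library(proxyC)
library(Matrix)

x <- c(1, 0, 2, 3)
y <- c(2, 1, 0, 3)
mt <- Matrix(rbind(x, y), sparse = TRUE)  # proxyC operates on sparse matrices
simil(mt, method = "cosine")              # 2 x 2 symmetric similarity matrix
```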

Cosine similarity (“cosine”)

$$simil = \frac{\sum{\vec{x} \vec{y}}}{\sqrt{\sum{\vec{x}^2}} \sqrt{\sum{\vec{y}^2}}}$$
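As a sketch, the formula can be transcribed in base R and compared with the [1, 2] cell of the proxyC result, using the x, y and mt defined above:

```r
sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))  # cosine by the formula above
simil(mt, method = "cosine")[1, 2]              # similarity between the two rows
```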

Pearson correlation coefficient (“correlation”)

$$simil = \frac{Cov(\vec{x}, \vec{y})}{\sqrt{Var(\vec{x}) Var(\vec{y})}}$$
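Equivalently in base R (this is the ordinary Pearson correlation, cor(x, y)):

```r
cov(x, y) / sqrt(var(x) * var(y))        # correlation by the formula above
simil(mt, method = "correlation")[1, 2]
```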

Jaccard similarity (“jaccard” and “ejaccard”)

The values of $\vec{x}$ and $\vec{y}$ are Boolean for “jaccard”.

$$e = \sum{\vec{x} \vec{y}} \\ w = \text{user-provided weight} \\ simil = \frac{e}{\sum{\vec{x}^w} + \sum{\vec{y}^w} - e}$$
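A sketch with Boolean vectors (xb, yb and mb are illustrative names), transcribing the formula with w = 1:

```r
xb <- c(1, 0, 1, 1)
yb <- c(1, 1, 0, 1)
mb <- Matrix(rbind(xb, yb), sparse = TRUE)
e <- sum(xb * yb)                      # co-occurrences
e / (sum(xb) + sum(yb) - e)            # Jaccard by the formula (w = 1)
simil(mb, method = "jaccard")[1, 2]
```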

Fuzzy Jaccard similarity (“fjaccard”)

The values must be $0 \le \vec{x} \le 1.0$ and $0 \le \vec{y} \le 1.0$.

$$simil = \frac{\sum{\min(\vec{x}, \vec{y})}}{\sum{\max(\vec{x}, \vec{y})}}$$
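A sketch with vectors bounded by 0 and 1 (xf, yf and mf are illustrative):

```r
xf <- c(0.2, 0.0, 0.7, 1.0)
yf <- c(0.5, 0.3, 0.0, 1.0)
mf <- Matrix(rbind(xf, yf), sparse = TRUE)
sum(pmin(xf, yf)) / sum(pmax(xf, yf))   # fuzzy Jaccard by the formula
simil(mf, method = "fjaccard")[1, 2]
```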

Dice similarity (“dice” and “edice”)

The values of $\vec{x}$ and $\vec{y}$ are Boolean for “dice”.

$$e = \sum{\vec{x} \vec{y}} \\ w = \text{user-provided weight} \\ simil = \frac{2 e}{\sum{\vec{x}^w} + \sum{\vec{y}^w}}$$
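Reusing the Boolean vectors from the Jaccard example, again with w = 1:

```r
e <- sum(xb * yb)
2 * e / (sum(xb) + sum(yb))        # Dice by the formula (w = 1)
simil(mb, method = "dice")[1, 2]
```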

Hamann similarity (“hamann”)

$$e = \sum{\vec{x} \vec{y}} \\ n = \|\vec{x}\| = \|\vec{y}\| \\ u = n - e \\ simil = \frac{e - u}{e + u}$$
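A direct transcription of the formula above (a sketch, not the package's internal code), again with the Boolean vectors:

```r
e <- sum(xb * yb)
u <- length(xb) - e                  # u = n - e
(e - u) / (e + u)                    # Hamann by the formula
simil(mb, method = "hamann")[1, 2]
```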

Faith similarity (“faith”)

$$t = \sum{[\vec{x} = 1][\vec{y} = 1]} \\ f = \sum{[\vec{x} = 0][\vec{y} = 0]} \\ n = \|\vec{x}\| = \|\vec{y}\| \\ simil = \frac{t + 0.5 f}{n}$$
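Transcribed the same way:

```r
t <- sum(xb == 1 & yb == 1)           # true matches
f <- sum(xb == 0 & yb == 0)           # false matches
(t + 0.5 * f) / length(xb)            # Faith by the formula
simil(mb, method = "faith")[1, 2]
```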

Simple matching (“matching”)

$$simil = \sum{[\vec{x} = \vec{y}]}$$
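Counting the matching elements directly:

```r
sum(xb == yb)                            # simple matching by the formula
simil(mb, method = "matching")[1, 2]
```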

Distance Measures

Distance measures are available in proxyC::dist(). Smoothing of the vectors can be performed when method is “chisquared”, “kullback”, “jeffreys” or “jensen”: the value of smooth is added to each element of $\vec{x}$ and $\vec{y}$.
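A sketch reusing mt from above; smooth is only meaningful for the four divergence measures listed, and proxyC::dist is written in full here to distinguish it from stats::dist:

```r
proxyC::dist(mt, method = "euclidean")               # 2 x 2 distance matrix
proxyC::dist(mt, method = "chisquared", smooth = 1)  # 1 is added to every element first
```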

Manhattan distance (“manhattan”)

$$dist = \sum{|\vec{x} - \vec{y}|}$$
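Transcribed in base R:

```r
sum(abs(x - y))                                # Manhattan by the formula
proxyC::dist(mt, method = "manhattan")[1, 2]
```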

Canberra distance (“canberra”)

$$dist = \sum{\frac{|\vec{x} - \vec{y}|}{|\vec{x}| + |\vec{y}|}}$$
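Transcribed the same way (note that in this base R sketch, elements where both vectors are zero would produce NaN from 0/0):

```r
sum(abs(x - y) / (abs(x) + abs(y)))            # Canberra by the formula
proxyC::dist(mt, method = "canberra")[1, 2]
```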

Euclidean distance (“euclidean”)

$$dist = \sqrt{\sum{(\vec{x} - \vec{y})^2}}$$
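In base R:

```r
sqrt(sum((x - y)^2))                           # Euclidean by the formula
proxyC::dist(mt, method = "euclidean")[1, 2]
```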

Minkowski distance (“minkowski”)

$$p = \text{user-provided parameter} \\ dist = \left( \sum{|\vec{x} - \vec{y}|^p} \right)^{\frac{1}{p}}$$
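A sketch with p = 3, assuming the exponent is supplied via the p argument of proxyC::dist() (p = 2 reduces to the Euclidean distance):

```r
p <- 3
sum(abs(x - y)^p)^(1 / p)                            # Minkowski by the formula
proxyC::dist(mt, method = "minkowski", p = 3)[1, 2]
```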

Hamming distance (“hamming”)

$$dist = \sum{[\vec{x} \ne \vec{y}]}$$
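Counting the differing elements:

```r
sum(x != y)                                   # Hamming by the formula
proxyC::dist(mt, method = "hamming")[1, 2]
```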

The largest difference between values (“maximum”)

$$dist = \max{|\vec{x} - \vec{y}|}$$
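In base R:

```r
max(abs(x - y))                               # maximum by the formula
proxyC::dist(mt, method = "maximum")[1, 2]
```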

Chi-squared divergence (“chisquared”)

$$O_{ij} = \text{augmented matrix from } \vec{x} \text{ and } \vec{y} \\ E_{ij} = \text{matrix of expected counts for } O_{ij} \\ dist = \sum{\frac{(O_{ij} - E_{ij})^2}{E_{ij}}}$$
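A sketch of the bookkeeping: the two smoothed vectors form the rows of O, and E is the usual expected-count matrix built from the row and column totals:

```r
O <- rbind(x, y) + 1                          # augmented matrix, smooth = 1
E <- outer(rowSums(O), colSums(O)) / sum(O)   # expected counts
sum((O - E)^2 / E)                            # chi-squared by the formula
proxyC::dist(mt, method = "chisquared", smooth = 1)[1, 2]
```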

Kullback–Leibler divergence (“kullback”)

$$\vec{p} = \frac{\vec{x}}{\sum{\vec{x}}} \\ \vec{q} = \frac{\vec{y}}{\sum{\vec{y}}} \\ dist = \sum{\vec{q} \log_2{\frac{\vec{q}}{\vec{p}}}}$$
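A sketch with smoothing (smooth = 1) so that no element of the normalized vectors is zero; the measure is asymmetric, so the [1, 2] and [2, 1] cells of the result differ:

```r
p <- (x + 1) / sum(x + 1)
q <- (y + 1) / sum(y + 1)
sum(q * log2(q / p))                               # KL divergence by the formula
proxyC::dist(mt, method = "kullback", smooth = 1)  # both directions shown
```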

Jeffreys divergence (“jeffreys”)

$$\vec{p} = \frac{\vec{x}}{\sum{\vec{x}}} \\ \vec{q} = \frac{\vec{y}}{\sum{\vec{y}}} \\ dist = \sum{\vec{q} \log_2{\frac{\vec{q}}{\vec{p}}}} + \sum{\vec{p} \log_2{\frac{\vec{p}}{\vec{q}}}}$$
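Jeffreys divergence symmetrizes the Kullback–Leibler divergence by adding the two directions:

```r
p <- (x + 1) / sum(x + 1)
q <- (y + 1) / sum(y + 1)
sum(q * log2(q / p)) + sum(p * log2(p / q))              # Jeffreys by the formula
proxyC::dist(mt, method = "jeffreys", smooth = 1)[1, 2]
```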

Jensen-Shannon divergence (“jensen”)

$$\vec{p} = \frac{\vec{x}}{\sum{\vec{x}}} \\ \vec{q} = \frac{\vec{y}}{\sum{\vec{y}}} \\ \vec{m} = \frac{1}{2} (\vec{p} + \vec{q}) \\ dist = \frac{1}{2} \sum{\vec{q} \log_2{\frac{\vec{q}}{\vec{m}}}} + \frac{1}{2} \sum{\vec{p} \log_2{\frac{\vec{p}}{\vec{m}}}}$$
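Here each distribution is compared against the mixture m, which keeps the measure symmetric:

```r
p <- (x + 1) / sum(x + 1)
q <- (y + 1) / sum(y + 1)
m <- (p + q) / 2
0.5 * sum(q * log2(q / m)) + 0.5 * sum(p * log2(p / m))  # Jensen-Shannon by the formula
proxyC::dist(mt, method = "jensen", smooth = 1)[1, 2]
```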
