distance()The distance() function implemented in
philentropy is able to compute 46 different
distances/similarities between probability density functions (see
?philentropy::distance for details).
The distance() function is implemented using the same
logic as R’s base function stats::dist() and takes
a matrix or data.frame object as input. The
corresponding matrix or data.frame should
store probability density functions (as rows) for which distance
computations should be performed.
# define a probability density function P
P <- 1:10/sum(1:10)
# define a probability density function Q
Q <- 20:29/sum(20:29)
# combine P and Q as matrix object
x <- rbind(P,Q)Please note that when defining a matrix from vectors,
probability vectors should be combined as rows
(rbind()).
library(philentropy)
# compute the Euclidean Distance with default parameters
distance(x, method = "euclidean")euclidean 
0.1280713For this simple case you can compare the results with R’s base
function to compute the euclidean distance
stats::dist().
          P
Q 0.1280713However, the R base function stats::dist() only computes
the following distance measures:
"euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski",
whereas distance() allows you to choose from 46
distance/similarity measures and when selecting the native distance
functions underlying distance() users can speed up their
computations 3-5x.
To find out which methods are implemented in
distance() you can consult the
getDistMethods() function.
 [1] "euclidean"         "manhattan"         "minkowski"         "chebyshev"        
 [5] "sorensen"          "gower"             "soergel"           "kulczynski_d"     
 [9] "canberra"          "lorentzian"        "intersection"      "non-intersection" 
[13] "wavehedges"        "czekanowski"       "motyka"            "kulczynski_s"     
[17] "tanimoto"          "ruzicka"           "inner_product"     "harmonic_mean"    
[21] "cosine"            "hassebrook"        "jaccard"           "dice"             
[25] "fidelity"          "bhattacharyya"     "hellinger"         "matusita"         
[29] "squared_chord"     "squared_euclidean" "pearson"           "neyman"           
[33] "squared_chi"       "prob_symm"         "divergence"        "clark"            
[37] "additive_symm"     "kullback-leibler"  "jeffreys"          "k_divergence"     
[41] "topsoe"            "jensen-shannon"    "jensen_difference" "taneja"           
[45] "kumar-johnson"     "avg"Now you can choose any distance/similarity method that
serves you.
 jaccard 
0.133869Analogously, in case a probability matrix is specified the following output is generated.
# combine three probabilty vectors to a probabilty matrix 
ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29),30:39/sum(30:39))
rownames(ProbMatrix) <- paste0("Example", 1:3)
# compute the euclidean distance between all 
# pairwise comparisons of probability vectors
distance(ProbMatrix, method = "euclidean")#> Metric: 'euclidean'; comparing: 3 vectors.
          v1         v2         v3
v1 0.0000000 0.12807130 0.13881717
v2 0.1280713 0.00000000 0.01074588
v3 0.1388172 0.01074588 0.00000000Alternatively, users can specify the argument
use.row.names = TRUE to maintain the rownames of the input
matrix and pass them as rownames and colnames to the output distance
matrix.
# compute the euclidean distance between all 
# pairwise comparisons of probability vectors
distance(ProbMatrix, method = "euclidean", use.row.names = TRUE)#> Metric: 'euclidean'; comparing: 3 vectors.
          Example1   Example2   Example3
Example1 0.0000000 0.12807130 0.13881717
Example2 0.1280713 0.00000000 0.01074588
Example3 0.1388172 0.01074588 0.00000000This output differs from the output of
stats::dist().
# compute the euclidean distance between all 
# pairwise comparisons of probability vectors
# using stats::dist()
stats::dist(ProbMatrix, method = "euclidean")           1          2
2 0.12807130           
3 0.13881717 0.01074588Whereas distance() returns a symmetric distance matrix,
stats::dist() returns only one part of the symmetric
matrix.
However, users can also specify the argument
as.dist.obj = TRUE in philentropy::distance()
to retrieve a philentropy::distance() output which is an
object of type stats::dist().
ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29),30:39/sum(30:39))
rownames(ProbMatrix) <- paste0("test", 1:3)
distance(ProbMatrix, method = "euclidean", use.row.names = TRUE, as.dist.obj = TRUE)Metric: 'euclidean'; comparing: 3 vectors.
           test1      test2
test2 0.12807130           
test3 0.13881717 0.01074588Now let’s compare the run times of base R and
philentropy. For this purpose you need to install the
microbenchmark package.
Note: Please make sure to insert vector objects (in our example P, Q) when directly running the low-level functions such as
euclidean()etc. Otherwise, computational overheads are produced that significantly slow down computations when using large vectors.
# install.packages("microbenchmark")
library(microbenchmark)
microbenchmark(
  distance(x,method = "euclidean", test.na = FALSE),
  dist(x,method = "euclidean"),
  euclidean(P, Q, FALSE)
)Unit: microseconds
                                               expr    min      lq     mean  median      uq    max neval
 distance(x, method = "euclidean", test.na = FALSE) 26.518 28.3495 29.73174 29.2210 30.1025 62.096   100
                      dist(x, method = "euclidean") 11.073 12.9375 14.65223 14.3340 15.1710 65.130   100
                   euclidean(P, Q, FALSE)  4.329  4.9605  5.72378  5.4815  6.1240 22.510   100As you can see, although the distance() function is
quite fast, the internal checks cause it to be 2x slower than the base
dist() function (for the euclidean example).
Nevertheless, in case you need to implement a faster version of the
corresponding distance measure you can type philentropy::
and then TAB allowing you to select the base distance
computation functions (written in C++),
e.g. philentropy::euclidean() which is almost 3x faster
than the base dist() function.
The advantage of distance() is that it implements 46
distance measures based on base C++ functions that can be accessed
individually by typing philentropy:: and then
TAB. In future versions of philentropy I will
optimize the distance() function so that internal checks
for data type correctness and correct input data will take less
termination time than the base dist() function.