R package dbscan - Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms

CRAN version stream r-universe status CRAN RStudio mirror downloads Anaconda.org

Introduction

This R package provides a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data. The package includes:

Clustering

Outlier Detection

Fast Nearest-Neighbor Search (using kd-trees)

The implementations use the kd-tree data structure (from library ANN) for faster k-nearest neighbor search, and are typically faster than the native R implementations (e.g., dbscan in package fpc), or the implementations in WEKA, ELKI and Python’s scikit-learn.

The following R packages use dbscan: AFM, bioregion, CIDER, CLONETv2, ClustAssess, cordillera, CPC, crosshap, daltoolbox, DDoutlier, diceR, dobin, doc2vec, dPCP, EHRtemporalVariability, eventstream, FCPS, fdacluster, FORTLS, funtimes, FuzzyDBScan, ksharp, ktaucenters, LOMAR, maotai, metaCluster, mlr3cluster, MOSS, oclust, openSkies, opticskxi, pagoda2, parameters, ParBayesianOptimization, performance, pguIMP, rMultiNet, seriation, sfdep, sfnetworks, sharp, shipunov, smotefamily, snap, spdep, spNetwork, squat, ssMRCD, stream, supc, synr, tidySEM, ts2net

Please cite the use of this package as:

To cite dbscan in publications use:

Hahsler M, Piekenbrock M, Doran D (2019). “dbscan: Fast Density-Based Clustering with R.” Journal of Statistical Software, 91(1), 1-30. doi:10.18637/jss.v091.i01 https://doi.org/10.18637/jss.v091.i01.

@Article{,
  title = {{dbscan}: Fast Density-Based Clustering with {R}},
  author = {Michael Hahsler and Matthew Piekenbrock and Derek Doran},
  journal = {Journal of Statistical Software},
  year = {2019},
  volume = {91},
  number = {1},
  pages = {1--30},
  doi = {10.18637/jss.v091.i01},
}

Installation

Stable CRAN version: Install from within R with

install.packages("dbscan")

Current development version: Install from r-universe.

install.packages("dbscan", repos = "https://mhahsler.r-universe.dev")

Usage

Load the package and use the numeric variables in the iris dataset

library("dbscan")

data("iris")
x <- as.matrix(iris[, 1:4])

DBSCAN

db <- dbscan(x, eps = 0.4, minPts = 4)
db
## DBSCAN clustering for 150 objects.
## Parameters: eps = 0.4, minPts = 4
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 4 cluster(s) and 25 noise points.
## 
##  0  1  2  3  4 
## 25 47 38 36  4 
## 
## Available fields: cluster, eps, minPts, dist, borderPoints

Visualize the resulting clustering (noise points are shown in black).

pairs(x, col = db$cluster + 1L)

OPTICS

opt <- optics(x, eps = 1, minPts = 4)
opt
## OPTICS ordering/clustering for 150 objects.
## Parameters: minPts = 4, eps = 1, eps_cl = NA, xi = NA
## Available fields: order, reachdist, coredist, predecessor, minPts, eps,
##                   eps_cl, xi

Extract DBSCAN-like clustering from OPTICS and create a reachability plot (extracted DBSCAN clusters at eps_cl=.4 are colored)

opt <- extractDBSCAN(opt, eps_cl = 0.4)
plot(opt)

HDBSCAN

hdb <- hdbscan(x, minPts = 4)
hdb
## HDBSCAN clustering for 150 objects.
## Parameters: minPts = 4
## The clustering contains 2 cluster(s) and 0 noise points.
## 
##   1   2 
## 100  50 
## 
## Available fields: cluster, minPts, coredist, cluster_scores,
##                   membership_prob, outlier_scores, hc

Visualize the hierarchical clustering as a simplified tree. HDBSCAN finds 2 stable clusters.

plot(hdb, show_flat = TRUE)

Using dbscan from Python

R, R package dbscan, and Python package rpy2 need to be installed.

import pandas as pd
import numpy as np

### prepare data
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
                   header = None, 
                   names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species'])
iris_numeric = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]

# get R dbscan package
from rpy2.robjects import packages
dbscan = packages.importr('dbscan')

# enable automatic conversion of pandas dataframes to R dataframes
from rpy2.robjects import pandas2ri
pandas2ri.activate()

db = dbscan.dbscan(iris_numeric, eps = 0.5, MinPts = 5)
print(db)
## DBSCAN clustering for 150 objects.
## Parameters: eps = 0.5, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 2 cluster(s) and 17 noise points.
## 
##  0  1  2 
## 17 49 84 
## 
## Available fields: cluster, eps, minPts, dist, borderPoints
# get the cluster assignment vector
labels = np.array(db.rx('cluster'))
labels
## array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
##         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
##         1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,
##         2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,
##         2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0,
##         2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,
##         2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]],
##       dtype=int32)

License

The dbscan package is licensed under the GNU General Public License (GPL) Version 3. The OPTICSXi R implementation was directly ported from the ELKI framework’s Java implementation (GNU AGPLv3), with permission by the original author, Erich Schubert.

Changes

References