This vignette summarizes the methods for clustering snow profiles
available in sarp.snowprofile.alignment
from the following
papers:
We demonstrate clustering with the SPgroup2
snowprofile
set which contains 5 snow profiles from a single date.
Many clustering methods in R use a distance matrix of class
dist
as an input. The distanceSP
function
provides several ways to produce a distance matrix from a
snowprofileSet
by making pairwise comparisons of snow
profile similarities. distanceSP
passes arguments to
dtw
and then further to simSP
to configure the
comparisons. We recommend checking the documentation for
simSP
for available approaches to calculate
similarities.
## Distance matrix using default settings
distmat1 <- distanceSP(SPgroup2)
print(distmat1)
#> VIR077561.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00 0.2598810
#> VIR075907.2019-01-21 17:00:00 0.3016338
#> VIR084723.2019-01-21 17:00:00 0.3664091
#> VIR082519.2019-01-21 17:00:00 0.3893727
#> VIR080858.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00
#> VIR075907.2019-01-21 17:00:00 0.3236395
#> VIR084723.2019-01-21 17:00:00 0.3686663
#> VIR082519.2019-01-21 17:00:00 0.3313452
#> VIR075907.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00
#> VIR075907.2019-01-21 17:00:00
#> VIR084723.2019-01-21 17:00:00 0.4225680
#> VIR082519.2019-01-21 17:00:00 0.4405521
#> VIR084723.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00
#> VIR075907.2019-01-21 17:00:00
#> VIR084723.2019-01-21 17:00:00
#> VIR082519.2019-01-21 17:00:00 0.2002722
Pairwise distance calculations can be slow for large datasets. Three options for improved performance include:
n_cores
argument, which activates the parallel
package.symmetric = FALSE
to only compute one of
dtwSP(A, B)
or dtw(B, A)
, potentially
resulting in a loss of accuracy but cutting the number of alignments in
half.fast_summary = TRUE
to compute approximate
distances nearly instantaneously without aligning profiles with dynamic
time warping by comparing summary statistics.## Check for same result when computing distant in parallel on multiple cores
# library(parallel)
# n_cores <- detectCores() - 1
# distmat1b <- distanceSP(SPgroup2, n_cores = n_cores)
# identical(distmat1, distmat1b)
## Fast version of pairwise distances based on summary stats
distmat2 <- distanceSP(SPgroup2, fast_summary = TRUE)
print(distmat1)
#> VIR077561.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00 0.2598810
#> VIR075907.2019-01-21 17:00:00 0.3016338
#> VIR084723.2019-01-21 17:00:00 0.3664091
#> VIR082519.2019-01-21 17:00:00 0.3893727
#> VIR080858.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00
#> VIR075907.2019-01-21 17:00:00 0.3236395
#> VIR084723.2019-01-21 17:00:00 0.3686663
#> VIR082519.2019-01-21 17:00:00 0.3313452
#> VIR075907.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00
#> VIR075907.2019-01-21 17:00:00
#> VIR084723.2019-01-21 17:00:00 0.4225680
#> VIR082519.2019-01-21 17:00:00 0.4405521
#> VIR084723.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00
#> VIR075907.2019-01-21 17:00:00
#> VIR084723.2019-01-21 17:00:00
#> VIR082519.2019-01-21 17:00:00 0.2002722
print(distmat2)
#> VIR077561.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00 0.3185841
#> VIR075907.2019-01-21 17:00:00 0.8113144
#> VIR084723.2019-01-21 17:00:00 0.6276566
#> VIR082519.2019-01-21 17:00:00 0.3443081
#> VIR080858.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00
#> VIR075907.2019-01-21 17:00:00 0.5644011
#> VIR084723.2019-01-21 17:00:00 0.4471012
#> VIR082519.2019-01-21 17:00:00 0.3125985
#> VIR075907.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00
#> VIR075907.2019-01-21 17:00:00
#> VIR084723.2019-01-21 17:00:00 0.3631210
#> VIR082519.2019-01-21 17:00:00 0.7049122
#> VIR084723.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00
#> VIR075907.2019-01-21 17:00:00
#> VIR084723.2019-01-21 17:00:00
#> VIR082519.2019-01-21 17:00:00 0.4924477
For clustering applications, the clusterSP
function
handles the distance calculations, but it is worthwhile understanding
how you can control the distance calculations with the
clusterSPconfig
function, which has an output
args_distance
for passing arguments to the distance
calculations.
config <- clusterSPconfig(simType = 'layerwise', ddate = T)
str(config)
#> List of 6
#> $ args_distance :List of 4
#> ..$ rescale_resample: logi FALSE
#> ..$ simType : chr "layerwise"
#> ..$ dims : chr [1:3] "gtype" "hardness" "ddate"
#> ..$ weights : num [1:3] 0.375 0.125 0.5
#> $ args_centers :List of 3
#> ..$ simType: chr "layerwise"
#> ..$ dims : chr [1:3] "gtype" "hardness" "ddate"
#> ..$ weights: num [1:3] 0.375 0.125 0.5
#> $ args_cluster : list()
#> $ args_simpleweights: list()
#> $ verbose : logi TRUE
#> $ args_fast : Named num [1:10] 2 0 1 0 0 1 0 0 0 0
#> ..- attr(*, "names")= chr [1:10] "w_hs" "w_hn24" "w_h3d" "w_slab" ...
distmat3 <- do.call('distanceSP', c(list(SPgroup2), config$args_distance))
We present three distinct clustering methods:
hclust
.pam
.averageSP
.All methods are applied using the clusterSP
function
with different type
arguments. We use the already computed
a distance matrix distmat1
, however,
clusterSP
can also compute distance matrices if only
provided with a a snowprofileSet
. The output is a list of
class clusterSP
that contains a variety of information
about the clustering solution, which has a plot method
plot.clusterSP
to show the grouping of clustered
profiles.
Hierarchical clustering organizes data into a tree of nested
clusters. Agglomerative hierarchical clustering begins with each profile
as a separate cluster and then iteratively merges the closest clusters
until all are combined into one. This process forms a dendrogram
representing the data’s hierarchical structure. The desired number of
clusters can be set by specifying the number of groups k or by setting the threshold height
for cutting the tree. The method is implemented using the
stats::hclust
function.
Partitioning around medoids (PAM) is a partition-based clustering
method, where data points are assigned to a predetermined number of
distinct clusters k. Data
points are iteratively assigned to clusters to minimizing the sum of
squared distances between data points and the cluster centers. PAM uses
actual data points as centers (medoids), as opposed to common k-means
clustering that uses an average of data points as the center
(centroids). The method is implemented using the
cluster::pam
function.
Horton et al. (submitted) use a fuzzy variant of PAM where data
points are assign partial membership values to each cluster, which can
be done with the cluster::fanny
function. Note the example
snowprofileSet
in this vignette does not have enough data
points for fanny clustering.
k-dimensional barycenter averaging (kdba)) is a variant of k-means clustering that operates directly on sequences or time series data (Petitjean et al., 2011). It computes the barycenter (centroid) of each cluster based on the average dissimilarity between the objects in the cluster and assigns each object to the closest barycenter. For snow profiles, the cluster barycenters are represented by the average snow profile of each using the dynamic time warping barycenter averaging (DBA) method described by Herla et al. (2022).
An initial clustering condition (which can be random or based on a ‘sophisticated guess’) is iteratively refined by assigning individual profiles to the most similar cluster and at the end of every iteration recomputing the cluster centroids. The result is sensitive to the initial conditions. An advantage of this method is it considered the stratigraphy of the profiles in greater details, but a disadvantage is that iterating over different values of k is slow.
Producing representative profiles for each cluster can be useful. You
can compute these profiles as centroids using averageSP
or
as medoids using medoidSP
. Depending on the clustering
method, these may already be computed and included in the output of
clusterSP
as medoid indices ($id.med
) or a
snowprofileSet
of centroid profiles
($centroids
). To explicitly request specific representative
profiles, use the centers
argument with options ‘medoids’,
‘centroids’, ‘both’, or ‘none’. The plot method for
clusterSP
can plot the representative profile beneath the
cluster groups.
Horton et al. (submitted) apply clustering to forecast regions by manipulating the distance matrix to also account for spatial and temporal criteria. Here is an example modifying the distance matrix based on an additional criteria: is the top layer in the profile surface hoar?
## A binary vector identifying which profiles have SH on the surface
sh <- sapply(SPgroup2, function(x) x$layers$gtype[nrow(x$layers)] == 'SH')
## Construct a distance matrix
distmat_sh <- dist(sh)
## Create weighted average
distmat <- 0.25 * distmat1 + 0.75 * distmat_sh
cl_sh <- clusterSP(SPx = SPgroup2, k = 2, type = 'hclust', distmat = distmat)
plot(cl_sh)