fviz_nbclust {factoextra} | R Documentation |
Partitioning methods, such as k-means clustering, require the user to specify the number of clusters to be generated.
fviz_nbclust(): Determines and visualizes the optimal number of clusters using different methods: within-cluster sums of squares, average silhouette and gap statistic.
fviz_gap_stat(): Visualizes the gap statistic generated by the function clusGap() [in cluster package]. The optimal number of clusters is specified using the "firstmax" method (?cluster::clusGap).
Read more: Determining the optimal number of clusters
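As a rough illustration of what the "wss" and "silhouette" criteria measure, the sketch below computes both by hand for k-means on the scaled iris data. It uses only base R and the cluster package. The range 1:10 and nstart = 25 are assumptions chosen to mirror the defaults and examples on this page, and the code is not necessarily how fviz_nbclust() computes these values internally.

```r
# Sketch: total within-cluster sum of squares and average silhouette width
# for k = 1..10, computed directly (assumed settings, not factoextra internals).
library(cluster)

set.seed(123)
x <- scale(iris[, -5])   # drop the species column and standardize
k_values <- 1:10         # assumed range, mirrors k.max = 10

# Total within-cluster sum of squares for each k
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Average silhouette width for each k.
# The silhouette is undefined for a single cluster, so k = 1 is set to 0 here.
d <- dist(x)
avg_sil <- sapply(k_values, function(k) {
  if (k == 1) return(0)
  cl <- kmeans(x, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

# Candidate number of clusters under the silhouette criterion
best_k_sil <- k_values[which.max(avg_sil)]
```

Plotting wss against k_values gives the curve in which one looks for an elbow, while best_k_sil is the k with the largest average silhouette width.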
fviz_nbclust(
  x,
  FUNcluster = NULL,
  method = c("silhouette", "wss", "gap_stat"),
  diss = NULL,
  k.max = 10,
  nboot = 100,
  verbose = interactive(),
  barfill = "steelblue",
  barcolor = "steelblue",
  linecolor = "steelblue",
  print.summary = TRUE,
  ...
)

fviz_gap_stat(
  gap_stat,
  linecolor = "steelblue",
  maxSE = list(method = "firstSEmax", SE.factor = 1)
)
x |
numeric matrix or data frame. In fviz_nbclust(), x can also be the result of the function NbClust(). |
FUNcluster |
a partitioning function that accepts as its first argument a (data) matrix like x and as its second argument, say k, k >= 2, the number of clusters desired, and returns a list with a component named cluster which contains the grouping of observations. Allowed values include: kmeans, cluster::pam, cluster::clara, cluster::fanny, hcut, etc. This argument is not required when x is an output of the function NbClust(). |
method |
the method to be used for estimating the optimal number of clusters. Possible values are "silhouette" (for average silhouette width), "wss" (for total within-cluster sum of squares) and "gap_stat" (for the gap statistic). |
diss |
dist object as produced by dist(), e.g. diss = dist(x, method = "euclidean"). Used to compute the average silhouette width of clusters, the within-cluster sum of squares and hierarchical clustering. If NULL, dist(x) is computed with the default method = "euclidean". |
k.max |
the maximum number of clusters to consider; must be at least two. |
nboot |
integer, number of Monte Carlo ("bootstrap") samples. Used only for determining the number of clusters using gap statistic. |
verbose |
logical value. If TRUE, progress messages are printed. |
barfill, barcolor |
fill color and outline color for bars |
linecolor |
color for lines |
print.summary |
logical value. If TRUE, the optimal number of clusters is printed by fviz_nbclust(). |
... |
optional further arguments passed to FUNcluster() |
gap_stat |
an object of class "clusGap" returned by the function clusGap() [in cluster package] |
maxSE |
a list containing the parameters (method and SE.factor) for determining the location of the maximum of the gap statistic. See ?cluster::maxSE, which also documents the allowed values for maxSE$method. |
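To make the effect of maxSE$method concrete, the following sketch applies each rule documented in ?cluster::maxSE to one and the same gap statistic. It assumes the scaled iris data and a deliberately small B = 10 to keep the run short, so the selected values are illustrative only and may change with a different seed or a larger B.

```r
# Sketch: compare the selection rules of cluster::maxSE on one clusGap result.
library(cluster)

set.seed(123)
x <- scale(iris[, -5])

# B = 10 keeps the sketch fast; it is too small for real use
gap_stat <- clusGap(x, FUNcluster = kmeans, nstart = 25,
                    K.max = 10, B = 10, verbose = FALSE)

gap <- gap_stat$Tab[, "gap"]     # gap statistic for k = 1..K.max
se  <- gap_stat$Tab[, "SE.sim"]  # its simulation standard error

# Method names as listed in ?cluster::maxSE
methods <- c("firstSEmax", "Tibs2001SEmax", "globalSEmax",
             "firstmax", "globalmax")

# Number of clusters chosen by each rule, with SE.factor = 1
k_hat <- sapply(methods, function(m) {
  maxSE(gap, se, method = m, SE.factor = 1)
})
print(k_hat)
```

The rules ending in "SEmax" choose the smallest k whose gap lies within SE.factor standard errors of the relevant maximum, so they tend to return fewer clusters than "firstmax" or "globalmax".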
fviz_nbclust, fviz_gap_stat: return a ggplot2 object.
Alboukadel Kassambara alboukadel.kassambara@gmail.com
library(factoextra)
library(ggplot2)  # for geom_vline()
set.seed(123)

# Data preparation
# +++++++++++++++
data("iris")
head(iris)
# Remove species column (5) and scale the data
iris.scaled <- scale(iris[, -5])

# Optimal number of clusters in the data
# ++++++++++++++++++++++++++++++++++++++
# Examples are provided only for kmeans, but
# you can also use cluster::pam (for pam) or
# hcut (for hierarchical clustering)

### Elbow method (look at the knee)
# Elbow method for kmeans
fviz_nbclust(iris.scaled, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)

# Average silhouette for kmeans
fviz_nbclust(iris.scaled, kmeans, method = "silhouette")

### Gap statistic
library(cluster)
set.seed(123)
# Compute gap statistic for kmeans
# We use B = 10 for the demo. The recommended value is ~500
gap_stat <- clusGap(iris.scaled, FUNcluster = kmeans, nstart = 25,
                    K.max = 10, B = 10)
print(gap_stat, method = "firstmax")
fviz_gap_stat(gap_stat)

# Gap statistic for hierarchical clustering
gap_stat <- clusGap(iris.scaled, FUNcluster = hcut, K.max = 10, B = 10)
fviz_gap_stat(gap_stat)
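The FUNcluster argument described above is not limited to the listed functions: any function that takes a data matrix and k and returns a list with a cluster component should fit that description. The sketch below defines such a wrapper around hclust() and cutree(). The name hclust_cut and the choice of Ward linkage are hypothetical, introduced only for illustration, and the wrapper is checked here against cluster::clusGap(), which documents the same contract for its FUNcluster argument.

```r
# Sketch: a user-defined partitioning function in the FUNcluster form.
library(cluster)

# Hypothetical helper (not part of factoextra): Ward hierarchical clustering
# cut into k groups, returned as list(cluster = ...).
hclust_cut <- function(x, k, ...) {
  tree <- hclust(dist(x), method = "ward.D2")
  list(cluster = cutree(tree, k = k))
}

x <- scale(iris[, -5])

# Direct call: one cluster label per observation
res <- hclust_cut(x, k = 3)
table(res$cluster)

# The same function used as FUNcluster in clusGap()
# (B = 10 and K.max = 6 are small values chosen to keep the sketch fast)
set.seed(123)
gap_stat <- clusGap(x, FUNcluster = hclust_cut, K.max = 6, B = 10,
                    verbose = FALSE)
```

Assuming factoextra is installed, the same wrapper could then be passed on, for example as fviz_nbclust(x, hclust_cut, method = "silhouette"), or the resulting gap_stat object handed to fviz_gap_stat().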