Cluster Analysis

Key Concepts and Terms

Cluster analysis, also called segmentation analysis or taxonomy analysis, is similar in purpose to Q-mode factor analysis -- both seek to identify homogeneous subgroups of cases in a population. That is, cluster analysis seeks to identify a set of groups which both minimize within-group variation and maximize between-group variation. Other techniques, such as latent class analysis, also perform clustering and are discussed separately.

Distance. The first step in cluster analysis is establishment of the similarity or distance matrix. This matrix is a table in which both the rows and columns are the units of analysis and the cell entries are a measure of similarity or distance for any pair of cases.
- Euclidean distance is the most common distance measure. A given pair of cases is plotted on two variables, which form the x and y axes. The Euclidean distance is the square root of the sum of the squared x difference and the squared y difference. (Recall high school geometry: this is the formula for the length of the hypotenuse of a right triangle.)
- K-means cluster analysis. K-means cluster analysis uses Euclidean distance. Initial cluster centers are chosen in a first pass of the data, then each additional iteration groups observations based on nearest Euclidean distance to the mean of the cluster. Thus cluster centers change at each pass. The process continues until cluster means do not shift more than a given cut-off value or the iteration limit is reached.
  - In SPSS, after choosing Analyze, Cluster, K-Means Cluster Analysis, the dialog box gives you the option to specify initial cluster centers. This is done by checking the "Read initial from" box in the "Cluster Centers" area of the dialog box, then clicking on the FILE button. The file referenced after pressing the FILE button (called the "center file") must have as its first column a variable named "cluster_". The additional columns are the same variables you specified in the dialog box, though they need not be in the same order. You may have additional variables in the center file beyond those specified in the dialog box -- these will be ignored. There will be one row for each cluster and you must have at least as many rows as the number of clusters you asked for in the dialog box. The values of cluster_ are 1, 2, 3, ..., K, where K is the number of clusters.
    The SPSS Cluster procedure (Analyze, Cluster, Hierarchical Cluster Analysis) generates cluster solutions for every number of clusters from 1 to K, but may be used only for relatively small samples. One may wish to use the Cluster procedure on a sample of cases (e.g., 200) to inspect results for different numbers of clusters. The optimum number of clusters depends on the research purpose. Identifying "typical" types may call for few clusters and identifying "exceptional" types may call for many clusters. After using Cluster to determine the desired number of clusters, the researcher may then wish to analyze the entire dataset with the Quick Cluster procedure (Analyze, Cluster, K-Means Cluster Analysis), specifying that number of clusters.
- Correlation of items can be used as a similarity measure. One transposes the normal data table in which columns are variables and rows are cases. By using columns as cases and rows as variables instead, the correlation is between cases and these correlations constitute the cells of the similarity matrix.
- Binary matching is another similarity measure, where 1 indicates a match and 0 indicates no match between any pair of cases. There are multiple matched attributes and the similarity score is the number of matches divided by the number of attributes being matched. Note that it is usual in binary matching to have several attributes because there is a risk that when the number of attributes is small, they may be orthogonal to (uncorrelated with) one another, and clustering will be indeterminate.
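
The three measures described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`euclidean`, `pearson`, `simple_matching`, `pairwise`) and the sample data are hypothetical, and statistical packages may compute these quantities with different options or scaling.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences across variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Correlation between two cases, computed across their variables."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)

def simple_matching(a, b):
    """Number of matching attributes divided by the number compared."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def pairwise(cases, measure):
    """Case-by-case matrix: rows and columns are both the cases."""
    return [[measure(a, b) for b in cases] for a in cases]

# Hypothetical data: three cases measured on two variables.
cases = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
distance_matrix = pairwise(cases, euclidean)

# Hypothetical binary attributes for two cases: two of four attributes match.
match_score = simple_matching((1, 0, 1, 1), (1, 1, 1, 0))
```

Passing a different `measure` to `pairwise` yields the corresponding similarity matrix, so the same table layout serves all three measures.
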
Cluster formation is the selection of the procedure for determining how many clusters are created, and how the calculations are done.
- Q-mode factor analysis can be used if the distance matrix is composed of correlations (see above). This method of cluster analysis has the special problem of negative factor loadings. In conventional factor analysis of variables, a negative loading indicates a negative relation of the variable to the factor. In Q-mode factor analysis, a negative loading does not have a clear meaning. One common approach is to consider all cases with negative loadings as being in a cluster of their own. Factor analysis as a clustering procedure also suffers from the problem of what to do with cases which crossload on more than one factor. Also, factor analysis assumes a linear model which may not be appropriate for clustering cases in particular research settings.
- F-ratio methods. There are many other cluster formation methods. Most rely on an analog to the F-ratio in analysis of variance, which is the ratio of between-groups variance to within-groups variance. Note that the results of F-ratio methods can depend on the initial, arbitrary selection of pairs of cases as starting centers for clusters. Because of this, an iterative approach is needed to establish stable results.
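
As a rough sketch of what an F-ratio analog looks like, the following computes a Calinski-Harabasz style pseudo-F for a given partition: between-cluster sum of squares divided by (k - 1), over within-cluster sum of squares divided by (n - k). The choice of this particular index, the function names, and the data are assumptions made for illustration; the source does not say which criterion a given procedure uses.

```python
def mean_vector(points):
    """Mean of each variable across a list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pseudo_f(points, labels):
    """Between-cluster variance over within-cluster variance for one partition.

    Assumes at least two clusters and non-zero within-cluster variation.
    """
    grand = mean_vector(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k, n = len(clusters), len(points)
    between = within = 0.0
    for members in clusters.values():
        center = mean_vector(members)
        between += len(members) * sq_dist(center, grand)
        within += sum(sq_dist(p, center) for p in members)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical data: two tight, well-separated pairs of cases.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (8.0, 9.0)]
good = pseudo_f(points, [0, 0, 1, 1])  # partition that follows the pairs
poor = pseudo_f(points, [0, 1, 0, 1])  # partition that splits the pairs
```

A partition that keeps similar cases together gives a much larger ratio than one that splits them, which is the sense in which such methods seek to maximize the ratio.
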
A dendrogram is a tree diagram often used to represent the results of a cluster analysis. Trees are usually depicted horizontally, not vertically, with each row representing a case. Cases with high similarity are adjacent. Lines indicate the degree of similarity or dissimilarity between cases. In ultrametric trees, line lengths indicate dissimilarities and by adding line lengths along the tree path connecting any two cases, one gets the dissimilarity index for those two cases.
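
The merge heights that a dendrogram draws could be computed along the following lines. This is a minimal sketch assuming single-linkage agglomeration on Euclidean distances (the text above does not name a linkage rule); `single_linkage`, `cophenetic`, and the data are hypothetical names and values. How the height relates to a sum of branch lengths depends on how a particular tree scales its branches.

```python
import math

def single_linkage(points):
    """Agglomerative clustering; returns merges as (members_a, members_b, height)."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

def cophenetic(merges, a, b):
    """Height at which cases a and b first fall in the same cluster."""
    for left, right, height in merges:
        if (a in left and b in right) or (a in right and b in left):
            return height
    return 0.0  # a and b are the same case

# Hypothetical data: four cases on two variables.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 2.0)]
merges = single_linkage(points)
heights = [h for _, _, h in merges]
```

Cutting the tree between two successive heights gives the cluster solution for the corresponding number of clusters.
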
Summary measures assess how the clusters differ from one another.
- Means and variances. A table of means and variances of the clusters with respect to the original variables shows how the clusters differ on the original variables.
- Discriminant analysis may be performed on membership/nonmembership in a cluster. This will show which variables contributed the most to definition of the cluster. It also gives the discriminant function formula for predicting cluster membership for additional cases.
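
A table of cluster means and variances can be sketched as follows. The variable names (`age`, `score`), the data, and the function `cluster_profile` are hypothetical, and the sample variance is assumed, so every cluster needs at least two cases.

```python
from statistics import mean, variance

def cluster_profile(rows, labels, names):
    """Per-cluster mean and sample variance of each original variable."""
    groups = {}
    for row, lab in zip(rows, labels):
        groups.setdefault(lab, []).append(row)
    profile = {}
    for lab, members in groups.items():
        # zip(*members) turns the cluster's rows into one column per variable.
        profile[lab] = {
            name: (mean(col), variance(col))
            for name, col in zip(names, zip(*members))
        }
    return profile

# Hypothetical data: four cases, two variables, with cluster labels 1 and 2.
rows = [(20, 1.0), (22, 3.0), (60, 9.0), (64, 11.0)]
labels = [1, 1, 2, 2]
profile = cluster_profile(rows, labels, ["age", "score"])

for lab, stats in sorted(profile.items()):
    for name, (m, v) in stats.items():
        print(f"cluster {lab}  {name:>5}  mean={m}  variance={v}")
```

Reading down a column of such a table shows which original variables separate the clusters most clearly.
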

Assumptions

Data are interval in level or are true dichotomies.
The same assumptions apply as for correlation, regression, factor analysis, and other members of the multiple linear general hypothesis family of procedures.
K-means cluster analysis assumes a large sample (e.g., > 200).
K-means cluster analysis usually generates different solutions, depending on the sequence of observations in the dataset.
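
The dependence on case order can be illustrated with a small sketch of the K-means iterations described earlier. It assumes, purely for illustration, that the first K cases in the dataset serve as the initial centers; actual software may choose initial centers differently. The function `k_means`, its parameters, and the data are hypothetical.

```python
import math

def k_means(points, centers, cutoff=1e-6, max_iter=10):
    """Iterate assignment and mean updates from the given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assign each observation to the nearest center (Euclidean distance).
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= cutoff:  # stop once no center moves more than the cut-off
            break
    return centers

# Hypothetical data: the same four cases in two different orders.
order_a = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
order_b = [(0.0, 0.0), (4.0, 0.0), (0.0, 1.0), (4.0, 1.0)]

# Assumption: the first K = 2 cases are used as the initial centers.
centers_a = k_means(order_a, order_a[:2])
centers_b = k_means(order_b, order_b[:2])
```

With the first ordering the iterations settle on centers at (2, 0) and (2, 1), pairing cases that are far apart; with the second they settle on (0, 0.5) and (4, 0.5). The data are identical, so the difference comes entirely from the starting centers.
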

Frequently Asked Questions

How do I get cluster analysis in SPSS?
What is hierarchical clustering?
Isn't discriminant analysis the same as cluster analysis?
What is Clustan?
