Cluster analysis: a misused term
The term “cluster analysis” is misused at the drop of a hat in everyday life. To hear people talk, every single manager is clustering, and at the same time the boss of a cluster; every company and every school is built out of clusters. The only thing you can learn from this abuse is that the terms “clustering” and “cluster analysis” can have many different meanings, depending on the industry. Here are the various meanings of “cluster”:
- Astronomy: A group of stars or star systems that belong together and are influenced by each other’s gravity.
- Business: A group of departments with the same background or function.
- Education: A group of students who take the same class together at the same time.
- Math: A collection of elements that share one or more properties.
- Organization: Parts of a company that perform certain activities in different (geographical) locations.
- Computer science: A data set that the designer thinks belongs together for whatever reason.
As you can see, the term “cluster” can be used in many ways, and many definitions qualify in some way, shape, or form. The common denominator among all these definitions is dividing things into groups.
The problem of clustering
Clustering is easily described, which is the root of the problem. It’s easy to give a definition:
Clustering is putting things into groups
The simple definition means that it’s easy to meet this basic criterion. It doesn’t say anything about the way in which the groups are made, for instance. Or whether or not the groups share the same properties. Breaking the term down makes it a bit easier to grasp:
Cluster analysis: forming a compact group; growing together.
This definition clears things up a bit: cluster analysis means bringing elements together in a compact group. It still doesn’t indicate the basis for forming the groups, however, or how compact they have to be, which leaves a lot of wiggle room in the results. That’s why we’ll give a more exact definition of the term “clustering” that is reasonably applicable within ICT and Big Data processing.
Cluster analysis: Uniting in compact groups based on one or more properties.
The question is to what extent the above definition is a usable solution for segmenting (very) large quantities of information. And can the division be made effectively? The goal of a cluster analysis is turning a collection of information into several smaller collections with fewer internal differences than the original collection.
Homogeneity and clustering
In other words, the clusters are more homogeneous than the collection they were taken from. That means you can validate each cluster based on the homogeneity of the various clusters compared to the original data set.
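As an illustration, this homogeneity check can be done by comparing the spread within each cluster to the spread of the original collection. Below is a minimal sketch in Python, using made-up 1-D values and variance as an assumed measure of homogeneity:

```python
from statistics import pvariance

# Toy 1-D data set and a hypothetical two-cluster split of it
# (illustrative values, not from any real data set)
data = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
clusters = [[1.0, 1.2, 0.9], [8.0, 8.3, 7.9]]

total_var = pvariance(data)                      # spread of the original collection
cluster_vars = [pvariance(c) for c in clusters]  # spread within each cluster
```

If each cluster’s variance is well below the variance of the original set, the clusters are indeed more homogeneous than the collection they were taken from.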
Solutions to performing cluster analyses
The origin of clustering can be found in anthropology. In 1911, the Polish anthropologist Jan Czekanowski introduced the term in his work on the separation and origins of European peoples. From 1938 onward, the method made its way into psychology thanks to the work of Joseph Zubin and Robert Tryon; Tryon used clustering in his research into possible hereditary intelligence traits in rats (Tryon’s Rat Experiment).
Many ways to divide clusters
This means that cluster analysis is not a specific algorithm but basically a method of partitioning information. That doesn’t mean that there are no algorithms that can perform cluster analyses and determine the homogeneity of clusters. There are so many ways of performing a cluster analysis that we can only discuss a few of them here.
1. Hierarchical cluster analysis
Hierarchical clustering is a way of dividing information into several related groups. These groups are formed by combining elements based on a function of distance. If the distance between an element and the center of the cluster is below a certain threshold, the element becomes a part of the cluster.
After determining the various clusters, it’s also possible to combine them into larger clusters, this time based on the distance between the clusters themselves. This lets you build a tree structure, a dendrogram, that shows the relations between the various clusters. This method leaves it up to the user to determine what level of clustering is required or desired.
Tree structure, but different
Hierarchical clustering only delivers a tree structure, wherein you choose the desired level. That means that this method of clustering does not deliver unique results. It only provides a series of possibilities that you can choose from depending on the circumstances.
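A minimal sketch of this idea in Python, using made-up 1-D points and single-linkage distance (one of several possible distance functions between clusters): the two closest clusters are merged repeatedly, and the recorded merge history is exactly the tree from which you choose a level.

```python
def agglomerative(points):
    """Merge the two closest clusters until one remains (single linkage).
    Returns the merge history: (cluster A, cluster B, merge distance)."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None  # (distance, index i, index j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((tuple(clusters[i]), tuple(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

merges = agglomerative([1.0, 1.1, 5.0, 5.2, 9.0])
```

Cutting this merge history at a chosen distance threshold yields one of the possible clusterings – the method itself doesn’t pick one for you.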
2. Centroid-based clustering
For this method, the first step is determining which properties the clusters should be based on, and how many clusters are needed. The next step is determining the size and place of the various clusters by minimizing the average distance between the points of a cluster and its estimated center.
A set procedure
A commonly used method for centroid-based clustering is the k-means cluster method. This involves first choosing the number of clusters k, and then walking through several set steps.
- Choose k random elements in the collection to serve as starting points for the clusters to be formed. These points are called vectors or centers of gravity: vectors because they combine several properties at once, and centers of gravity because they should end up in the middle of a cluster.
- Determine the borders of the clusters surrounding the various vectors. At this point it’s almost certain that the chosen vectors don’t point to the real center of gravity, the middle of the cluster.
- For all points in every cluster, calculate the distance to the nearest center of gravity. From this, calculate a new center of gravity based on the positions of the points and the calculated (average) distance.
- If the centers of gravity moved less than a certain threshold in step 3, the process is complete. If that isn’t the case, repeat the process from step 2.
Take note that the term “distance” can be defined in various ways. One of those definitions is the “normal” Euclidean distance, which can be measured using a ruler.
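The steps above can be sketched in a few lines of Python. This is a minimal illustration with made-up 2-D points and Euclidean distance, not a production implementation:

```python
import random

def dist(a, b):
    # the "normal" Euclidean distance between two 2-D points
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def mean(group):
    return (sum(p[0] for p in group) / len(group),
            sum(p[1] for p in group) / len(group))

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: k random starting vectors
    for _ in range(iters):
        # step 2: assign every point to its nearest center of gravity
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: dist(p, centers[i]))].append(p)
        # step 3: recompute each center of gravity as the mean of its group
        new_centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
        # step 4: stop once the centers barely move, otherwise repeat from step 2
        if max(dist(c, n) for c, n in zip(centers, new_centers)) < 1e-9:
            break
        centers = new_centers
    return centers, groups

# Two well-separated blobs of made-up points
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, groups = kmeans(points, k=2)
```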
Slower but more reliable
Because the choice of vectors (starting points) influences the clustering, it’s necessary to run this method several times with different starting vectors. That makes the method slower, but the result is a better and more reliable answer. Compared to hierarchical clustering, the user has less influence – they only choose the number of clusters k.
3. Density-based clustering
This method of clustering is based on the density of elements within a cluster: wherever many elements lie close together, the density is high and those elements are counted as part of a certain cluster. The density of appropriate elements for a certain collection of information is calculated with a formula. If the calculated density falls below a set threshold, you can assume that the collection in question has too few appropriate elements to form a cluster.
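As a minimal illustration of this density test, the Python sketch below uses made-up 1-D values and an assumed neighbour-count formula for density: only points with enough neighbours within a given radius pass.

```python
def dense_points(points, eps, min_pts):
    """Return the points that have at least min_pts neighbours within
    radius eps -- a simple, illustrative 1-D density test."""
    dense = []
    for p in points:
        neighbours = sum(1 for q in points if q != p and abs(q - p) <= eps)
        if neighbours >= min_pts:
            dense.append(p)
    return dense

# Three tightly packed points pass the density test; the outlier does not.
dense = dense_points([1.0, 1.1, 1.2, 5.0], eps=0.5, min_pts=2)
```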
No guarantee beforehand
One condition for using this method is that the data in question shows clear differences in density. This isn’t always the case and certainly can’t be guaranteed beforehand.
Validation of cluster analysis
Once a certain method has resulted in a cluster, the problem of validation rears its head. The question is to what extent the cluster analysis is correct. In other words, is there an independent method of verifying the quality of a cluster analysis? The answer to that question unfortunately isn’t satisfying. There are basically two ways of validating a clustering:
- Internal validation: Compare the outcome of the clustering quantitatively using alternative predictive analytics methods. This has its disadvantages. The success of a given method also depends on the way in which the data set has been put together. Two or more methods converging on one solution is no guarantee that it’s the best solution.
- External validation: Test the method on a data set compiled by a specialist, for which the outcome has been determined beforehand. This approach also has its disadvantages. The way in which the test set has been put together favors some methods and disadvantages others, and the specialist’s solution isn’t automatically the best solution for any given data set. That means that validation for clustering doesn’t go any further than showing that one method is better or worse than another for a certain data set. No absolute value can be attached to it.
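For external validation, the agreement between a computed clustering and a specialist’s reference division can at least be quantified. A minimal Python sketch using the Rand index (pair-counting agreement; one of several possible measures) on made-up labels:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree:
    both put the pair in the same cluster, or both keep it apart."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)
```

A value of 1.0 means the two divisions agree on every pair of points. Note that this only measures agreement with one particular reference division, so it doesn’t escape the limitation that no absolute value can be attached to the result.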
The limitations of cluster analysis
The biggest limitation of cluster analysis lies in the broadness of the term “clustering”. Because of this broad definition, there are many different methods that all somehow divide data into groups. As a consequence, a certain division of a set may be entirely artificial, such as societal segregation based on age, gender, or cultural background.
Is the chosen cluster relevant?
In many cases, no one wonders to what extent the chosen clustering is relevant given the issue. This illustrates a second limitation of clustering: it’s almost impossible to test beforehand whether a chosen clustering is relevant and reliable – in other words, whether the division into groups is based on properties that are sensible and logical. Usually, this question can only be answered after the fact, and in the best case only partially.
Limited validation options
Another limitation of clustering is the limited ways in which the results can be validated. That means the reliability of clustering can only be determined to a very limited extent.
Especially useful for new research
On the other hand, if you recognize the limitations of clustering, it can be a very useful method for gaining insight into the properties of the elements of a large data set – in particular for initial research into a new, unknown set. For a more thorough analysis, other methods, such as decision trees and neural networks, may be more appropriate. Many of these techniques can be independently validated, or the quality of their results can be checked, which makes them preferable for research involving critical information.
Want to meet cluster specialists?
Our consultants and experts would love to introduce you to our vision on data analytics, machine learning solutions, and cluster analyses. Make an appointment to discuss the opportunities and possibilities of clustering for your organization.