Unsupervised Clustering

As data sets grow larger and more complex, machine learning methods are becoming more pervasive in the biomedical community. Many researchers are not trained in building and interpreting such models, and it can be difficult to choose the appropriate approach in a given context. In its new version, Qlucore Omics Explorer expands the collection of machine learning functionality, and now also includes k-means clustering and classification.

When choosing between machine learning methods it is important to distinguish between supervised and unsupervised methods, which are used in different contexts and for different purposes. Unsupervised methods do not use any external information (annotations, such as disease status or other traits) about the objects to be analyzed, but rather try to find dominating structure or patterns in the data, patterns that can then be interpreted by the researcher. Supervised methods, on the other hand, typically aim at building models that predict or 'explain' some pre-specified annotation, e.g. disease status or the response to a treatment. This annotation may or may not correspond to the main pattern(s) in the data. Classification, or predictive modeling, is an example of supervised learning. You can read more about the new classification functionality in Qlucore Omics Explorer 3.2.

If the goal is to get an overview of a data set, to see which patterns are strongest and whether the samples naturally partition into subgroups, an unsupervised method such as clustering or PCA should be used. Here, we describe unsupervised clustering and discuss how and when it can be used.

Qlucore Omics Explorer offers two types of clustering methods: hierarchical clustering (combined with heatmaps) and k-means clustering. Both are used for the same purpose: to find subgroups among the samples, such that samples within one group are more “similar” to each other than samples belonging to different groups, where “similar” can be formally defined in various ways. The difference is that the hierarchical clustering builds a “cluster tree” (or dendrogram), which organizes the samples hierarchically but does not directly divide them into clusters, while the k-means clustering partitions the samples into a pre-defined number of groups.
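
To make the idea of a cluster tree concrete, here is a minimal sketch of agglomerative (bottom-up) hierarchical clustering in plain Python. It records the order in which groups of samples are merged, which is the information a dendrogram displays. Single linkage and Euclidean distance are assumptions chosen for brevity, and the function name single_linkage_tree is hypothetical; this is not a description of how Qlucore Omics Explorer implements hierarchical clustering.

```python
import math

def single_linkage_tree(points):
    """Return the merge history (the 'cluster tree') for a list of points."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per sample
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance,
        # i.e. the smallest distance between any two of their members.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))  # record one tree node
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Toy data (made up for illustration): two well-separated groups of samples.
samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
tree = single_linkage_tree(samples)
for left, right, height in tree:
    print(left, "+", right, "merged at distance", round(height, 2))
```

Note that the result is a sequence of merges rather than a partition: the samples are only divided into clusters once the tree is cut at some height, whereas k-means needs the number of groups up front and returns a partition directly.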

In practice, you might use a clustering approach to:

  • evaluate whether there are subtypes of a particular disease, i.e. whether the samples group into different clusters based on some measured data. These clusters may represent different disease types, which have different prognoses and behavior.
  • explore a data set and look for artifacts. This can be done by clustering the data and examining whether the obtained clusters are associated with the signal(s) of interest, or rather with spurious ones such as batch effects or other technical artifacts.

Large and complex data sets often contain a lot of noise, in the sense that weaker signals interfere with the stronger ones and hence can impact the performance of clustering algorithms. Qlucore Omics Explorer includes two tools that can reduce the impact of noise: variance filtering and projection score. Moreover, a silhouette plot type is included to help evaluate the quality of a given sample partitioning. In Qlucore Omics Explorer, silhouette values are calculated for each generated k-means clustering.
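
As an illustration of the idea behind variance filtering, the sketch below keeps only those variables whose variance across samples reaches a chosen fraction of the largest variance in the data set. The function name variance_filter, the min_fraction parameter, and the use of a fraction of the maximum variance as the cutoff are assumptions made for this example; the exact filtering rule used in Qlucore Omics Explorer may differ.

```python
from statistics import pvariance

def variance_filter(rows, min_fraction):
    """Return the indices of variables (rows) with enough variance.

    rows: one list of measurements per variable, one value per sample.
    min_fraction: cutoff, as a fraction of the largest variance (assumed rule).
    """
    variances = [pvariance(row) for row in rows]
    cutoff = min_fraction * max(variances)
    return [i for i, v in enumerate(variances) if v >= cutoff]

# Toy data (made up): three variables measured in four samples.
data = [
    [1.0, 5.0, 1.0, 5.0],    # variance 4
    [2.0, 2.1, 2.0, 2.1],    # nearly constant, mostly noise
    [0.0, 10.0, 0.0, 10.0],  # variance 25
]
kept = variance_filter(data, min_fraction=0.1)
print(kept)
```

Raising min_fraction removes more low-variance variables, which reduces noise but, if pushed too far, can also discard weaker signals of interest.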

A possible exploratory workflow, combining the noise reduction and clustering functionality of Qlucore Omics Explorer to find subgroups among the samples, is to:

  • Use PCA in combination with variance filtering and projection score to reduce noise levels. It is not possible to give specific recommendations on how much to filter, since data sets can be very different. Note that the implementation of projection score in the program aims at maximizing the information in the 3-dimensional PCA plot. When the subgroup structure is more complex, or if there is one very dominant group in the data, variance filtering to the maximum projection score can be too stringent and also remove valuable signals.
  • Generate a clustering with one less cluster than you expect from a biological point of view.
  • Inspect the silhouette plot.
  • Increase the number of clusters by one and inspect the silhouette plot again.
  • Continue to increase the number of clusters by one and inspect the silhouette plots.
  • Terminate when increasing the number of clusters no longer leads to improved silhouette values.
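
The clustering steps of this workflow can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Qlucore Omics Explorer implementation: the k-means routine uses a simple deterministic farthest-first initialization for reproducibility, "improved silhouette values" is interpreted as an increase in the mean silhouette value over all samples, and the names kmeans, mean_silhouette, choose_k and max_k are hypothetical. The silhouette value of a sample is (b - a) / max(a, b), where a is its mean distance to the other samples in its own cluster and b is its mean distance to the samples in the nearest other cluster.

```python
import math

def kmeans(points, k, max_iter=100):
    """Lloyd's k-means with farthest-first initialization (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def mean_silhouette(points, labels):
    """Mean silhouette value; assumes at least two non-empty clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:  # convention: a singleton cluster scores 0
            values.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        values.append((b - a) / max(a, b))
    return sum(values) / len(values)

def choose_k(points, expected_k, max_k):
    """Start one below the expected number of clusters, then increase k
    until the mean silhouette value stops improving."""
    best_k = max(2, expected_k - 1)
    best_score = mean_silhouette(points, kmeans(points, best_k))
    for k in range(best_k + 1, max_k + 1):
        score = mean_silhouette(points, kmeans(points, k))
        if score <= best_score:
            break
        best_k, best_score = k, score
    return best_k, best_score

# Toy data (made up): three well-separated groups of four samples each.
samples = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 0), (10, 1), (11, 0), (11, 1),
           (5, 9), (5, 10), (6, 9), (6, 10)]
best_k, best_score = choose_k(samples, expected_k=3, max_k=6)
print(best_k, round(best_score, 2))
```

In real data the mean silhouette is only a summary; inspecting the per-sample silhouette plot, as the workflow suggests, also reveals individual samples that fit poorly in their assigned cluster.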
