Unsupervised Learning and Clustering

There are three main ways a machine can learn from data: Supervised Learning, Unsupervised Learning and Reinforcement Learning. We covered Reinforcement Learning in an earlier post on AWS DeepRacer. In this blog post, I will outline Unsupervised Learning, which deals with unlabelled data, focusing in particular on the technique of clustering.

What is Unsupervised Learning?

When we let a machine figure out patterns and connections in data, we usually provide the algorithm with both the data and a label for each observation. This label can take many forms, such as a category or a number. It defines what the machine should learn and what links two observations together, and it tells the machine what to look for in new data. In Unsupervised Learning, these labels are missing, and all the machine can take into account are the observations themselves. In other words, nothing supervises the algorithm while it is trained.

Unsupervised Learning works similarly to how children learn to identify new things. If a family often drives around with their child, the child becomes familiar with what a car is. When the family buys a new car, the child does not need to be told explicitly that this is a car: the child sees that the new object has four wheels and seats and that it drives on the road, and deduces that it is close enough to the previous car that it must be a car as well. The child might not, however, be able to tell whether the new car is an SUV or a sports car without having seen more examples of each.

In Unsupervised Learning, a similar approach is taken with data. The machine is left to figure out on its own what makes observations similar, and draws its conclusions about the data from that.


Clustering is a technique used in Unsupervised Learning in which the machine creates clusters (groups) of observations based on how similar they are to each other. One of the best-known clustering techniques is K-means clustering. Given a pre-defined number of clusters K, the algorithm places K centre points in the data and then iterates: each data point is assigned to the centre point it is closest to, and each centre point is then moved to the mean of the points assigned to it. This repeats until the assignments stop changing. The aim is to keep the intra-cluster distances, the distances between the members of a cluster and their centre point, as small as possible, which in turn keeps the clusters well separated from each other.

K-means clustering, image from Victor Lavrenko's video lecture
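
To make the assign-and-move loop concrete, here is a minimal sketch of K-means in plain Python. It is an illustration rather than a production implementation: the function name kmeans, its parameters and the toy points are my own assumptions, and the starting centres are simply K randomly chosen data points.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    Returns (centres, labels), where labels[i] is the index of the
    centre that points[i] was assigned to.
    """
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # assumption: start from k random data points
    labels = None
    for _ in range(iterations):
        # Assignment step: every point joins its nearest centre.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change, so we have converged
        labels = new_labels
        # Update step: move every centre to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its old centre
                centres[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centres, labels


# Toy data: two well-separated groups of three points each.
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
centres, labels = kmeans(points, k=2)
print(centres)
print(labels)
```

On this toy data the first three points should end up sharing one label and the last three the other. On real data the result can depend on the starting centres, which is why libraries usually run the algorithm several times from different starts.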

This method works similarly to how a child figures out what counts as a car. We can plot the observations on a two-dimensional graph, with the number of tires on one axis and the number of people the object can fit on the other, and see whether we can find the car/not-car clusters:

It's clear from this graph that the cars will be clustered together, but would K-means be able to identify that the bike and the motorcycle are not cars? What if the plane lies much further away than the rest of the observations? A human child would intuitively know that the plane and the dog are not cars, but if the wrong features of the data are selected, a clustering algorithm might put them in the same cluster as the car.
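
A quick way to see the problem is to look at the raw distances a clustering algorithm would work with. The sketch below uses made-up (tires, people) values for the objects mentioned above; the numbers and the helper distances_from are assumptions for illustration only, not measurements.

```python
import math

# Hypothetical (tires, people it can fit) values, invented for illustration.
observations = {
    "car": (4, 5),
    "SUV": (4, 7),
    "sports car": (4, 2),
    "motorcycle": (2, 2),
    "bike": (2, 1),
    "dog": (0, 0),
    "plane": (6, 150),
}


def distances_from(name, data):
    """Euclidean distance from one observation to all others, nearest first."""
    origin = data[name]
    others = (
        (other, math.dist(origin, point))
        for other, point in data.items()
        if other != name
    )
    return sorted(others, key=lambda pair: pair[1])


for other, distance in distances_from("car", observations):
    print(f"{other:<12}{distance:8.2f}")
```

With these values the plane's seat count dominates everything else: the plane sits more than twenty times further from the car than the dog does. Asked for two clusters, an algorithm would be expected to isolate the plane and lump the dog, the bike and the motorcycle in with the cars. Rescaling the features or choosing different ones changes that picture.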

Why and when to use Unsupervised Learning?

The main advantage of Unsupervised Learning lies in its ability to deal with unlabelled data. When we need to bring structure to massive amounts of data and segment it based on its features, a clustering technique is a good way to get a first insight. Keep in mind that Unsupervised Learning is best suited to descriptive and exploratory tasks, as its results tend to be less accurate than those of supervised approaches. However, Unsupervised Learning helps greatly with finding the important features, identifying outliers and creating additional features for Supervised Learning. To get the most out of your data, a combination of both methods works best.
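
As a rough sketch of the last two uses, the snippet below turns cluster centres into two extra features per observation (the cluster index and the distance to that cluster's centre) and flags distant points as possible outliers. The centres, the points and the threshold are all assumed values for this toy example, and cluster_features is a hypothetical helper, not part of any library.

```python
import math


def cluster_features(point, centres):
    """Return (index of the nearest centre, distance to that centre)."""
    distances = [math.dist(point, centre) for centre in centres]
    label = min(range(len(centres)), key=distances.__getitem__)
    return label, distances[label]


centres = [(1.0, 1.0), (8.0, 8.0)]  # assumed result of an earlier clustering run
points = [(1.2, 0.9), (7.5, 8.3), (4.5, 15.0)]
threshold = 3.0  # arbitrary cut-off chosen for this toy data

for point in points:
    label, distance = cluster_features(point, centres)
    flag = "possible outlier" if distance > threshold else ""
    print(point, label, round(distance, 2), flag)
```

The cluster index and the distance could then be added as extra columns to the data used by a supervised model, while the flagged points are candidates for a closer manual look.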

What real-world examples can you think of where Unsupervised Learning is the right way to go?
