Unsupervised clustering of data with graphs

How to use graph community detection algorithms to cluster the MNIST digits dataset and get 95% accuracy.

In this article I will share a simple approach to building a graph from vectorised data and applying a graph community detection algorithm to cluster it. This is a toy example which clusters the MNIST digits dataset; as this simple dataset is already labelled, it allows us to check the algorithm’s accuracy. I have kept the descriptions of the community detection algorithms as short as I can. Still, if you are not interested in the details of community detection algorithms at all, feel free to skip the descriptions of the Louvain and Newman’s leading eigenvector algorithms.

The MNIST digits dataset is a set of 70k handwritten digits (0–9) with labels. It’s a well-known ‘hello world’ type dataset for data clustering and some image labelling machine learning algorithms. Each sample is a 28x28 pixel image represented as a vector of length 784 (28x28). Sample images are shown below.

Sample images from the MNIST digits dataset.

A network (graph) is considered to have a community structure if its nodes can be grouped into collections that are densely connected internally compared to their connections to nodes outside the given community. Communities can be defined as overlapping (one node can belong to several communities) or non-overlapping (one node belongs to exactly one community). In our case we will consider the latter.

In practice, such communities often correspond to functional units of nodes, e.g. proteins with similar biological functions in a protein-protein interaction network, or specific social groups, like students taking the same course, in social networks.

The concept of community is related to modularity¹. Modularity is a measure which quantifies the strength of the identified communities in a given network. High modularity means dense connections between nodes within the identified communities and sparse connections between nodes from different communities. Its value is normalised to lie within the range [-1, 1]. Modularity is given by the following formula.

Q = \frac{1}{2m}\sum_{ij}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)

Modularity formula, where A is the adjacency matrix, k_i is the (weighted) degree of node i, m is the total number (or total weight) of edges, and δ(c_i, c_j) equals 1 when nodes i and j belong to the same community and 0 otherwise.

As the number and size of huge networks (e.g. social networks) has increased significantly over the last decade, the study of community detection has drawn the attention of many scientists. In this article we will make use of two well-known community detection algorithms: the Louvain algorithm and Newman’s leading eigenvector algorithm.

Louvain algorithm

The Louvain algorithm² is a heuristic method which maximises modularity. It can be divided into two phases. During the first phase, each node is assigned to its own community (one community per node). Then, for each node i, we tentatively move it to the community of each of its neighbours j and measure the gain in modularity, using the formula below.
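\Delta Q = \left[\frac{\Sigma_{in} + k_{i,in}}{2m} - \left(\frac{\Sigma_{tot} + k_i}{2m}\right)^{2}\right] - \left[\frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^{2} - \left(\frac{k_i}{2m}\right)^{2}\right]

Modularity gain formula from the original paper², where Σ_in is the sum of the weights of the links inside the neighbour’s community, Σ_tot is the sum of the weights of the links incident to nodes in that community, k_i is the weighted degree of node i, k_{i,in} is the sum of the weights of the links from i to nodes in the community, and m is the total weight of all links in the network.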

If this gain is positive, the node is removed from its community and moved to the neighbour’s community. If there is no gain, the node stays in its community. When modularity can no longer be improved by moving any single node, the first phase stops. The second phase consists of creating super-nodes from the identified communities: all of the nodes from a given community are merged into one super-node. The weight of the link between two super-nodes is the sum of the weights of the links between the nodes of the two corresponding communities. Additionally, a self-loop is added to each super-node, with weight equal to the sum of the weights of the links between the nodes of that community. After this phase, the first and then the second phase can be reapplied. The algorithm repeats until no further gain in modularity can be made.
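For readers who want to try this out, here is a minimal sketch using python-igraph, whose community_multilevel function implements the Louvain method (the Zachary karate club graph below is just a stand-in example, not part of our pipeline):

```python
import igraph as ig

# Louvain community detection on a small, well-known example graph.
g = ig.Graph.Famous("Zachary")        # Zachary's karate club network
clusters = g.community_multilevel()   # pass weights=... for weighted graphs
print(len(clusters), "communities, modularity:", clusters.modularity)
```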

Newman’s leading eigenvector algorithm

Newman’s leading eigenvector algorithm¹ generalises the spectral graph partitioning method to community detection by maximising the modularity measure instead of minimising the cut size. In short, it achieves this by finding the eigenvector corresponding to the largest eigenvalue of the modularity matrix of a given graph.

B_{ij} = A_{ij} - \frac{k_i k_j}{2m}

Modularity matrix formula, with A the adjacency matrix, k_i the degree of node i, and m the number of edges.

Then it assigns each node to one of two communities based on the sign of the corresponding value in this eigenvector, using the formula below.

s_i = \begin{cases} +1 & \text{if } u_i > 0 \\ -1 & \text{otherwise} \end{cases}

Formula for the values in the index vector s, where u is the leading eigenvector of the modularity matrix.

After the first pass of the algorithm, the network is divided into two groups. In order to divide it into more groups, the algorithm is run again on each group, but this time, instead of maximising modularity itself, it maximises the difference between the modularity of the network before and after the subdivision. This difference is given by the following formula:

\Delta Q = \frac{1}{4m}\, s^{T} B^{(g)} s, \quad B^{(g)}_{ij} = B_{ij} - \delta_{ij}\sum_{k \in g} B_{ik}

Formula for the additional modularity contribution from subdividing group g, where B^{(g)} is the modularity matrix restricted to the nodes of g.

The algorithm stops when no modularity gain can be achieved by further subdivision.
As we can see, Newman’s leading eigenvector method performs a bisection of each community (and then of each subcommunity). Yet this doesn’t prevent it from achieving good results.
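Again, a minimal sketch with python-igraph, which exposes this method as community_leading_eigenvector:

```python
import igraph as ig

# Newman's leading eigenvector method: repeated bisection that stops once
# no further split increases modularity.
g = ig.Graph.Famous("Zachary")
clusters = g.community_leading_eigenvector()
print(len(clusters), "communities, modularity:", clusters.modularity)
```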

Dimensionality reduction

Before we start creating a graph, we will reduce the dimensionality of our dataset. These images (and thus vectors) share a lot of structure: the digits are roughly centred, so most of the variability is present in the centre of the images, and most of the images have white corners. We hope dimensionality reduction will remove this redundancy in our data.
The current dimensionality of our dataset is 784 (28x28), so let’s try to reduce it. Having played around with this model a bit, I found that reducing the data to 15 dimensions with the UMAP algorithm works quite well. I only compared it with PCA, though, so there is a lot of room for exploration. Let’s write our code for dimensionality reduction.
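A minimal sketch of this step, assuming fetch_openml as the data source and default UMAP settings apart from n_components=15 (only the target of 15 dimensions is taken from the text):

```python
import umap
from sklearn.datasets import fetch_openml

# Load MNIST digits; each sample is a 784-dimensional vector (28x28 pixels).
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X, y = X[:500] / 255.0, y[:500].astype(int)  # first 500 images, as in the first experiment

# Reduce the 784 dimensions down to 15 with UMAP.
reducer = umap.UMAP(n_components=15, random_state=42)
X_reduced = reducer.fit_transform(X)
print(X_reduced.shape)  # (500, 15)
```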

Graph creation

Now we can create a graph. We will consider each image a node in our graph and create a link between two nodes based on their similarity, with the edge weight set to the value of that similarity. To proceed with this step we need to define a similarity measure between our vectors. In this case we will use simple cosine similarity, as it seems to be the most accurate of the basic measures considered (the others being Euclidean, Manhattan, and Chebyshev distances). It is worth mentioning that it is not a good idea to create a dense graph with every node connected to every other node. Thus, we will set a threshold which the similarity between two images must exceed for a link to be created between them.

In this example we will arbitrarily set this threshold at the 0.92 quantile of all pairwise similarities. I found it to be quite a good starting threshold on other datasets too. The code for graph creation is below.
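A sketch of this step, continuing from X_reduced above (the 0.92 quantile is from the text; everything else is my reconstruction):

```python
import numpy as np
import igraph as ig
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarities between the reduced vectors.
similarities = cosine_similarity(X_reduced)
upper = similarities[np.triu_indices_from(similarities, k=1)]
threshold = np.quantile(upper, 0.92)  # keep only the most similar 8% of pairs

# One node per image; an edge wherever similarity exceeds the threshold,
# weighted by that similarity.
sources, targets = np.where(np.triu(similarities, k=1) > threshold)
g = ig.Graph(n=similarities.shape[0], edges=list(zip(sources.tolist(), targets.tolist())))
g.es["weight"] = similarities[sources, targets].tolist()
print(g.summary())
```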

Intuition

Let’s pause for a moment to think about what we have just created and build some intuition. We created a graph where, ideally, images are linked to each other only if they depict the same digit. Unfortunately, we know that people have different handwriting styles and these digits are not perfectly written. See the example image below for the digits “3” and “5”:

Digits from the MNIST dataset labeled as “3” and “5”, where “5” might slightly resemble “3”.

Yet we might assume that handwritten digits follow a (multivariate) normal distribution: most are written fairly similarly, with a small number of outliers. So we will most likely end up with a graph where some nodes with different digit labels are linked (e.g. the digits “5” and “3” from the example above). However, we expect nodes which share the same label to be densely connected compared to nodes with different labels. This actually sounds a lot like the definition of modularity. Our hypothesis is that our algorithm should be able to divide our dataset purely on the basis of the graph structure and find the right communities.

We can expect sloppily scribbled digits to play the role of bridge nodes (nodes which link different communities) in our network. Thanks to the graph structure, our community detection algorithms should still be able to assign nodes to the correct communities: a badly scribbled digit may be linked to nodes representing two different digits (as the image from the previous example might be linked to both “3” and “5”), yet we expect it to be more densely linked to nodes of its true category, which are in turn densely linked to each other.

This also allows us to formulate another hypothesis: our model’s ability to generalise should increase as the number of samples grows, which is true for many Machine Learning models, yet not always for unsupervised ones. The more samples we have, the more representative our sample of the population is; as a consequence, connections between sloppily written digits, as well as their connections to well-written digits of the same label, should increase, allowing our model to distinguish these groups even better.

Let’s run our model and check its results. First, though, let’s pick another well-known clustering model to serve as a benchmark. In this case we will use DBSCAN, specifically the sklearn implementation.
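A sketch of the benchmark; eps and min_samples below are illustrative guesses, not the values tuned in this article:

```python
from sklearn.cluster import DBSCAN

# DBSCAN benchmark on the UMAP-reduced vectors.
dbscan = DBSCAN(eps=0.5, min_samples=5)
db_labels = dbscan.fit_predict(X_reduced)
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)  # -1 marks noise
print("DBSCAN found", n_clusters, "clusters")
```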

We also need to keep in mind that our models are not aware of the labels, as these are unsupervised methods. We will simply assume that the label of a specific cluster is the most common true label among the vectors (images) in that cluster.
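A small helper implementing this majority-label rule (my reconstruction, not the author’s notebook code):

```python
import numpy as np

def cluster_accuracy(labels_true, labels_pred):
    """Relabel each cluster with its most common true digit, then score accuracy."""
    labels_true = np.asarray(labels_true, dtype=int)
    labels_pred = np.asarray(labels_pred, dtype=int)
    mapped = np.empty_like(labels_true)
    for c in np.unique(labels_pred):
        mask = labels_pred == c
        mapped[mask] = np.bincount(labels_true[mask]).argmax()  # majority vote
    return float((mapped == labels_true).mean())
```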

As we want clusters which can identify the digits 0–9, we will constrain the number of output clusters to lie within the range of 10–12. I believe it is acceptable to keep one or two extra clusters if there are one or two digits that are each written in two significantly different ways. This means that we will try to tweak our models (DBSCAN and graph clustering) to output 10–12 clusters, as in the grid-search sketch below.
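For the graph models, one way to implement that tweak is a grid search over the similarity-threshold quantile, keeping values that yield 10–12 communities. A sketch, reusing similarities and upper from the graph-creation step (the candidate quantiles are illustrative):

```python
import numpy as np
import igraph as ig

# Scan candidate quantiles for the similarity threshold and report how many
# communities each produces; keep the ones giving 10-12 clusters.
for q in (0.92, 0.95, 0.99, 0.9983, 0.9991):
    thr = np.quantile(upper, q)
    src, dst = np.where(np.triu(similarities, k=1) > thr)
    g_q = ig.Graph(n=similarities.shape[0], edges=list(zip(src.tolist(), dst.tolist())))
    g_q.es["weight"] = similarities[src, dst].tolist()
    n = len(g_q.community_leading_eigenvector(weights="weight"))
    print(f"quantile {q}: {n} communities")
```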

500 images

Screenshot of a jupyter notebook script and output for the DBSCAN algorithm run on a 500-image sample. Code written by the author.

The Louvain algorithm with the default threshold returns 11 clusters and scores 0.756 accuracy and ~0.6984 NMI. Not bad for such a simple approach.

Screenshot of a jupyter notebook script and output for the Louvain algorithm run on a 500-image sample. Code written by the author.

Newman’s leading eigenvector algorithm gives us 10 clusters and scores 0.744 accuracy and ~0.7008 NMI. To keep within the cluster-count constraint, the threshold was set to 0.9983 by grid search.

Screenshot of a jupyter notebook script and output for Newman’s leading eigenvector algorithm run on a 500-image sample. Code written by the author.
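Reconstructed in code, the run in the screenshot boils down to something like this (using g and cluster_accuracy from the sketches above; variable names are mine):

```python
from sklearn.metrics import normalized_mutual_info_score

# Run Newman's leading eigenvector method on the similarity graph and score
# it against the ground-truth digits.
clusters = g.community_leading_eigenvector(weights="weight")
pred = clusters.membership
print("clusters:", len(clusters))
print("accuracy:", cluster_accuracy(y, pred))
print("NMI:     ", normalized_mutual_info_score(y, pred))
```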

20 000 images

This time let’s try running our models on a sample of 20 000 images.

When we run the Louvain algorithm we get 10 clusters, with an accuracy of 0.9468 and NMI of ~0.8900. This time, to keep within the cluster-count constraint, the threshold was set to 0.9991 by grid search. This is quite a nice accuracy, but let’s see if we can do better.

Running Newman’s leading eigenvector algorithm on the 20 000-image sample gives us 10 clusters with accuracy ~0.9522 and NMI ~0.8941. As before, the threshold was set to 0.9991 by grid search.
This is quite a nice result, isn’t it? 95% accuracy with unsupervised learning, without bringing in Deep Learning!

Screenshot of a jupyter notebook script and output for Newman’s leading eigenvector algorithm run on a 20 000-image sample. Code written by the author.

Let’s now check what score Newman’s leading eigenvector algorithm can achieve on the first 500 digits this time.

We can see that this time our algorithm scores 0.944 accuracy and 0.8971 NMI on the same first 500 digits, compared to 0.744 accuracy and ~0.7008 NMI when it was run on the 500-digit sample alone. This confirms one of our hypotheses: by enriching the graph with more nodes, we provide both a more accurate distribution of vectors and a graph structure closer to reality, which results in better clustering.

Wrongly assigned classes

Let’s have a quick look at a few images which were wrongly classified by our best algorithm (Newman’s leading eigenvector).

Classes wrongly assigned by Newman’s leading eigenvector algorithm run on the 20 000-image sample. T — ground-truth label, A — assigned label.

As we can see, some of these wrongly assigned labels are quite extreme cases. The 4th and 5th images of the first row, and the whole second row, might be misleading even to a human eye. Personally, I would also read the last digit of the 2nd row as a “1”. This shows that, in most cases, our algorithm is confused by sloppy writing.

Graph visualisation

As any work on data doesn’t count without a visualisation, let’s draw the graph we created, first with the true labels, and then with the labels assigned by our algorithm.

The created graph with ground-truth labels for the first 500 images. Image created by the author.
The created graph with labels assigned by Newman’s leading eigenvector trained on the first 500 images. Image created by the author.

Summary

We created a graph from the vectors in our dataset by representing each vector (image) as a node and creating a link between two nodes whenever the cosine similarity between the corresponding vectors exceeds an arbitrarily set threshold. As a result, we reached 95% accuracy using Newman’s leading eigenvector community detection method to cluster the data.

Although the accuracy is really impressive, we need to keep in mind that the MNIST digits dataset is not the most difficult dataset for clustering. It was used here as a toy example because it provides ground-truth labels with which to verify the clustering accuracy. The method may have flaws which become pronounced on other datasets. Thus, it would be interesting to see how it works on other examples and to explore other similarity measures as well as other dimensionality reduction methods.
