The White Rabbit

I would spend 55 minutes defining the problem and then five minutes solving it. (Albert Einstein)

Published on Saturday, 24 October 2020

Tags: ML1

Unbalanced datasets in Machine Learning

Short notes about dealing with unbalanced datasets in Machine Learning.


Let's consider a dataset for a one-dimensional binary classification problem. There are two labels: "1" and "0". Most of the training samples are labelled "0" and only a few of them "1". In other words, we are dealing with an unbalanced dataset.

Now, our goal is to train a classifier that can detect all the "1" samples; the "0" class is not of interest.

What's the issue here? Well. By minimizing the output error, the algorithm could reach a very high accuracy just by always returning zero. This is an example of high accuracy which is completely useless.
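A minimal numeric sketch of this trap (the class counts here are made up for illustration): a classifier that always predicts "0" scores 99% accuracy on a 990-to-10 dataset, while detecting none of the "1"s.

```python
import numpy as np

# Hypothetical unbalanced labels: 990 "0" samples and 10 "1" samples.
y_true = np.array([0] * 990 + [1] * 10)

# A degenerate classifier that always predicts "0".
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_true == y_pred)  # fraction of correct predictions
recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(f"accuracy: {accuracy:.2f}")  # 0.99 — looks great
print(f"recall:   {recall:.2f}")    # 0.00 — useless for detecting the "1"s
```

Recall (the fraction of actual "1"s we catch) is the honest metric for this goal, and it is exactly zero.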

What would be the correct approach to tackle this problem?

Depending on the situation, some possibilities may or may not work:

  • Oversample the "1" class until both classes have the same number of samples. New samples can be generated by slightly perturbing existing ones, for example by adding random noise or applying small translations (data augmentation).
  • Undersample the "0" class.
  • Change the error function to assign higher weights for the "1".
  • Optimize a metric that is robust to class imbalance, such as recall or the F1 score, instead of plain accuracy.
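The first option can be sketched in a few lines. This is a toy example with made-up one-dimensional Gaussian data: the minority class is resampled with replacement and a small amount of noise is added to each duplicate, so the augmented copies are not exact clones.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical one-dimensional training data: many "0"s, few "1"s.
x0 = rng.normal(0.0, 1.0, size=95)  # majority class "0"
x1 = rng.normal(4.0, 0.5, size=5)   # minority class "1"

# Oversample the "1"s until the classes are balanced, jittering each
# duplicate with small random noise (a simple form of data augmentation).
n_extra = len(x0) - len(x1)
resampled = rng.choice(x1, size=n_extra, replace=True)
x1_aug = np.concatenate([x1, resampled + rng.normal(0.0, 0.05, size=n_extra)])

print(len(x0), len(x1_aug))  # both classes now have 95 samples
```

The noise scale (0.05 here) is an assumption; in practice it should be small relative to the spread of the minority class, or the augmented samples will drift into the majority region.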

Supervised Machine Learning is not necessarily the best approach.

In this case, clustering algorithms may deserve a try. Clusters are built around the "0" class using the appropriate features, and distance measures are then used to define a threshold separating "0" from "1".

Of course, the main drawback is choosing features such that the samples are distributed in a shape that lets the cluster boundaries properly separate the "0" and "1" classes.
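A minimal sketch of the distance-threshold idea, under assumed 2-D Gaussian toy data: a single "cluster" is fitted to the "0" samples only (centroid plus distance spread), and anything farther from the centroid than an assumed three-standard-deviation cutoff is flagged as "1".

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D features: a dense "0" cluster and a few distant "1"s.
x0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
x1 = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(5, 2))

# Build the cluster from the "0" class alone: centroid + distance spread.
centroid = x0.mean(axis=0)
d0 = np.linalg.norm(x0 - centroid, axis=1)
threshold = d0.mean() + 3 * d0.std()  # assumed cutoff: 3 standard deviations

def predict(samples):
    """Flag as "1" any sample farther from the centroid than the threshold."""
    d = np.linalg.norm(samples - centroid, axis=1)
    return (d > threshold).astype(int)

print(predict(x1))  # [1 1 1 1 1] — the distant samples are all flagged
```

Note that no "1" samples are needed to fit the model, which is exactly why this framing (essentially anomaly detection) sidesteps the imbalance problem.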

Related interesting topics to explore: