Authors
Michael Abobor and Darsana P. Josyula, Bowie State University, USA
Abstract
Imbalanced datasets present significant challenges in machine learning. In imbalanced multi-label datasets, the disproportionate distribution of labels stems from the scarcity of data points in the minority classes. This imbalance can bias model predictions, as algorithms tend to favor the majority class, resulting in poor generalization for the minority classes. Moreover, any effort to rebalance one class individually can inadvertently create imbalance across the other classes. This paper introduces a multi-view learning approach that combines pre-trained large language models with embeddings augmented using techniques such as SMOTE, MLSMOTE, and MLTL, addressing the problem of imbalanced multi-label datasets in classification. The resulting dual-input model combines the original tokenized text with augmented embeddings extracted from the penultimate layer of the transformer, allowing the model to learn from both sources of information. This approach preserves the contextual significance of the input text while making it possible to train transformers on the augmented embeddings, thereby tackling the issue of imbalanced multi-label datasets.
Keywords
Imbalanced datasets, Multi-label, Transformer, Augmented Embeddings, Machine Learning
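To illustrate the kind of SMOTE-style embedding augmentation the abstract refers to, the sketch below generates synthetic minority-class embedding vectors by interpolating between a sampled point and one of its k nearest neighbours. This is a simplified, single-class sketch in NumPy, not the paper's implementation; the function name, parameters, and toy data are illustrative assumptions.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic minority-class embeddings by SMOTE-style
    interpolation: pick a minority point, pick one of its k nearest
    neighbours, and sample a point on the segment between them.
    (Simplified sketch, not the full SMOTE/MLSMOTE algorithm.)"""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude each point as its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]  # k nearest neighbours per point
    synth = []
    for _ in range(n_new):
        i = rng.integers(n)                  # random minority sample
        j = nn[i, rng.integers(k)]           # one of its k neighbours
        lam = rng.random()                   # interpolation factor in [0, 1)
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.stack(synth)

# Toy minority-class "embeddings" (hypothetical 2-D vectors for illustration).
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_aug = smote_like_oversample(X_min, n_new=8, rng=0)
```

In the paper's setting, `X_min` would instead be penultimate-layer transformer embeddings of minority-label examples, and the synthetic vectors would feed the augmented-embedding branch of the dual-input model.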