LoRAS: an oversampling approach for imbalanced datasets

Publication type: Article, Preprint
Published: 12 Nov 2020
Embargo end date: 01 Jan 2019
Language: English
Publisher: Springer Science and Business Media LLC
Journal: Machine Learning, volume 110, pages 279-301 (ISSN: 0885-6125, eISSN: 1573-0565)

Authors: Saptarshi Bej; Narek Davtyan; Markus Wolfien; Mariam Nassar; Olaf Wolkenhauer;

DOI: 10.1007/s10994-020-05913-4, 10.48550/arxiv.1908.08346

arXiv: 1908.08346


Abstract

The Synthetic Minority Oversampling TEchnique (SMOTE) is widely used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications for the majority class and affecting the overall balance of the model. In this article, we present an approach that overcomes this limitation of SMOTE, employing Localized Random Affine Shadowsampling (LoRAS) to oversample from an approximated data manifold of the minority class. We benchmarked our algorithm on 14 publicly available imbalanced datasets using three different Machine Learning (ML) algorithms and compared the performance of LoRAS with that of SMOTE and several SMOTE extensions that share with LoRAS the concept of using convex combinations of minority class data points for oversampling. We observed that LoRAS, on average, generates better ML models in terms of F1-Score and Balanced accuracy. Another key observation is that while most of the SMOTE extensions we tested improve the F1-Score with respect to SMOTE on average, they compromise on the Balanced accuracy of a classification model. LoRAS, on the contrary, improves both the F1-Score and the Balanced accuracy, and thus produces better classification models. Moreover, to explain the success of the algorithm, we have constructed a mathematical framework to prove that the LoRAS oversampling technique provides a better estimate for the mean of the underlying local data distribution of the minority class data space.
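
The idea of oversampling with convex combinations of noise-perturbed minority points can be illustrated with a short sketch. This is not the authors' reference implementation: the abstract does not give the algorithmic details, so the Gaussian noise model, the Dirichlet-distributed weights, the brute-force neighbour search, and every name and default below (`loras_like_oversample`, `k`, `n_shadow`, `sigma`, `n_aff`) are assumptions made for illustration only.

```python
import numpy as np


def loras_like_oversample(X_min, n_new, k=5, n_shadow=10, sigma=0.01,
                          n_aff=3, seed=0):
    """Illustrative LoRAS-style oversampler (hypothetical, not the paper's code).

    X_min    : (n, d) array of minority class points
    n_new    : number of synthetic points to generate
    k        : neighbourhood size, including the point itself (assumed)
    n_shadow : noisy copies ("shadowsamples") drawn per neighbourhood point
    sigma    : standard deviation of the assumed Gaussian noise
    n_aff    : number of shadowsamples combined into one synthetic point
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n, d = X_min.shape
    k = min(k, n)

    # Brute-force k nearest neighbours within the minority class.
    sq_dist = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=-1)
    neighbours = np.argsort(sq_dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, d))
    for i in range(n_new):
        # Pick a random minority point and take its local neighbourhood.
        centre = rng.integers(n)
        local = X_min[neighbours[centre]]

        # Shadowsamples: noisy copies of every neighbourhood point.
        noise = rng.normal(0.0, sigma, size=(k, n_shadow, d))
        shadows = (local[:, None, :] + noise).reshape(-1, d)

        # Random convex combination of a few shadowsamples:
        # Dirichlet weights are non-negative and sum to one.
        chosen = shadows[rng.choice(len(shadows), size=n_aff, replace=False)]
        weights = rng.dirichlet(np.ones(n_aff))
        synthetic[i] = weights @ chosen
    return synthetic
```

In this sketch, setting `sigma=0.0` removes the noise, so each synthetic point is a plain convex combination of neighbouring minority points, which is the concept the abstract says LoRAS shares with SMOTE and its extensions. The generated rows would typically be appended to the training set with the minority label before fitting a classifier.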

Related Organizations

Grenoble Alpes University
France
INSTITUT POLYTECHNIQUE DE GRENOBLE
France
Grenoble INP - UGA
France
Stellenbosch University
South Africa
University of Rostock
Germany

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Classification and discrimination; cluster analysis (statistical aspects), Learning and adaptive systems in artificial intelligence, imbalanced datasets, Machine Learning (stat.ML), synthetic sample generation, Machine Learning (cs.LG), oversampling, Statistics - Machine Learning, manifold learning, data augmentation

Impact by BIP!

	selected citations: 143. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
	popularity: Top 1%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
	influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
	impulse: Top 1%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.


Open access routes: Green, hybrid

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
