Learning from Imbalanced Data
| dc.contributor.author | Mohosheu, Md. Salman | |
| dc.contributor.author | Noman, Md. Abdullah Al | |
| dc.contributor.author | Al-Amin | |
| dc.date.accessioned | 2025-03-03T06:13:27Z | |
| dc.date.available | 2025-03-03T06:13:27Z | |
| dc.date.issued | 2024-07-29 | |
| dc.description | Supervised by Mr. Asif Newaz, Lecturer, Department of Electrical and Electronic Engineering (EEE), Islamic University of Technology (IUT), Board Bazar, Gazipur, Bangladesh. This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Electrical and Electronic Engineering, 2024. | en_US |
| dc.description.abstract | Class imbalance is a common challenge in real-world datasets, particularly in critical applications such as medical diagnosis, intrusion detection, fault detection, and disease identification, where positive examples are very rare. As a result, machine learning models often become biased towards the negative class and classify unseen samples as negative. This imbalance favors the majority class, resulting in poor prediction performance for the minority class. This thesis thoroughly evaluates various state-of-the-art methods for addressing class imbalance over 100+ datasets with different imbalance ratios, and a detailed experimental analysis has been carried out to identify patterns in the outcomes. By experimenting with numerous sampling strategies, including under-sampling, over-sampling, and hybrid approaches, this study highlights the strengths and weaknesses of each technique. Additionally, we explore the impact of class overlap, a condition where instances of different classes share similar features, which further complicates predictive modeling. The findings underscore the necessity of combining sampling methods with cost-sensitive learning to improve prediction accuracy and generalization. The research introduces novel hybrid approaches that optimize the balance between the majority and minority classes, demonstrating significant improvements in performance. These advancements contribute valuable insights and methodologies for future research and practical applications in handling imbalanced data. | en_US |
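To illustrate the kind of hybrid strategy the abstract describes (sampling combined with cost-sensitive learning), the following is a minimal sketch using the imbalanced-learn and scikit-learn libraries. The synthetic dataset, sampling ratios, and choice of classifier below are illustrative assumptions, not the thesis's actual experimental setup.

```python
# Illustrative sketch only: hybrid sampling (SMOTE over-sampling + random
# under-sampling) followed by a cost-sensitive classifier. Not taken from the thesis.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Synthetic imbalanced dataset: roughly a 9:1 majority-to-minority ratio.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42)

# Hybrid sampling: over-sample the minority class with SMOTE, then
# under-sample the majority class; the final estimator is cost-sensitive
# via class_weight="balanced", which penalizes minority-class errors more.
model = Pipeline(steps=[
    ("smote", SMOTE(sampling_strategy=0.5, random_state=42)),
    ("under", RandomUnderSampler(sampling_strategy=0.8, random_state=42)),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
model.fit(X_train, y_train)  # samplers are applied only during fitting
y_pred = model.predict(X_test)

# Imbalance-aware metrics rather than plain accuracy.
print("Balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
print("Minority-class F1:", f1_score(y_test, y_pred))
```

The pipeline applies the samplers only when fitting, so evaluation on the untouched test set reflects the original class distribution, which is the standard way to assess such methods.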
| dc.identifier.uri | http://hdl.handle.net/123456789/2338 | |
| dc.language.iso | en | en_US |
| dc.publisher | Department of Electrical and Electronic Engineering (EEE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh | en_US |
| dc.subject | Imbalanced learning, SMOTE, Cost-sensitive learning, Sampling, Hybrid sampling | en_US |
| dc.title | Learning from Imbalanced Data | en_US |
| dc.type | Thesis | en_US |
Files
Original bundle (3 files)
- Name: Fulltext_ EEE_190021218_190021236_190021210_Book - Md. Salman Mohosheu 190021218.pdf; Size: 1.8 MB; Format: Adobe Portable Document Format
- Name: Signature Page_ EEE_190021218_210_236_.pdf; Size: 401.07 KB; Format: Adobe Portable Document Format
- Name: Turnitin Report_ EEE__190021218_210_236_.pdf; Size: 140.24 KB; Format: Adobe Portable Document Format

License bundle (1 file)
- Name: license.txt; Size: 1.71 KB; Format: Item-specific license agreed upon to submission