LLM-Based Auto-Labeling of Developer Discussions: A Comparative Study of Zero-Shot, Sampling Methods, Ensembles and Judge-Guided Strategies

Shakhawat, Chowdhury Ashfaq; Soyeb, Md; Haque, Iftekharul

dc.contributor.author	Shakhawat, Chowdhury Ashfaq
dc.contributor.author	Soyeb, Md
dc.contributor.author	Haque, Iftekharul
dc.date.accessioned	2026-06-24T09:32:52Z
dc.date.issued	2025-10-25
dc.description	Supervised by Mr. Md. Tariquzzaman, Junior Lecturer; Dr. Kamrul Hasan, Professor; and Dr. Hasan Mahmud, Professor, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur, Bangladesh. This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Software Engineering, 2025.
dc.description.abstract	Software bugs have long posed challenges to the delivery of reliable digital services, prompting extensive research into automated bug labeling. While significant advancements have been made, existing approaches often struggle with high false positive rates and face difficulties in practical deployment due to their reliance on structured bug reports. Most contemporary studies utilize structured datasets containing developer-generated bug reports, typically written in natural language. These reports require manual or semi-automated extraction of relevant inputs, a process that is both time-consuming and error-prone. With the emergence of Large Language Models (LLMs), a new research opportunity arises: can LLMs effectively extract failure-inducing inputs from unstructured, community-driven sources such as GitHub, Stack Overflow, and other developer forums? In this study, we propose a novel end-to-end pipeline that leverages LLMs for bug labeling directly from raw, unstructured text. Our methodology focuses on automated labeling, utilizing prompt-based approaches to optimize the performance of generative models. We curated and annotated a dataset comprising 1885 Stack Overflow questions posted between 2023 and 2025, and further validated our approach using a dataset of GitHub issue reports. Through extensive experimentation, we assess the accuracy and robustness of our pipeline across diverse input formats. Unlike existing solutions, our proposed framework emphasizes simplicity, scalability, and cost-effectiveness, making it well-suited for integration into real-world software development workflows.
dc.identifier.uri	https://repository.iutoic-dhaka.edu/handle/123456789/2636
dc.language.iso	en
dc.publisher	Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh
dc.title	LLM-Based Auto-Labeling of Developer Discussions: A Comparative Study of Zero-Shot, Sampling Methods, Ensembles and Judge-Guided Strategies
dc.type	Thesis

Files

Original bundle

Name: 48 Fulltext_ CSE_LLM-BasedAuto-Labeling of Developer Discussions_ 200042123_57_59_.pdf
Size: 1.31 MB
Format: Adobe Portable Document Format
