A Semi-Automated Approach to Generate Bangla Dataset for Question-Answering and Query-Based Text Summarization

dc.contributor.author: Mushabbir, Mueeze Al
dc.contributor.author: Alamgir, Refaat Mohammad
dc.contributor.author: Humdoon, Ahmed Azaz
dc.date.accessioned: 2023-04-28T05:16:13Z
dc.date.available: 2023-04-28T05:16:13Z
dc.date.issued: 2022-05-30
dc.description: Supervised by Dr. Kamrul Hasan, Professor, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh. This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2022. [en_US]
dc.description.abstract: With the vast amount of information available on the Internet, finding answers to questions is as important as ever. In Natural Language Processing research, Question Answering (QA) and Query-based Text Summarization (QBSUM) address this challenge. However, most existing work neglects low-resource languages such as Bangla, so few quality datasets for them are available in the literature. To address this research gap, we propose a semi-automated methodology for generating a Bangla dataset with Natural Questions for three tasks: Question Answering (QA), Query-based Single-Document Text Summarization (SD-QBSUM), and Query-based Multi-Document Text Summarization (MD-QBSUM). We then provide baselines for this dataset on those tasks and compare our dataset with existing ones on various metrics. [en_US]
dc.identifier.uri: http://hdl.handle.net/123456789/1860
dc.language.iso: en [en_US]
dc.publisher: Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur, Bangladesh [en_US]
dc.subject: Question Answering, Query Based Single Document Summarization, Query Based Multi-Document Summarization, Semi-Automatic Approach, Semi-Supervised Method, Natural Question, mT5, mBERT, SBERT [en_US]
dc.title: A Semi-Automated Approach to Generate Bangla Dataset for Question-Answering and Query-Based Text Summarization [en_US]
dc.type: Thesis [en_US]

Files

Original bundle

Name:
Alamgir_fulltext_thesis.pdf
Size:
1.36 MB
Format:
Adobe Portable Document Format
Description:
Full text of the Thesis
Name:
Alamgir_13% turnitin similarity.pdf
Size:
452.18 KB
Format:
Adobe Portable Document Format
Description:
Turnitin report_13% similarity

License bundle

Name:
license.txt
Size:
1.71 KB
Format:
Item-specific license agreed upon to submission
Description:
