Bangla Text Summarization using Deep Learning
Publisher
Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh
Abstract
In this thesis, we present our work on text summarization: the technique of generating concise and precise summaries of voluminous texts while focusing on the sections that convey useful information and preserving the overall meaning. In this age of information, vast quantities of textual data are available from sources such as online documents, articles, news, and user reviews of various products and services. Summaries can present the underlying information in these texts concisely, but producing them by hand for such a large volume of documents is impractical. Neural summarization systems, which leverage the power of deep learning models, can generate summaries automatically. Recently, with the introduction of the Transformer architecture, modern summarization systems have achieved substantial performance gains. Efficient transformer-based summarization systems exist for English and other widely used languages, but not for Bangla. In this research, we present an efficient transformer-based text summarization system for the Bangla language. We use subword encoding to eliminate the problem of rare and unknown words. To train our model, we created a large dataset consisting of 600 thousand news articles. We trained a model with 6 million parameters that is capable of producing accurate summaries, and we evaluated it by observing the quality of its generated summaries.
Description
Supervised by
Dr. Abu Raihan Mostofa Kamal, PhD
Professor
Department of Computer Science and Engineering (CSE)
Islamic University of Technology (IUT), OIC