MixSarc: A Bangla-English Code-Mixed Corpus For Implicit Meaning Identification
Publisher
Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh
Abstract
This thesis focuses on detecting humor, sarcasm, offensiveness, and vulgarity in Bangla-English code-mixed text, an area largely overlooked in existing natural language processing (NLP) research. A novel dataset has been proposed, which will be created by scraping and filtering social media content, followed by manual annotation across four attributes. Two transformer-based approaches were explored on a small scale: multi-class and multi-label text classification. The study also proposes future directions, including dataset balancing, comparative evaluation of transformer models and large language models (LLMs), and the introduction of a SarOff Score to better capture sarcasm-offense overlap. By addressing the complexities of code-mixed tone detection, this work advances NLP in low-resource, multilingual settings.
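The difference between the two classification setups named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the label names are taken from the four attributes listed above, while the logits, the 0.5 threshold, and the function names are assumptions made up for illustration. In the multi-class setup exactly one attribute is chosen per text; in the multi-label setup each attribute is decided independently, so a text can be both sarcastic and offensive.

```python
import math

# The four attributes named in the abstract; the order is an arbitrary assumption.
LABELS = ["humor", "sarcasm", "offensiveness", "vulgarity"]


def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Squash a single raw score into an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_multiclass(logits):
    """Multi-class: the attributes compete, and exactly one is returned."""
    probs = softmax(logits)
    return LABELS[probs.index(max(probs))]


def predict_multilabel(logits, threshold=0.5):
    """Multi-label: each attribute is thresholded on its own (threshold is an assumed value)."""
    return [label for label, x in zip(LABELS, logits) if sigmoid(x) >= threshold]


# Hypothetical classifier-head outputs for one code-mixed sentence (invented numbers).
example_logits = [-1.2, 2.0, 1.5, -0.3]

single_label = predict_multiclass(example_logits)
multi_labels = predict_multilabel(example_logits)
print(single_label)
print(multi_labels)
```

With these invented scores, the multi-class decision keeps only the highest-scoring attribute, while the multi-label decision retains every attribute whose independent probability clears the threshold, which is the kind of overlap a sarcasm-offense measure would need to observe.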
Description
Supervised by
Mr. Md Rafid Haque,
Lecturer,
Department of Computer Science and Engineering (CSE)
Islamic University of Technology (IUT)
Board Bazar, Gazipur, Bangladesh
This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2025.
