BaitBuster-Bangla: A Comprehensive Dataset for Clickbait Detection in Bangla with Multi-Feature and Multi-Modal Analysis. (arXiv:2310.11465v1 [cs.LG])
This study presents a large multi-modal Bangla YouTube clickbait dataset
consisting of 253,070 data points collected through an automated process using
the YouTube API and Python web automation frameworks. The dataset contains 18
diverse features categorized into metadata, primary content, engagement
statistics, and labels for individual videos from 58 Bangla YouTube channels.
The features were preprocessed to remove noise, duplicates, and bias, which
supports more reliable analysis. As the
largest and most robust clickbait corpus in Bangla to date, this dataset
provides significant value for natural language processing and data science
researchers seeking to advance modeling of clickbait phenomena in low-resource
languages. Its multi-modal nature supports analysis of clickbait across
content, user interaction, and linguistic dimensions, which can inform more
sophisticated detection methods with cross-linguistic applications.
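
The abstract mentions denoising and deduplication without giving details. As a rough illustration only, the sketch below shows one way such a step might look for video records. The field names `video_id` and `title`, the choice of Unicode NFC normalization, and the keep-first-occurrence rule are all assumptions made for this example and are not taken from the paper.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", title)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(records, key="video_id"):
    """Keep the first record seen for each key value, preserving order.

    The key name is hypothetical; the dataset's real identifier column
    may be called something else.
    """
    seen = set()
    unique = []
    for record in records:
        if record[key] in seen:
            continue  # a record with this key was already kept
        seen.add(record[key])
        unique.append({**record, "title": normalize_title(record["title"])})
    return unique


# Made-up sample records, not drawn from the dataset.
records = [
    {"video_id": "a1", "title": "  খবর   আজ "},
    {"video_id": "a1", "title": "খবর আজ"},
    {"video_id": "b2", "title": "Breaking\tnews"},
]
cleaned = deduplicate(records)
```

Keeping the first occurrence is only one possible policy; a real pipeline might instead keep the most recently scraped record so that engagement statistics stay current.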
Source: https://arxiv.org/abs/2310.11465