A Mixed Malay–English Language COVID-19 Twitter Dataset: A Sentiment Analysis

Kong, Jeffery TH; Juwono, Filbert; Ngu, Ik Ying; Nugraha, I. Gde Dharma; Maraden, Yan; Wong, Wei Kitt

doi:10.3390/bdcc7020061

dc.contributor.author	Kong, Jeffery TH
dc.contributor.author	Juwono, Filbert
dc.contributor.author	Ngu, Ik Ying
dc.contributor.author	Nugraha, I. Gde Dharma
dc.contributor.author	Maraden, Yan
dc.contributor.author	Wong, Wei Kitt
dc.date.accessioned	2024-05-30T02:54:29Z
dc.date.available	2024-05-30T02:54:29Z
dc.date.issued	2023
dc.identifier.citation	Kong, J.T.H. and Juwono, F. and Ngu, I.Y. and Nugraha, I.G.D. and Maraden, Y. and Wong, W.K. 2023. A Mixed Malay–English Language COVID-19 Twitter Dataset: A Sentiment Analysis. Big Data and Cognitive Computing. 7 (2): 61.
dc.identifier.uri	http://hdl.handle.net/20.500.11937/95202
dc.identifier.doi	10.3390/bdcc7020061
dc.description.abstract	Social media has evolved into a platform for the dissemination of information, including fake news. A great deal of false information circulates about the Coronavirus Disease 2019 (COVID-19) pandemic, for example regarding vaccination. In this paper, we focus on sentiment analysis for Malaysian COVID-19-related news on social media such as Twitter. Tweets in Malaysia are often a combination of Malay, English, and Chinese, with plenty of short forms, symbols, emojis, and emoticons within the maximum length of a tweet. The contributions of this paper are twofold. Firstly, we built a multilingual COVID-19 Twitter dataset comprising tweets written from 1 September 2021 to 12 December 2021. In particular, we collected 108,246 tweets, with over (Formula presented.) in the Malay language, (Formula presented.) in English, (Formula presented.) in Chinese, and (Formula presented.) in other languages. Secondly, we manually annotated 11,568 tweets with one of three sentiment classes (positive, negative, or neutral) to develop a Malay-language sentiment analysis tool. For this purpose, we applied a data compression method using Byte-Pair Encoding (BPE) on the texts and used two deep learning approaches, i.e., the Multilingual Bidirectional Encoder Representations from Transformers (M-BERT) model and a convolutional neural network (CNN). BPE tokenization is used to encode rare and unknown words into smaller meaningful subwords. For the CNN, we converted the labeled tweets into image files. Our experiments explored different BPE vocabulary sizes with our BPE-Text-to-Image-CNN and BPE-M-BERT models. The results show that the optimal vocabulary size for BPE is 12,000; larger values contribute little to the F1-score. Overall, our results show that BPE-M-BERT slightly outperforms the CNN model, indicating that the pre-trained M-BERT network has an advantage on our multilingual dataset.
dc.publisher	MDPI
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.title	A Mixed Malay–English Language COVID-19 Twitter Dataset: A Sentiment Analysis
dc.type	Journal Article
dcterms.source.volume	7
dcterms.source.number	2
dcterms.source.issn	2504-2289
dcterms.source.title	Big Data and Cognitive Computing
dc.date.updated	2024-05-30T02:54:29Z
curtin.department	Global Curtin
curtin.accessStatus	Open access
curtin.faculty	Global Curtin
curtin.contributor.orcid	Ngu, Ik Ying [0000-0001-6385-2831]
curtin.contributor.orcid	Kong, Jeffery TH [0000-0001-7453-5532]
curtin.identifier.article-number	61
dcterms.source.eissn	2504-2289
curtin.contributor.scopusauthorid	Ngu, Ik Ying [57195289487]
curtin.repositoryagreement	V3

Files in this item

Name: 94986.pdf
Size: 539.5Kb
Format: PDF

This item appears in the following Collection(s)

Curtin Research Publications

Except where otherwise noted, this item's license is described as http://creativecommons.org/licenses/by/4.0/
