Sean, Benhur (2022) Offensive language detection in Tamil YouTube comments by adapters and cross-domain knowledge transfer. Computer Speech & Language, 76. ISSN 0885-2308
Abstract
Over the past few years, researchers have been focusing on the identification of offensive
language on social networks. In places where English is not the primary language, social
media users tend to post and comment in a code-mixed form of text. This complicates the
identification of offensive texts, and the limited resources available for languages such as
Tamil make the task considerably more challenging. This study runs multiple experiments to
detect potentially offensive texts in YouTube comments,
made available through the HASOC-Offensive Language Identification track in Dravidian CodeMix FIRE 2021. To detect the offensive texts, models based on traditional machine learning
techniques, namely Bernoulli Naïve Bayes, Support Vector Machine, Logistic Regression, and K-Nearest Neighbor, were created. In addition, pre-trained multilingual transformer-based natural
language processing models such as mBERT, MuRIL (Base and Large), and XLM-RoBERTa
(Base and Large) were also evaluated. These models were trained both by full fine-tuning and
with adapters. In essence, the two approaches accomplish the same goal, but adapters work by
adding small layers to the pre-trained model and freezing the pre-trained weights, so that
only the added layers are trained. This study
shows that transformer-based models outperform machine learning approaches. Furthermore,
in low-resource languages such as Tamil, adapter-based techniques surpass fine-tuned models
in terms of both time and efficiency.
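The adapter idea described above can be illustrated with a minimal sketch. It assumes a common bottleneck design (down-projection, non-linearity, up-projection, residual connection); the abstract does not state which adapter architecture the study used, and the class name `BottleneckAdapter` and its sizes are hypothetical, chosen only for illustration.

```python
import random


def linear(x, weight, bias):
    """Compute weight @ x + bias for one vector x (weight is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


class BottleneckAdapter:
    """Hypothetical adapter layer inserted after a frozen transformer sub-layer.

    Only the weights held by this object would be trained; the pre-trained
    model that produces the hidden state `h` stays frozen.
    """

    def __init__(self, hidden_size, bottleneck_size, seed=0):
        rng = random.Random(seed)
        # Down-projection: hidden_size -> bottleneck_size, small random init.
        self.down_w = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
                       for _ in range(bottleneck_size)]
        self.down_b = [0.0] * bottleneck_size
        # Up-projection: bottleneck_size -> hidden_size, zero init so the
        # adapter starts out as an identity function.
        self.up_w = [[0.0] * bottleneck_size for _ in range(hidden_size)]
        self.up_b = [0.0] * hidden_size

    def forward(self, h):
        # Project down, apply ReLU, project back up, add the residual.
        z = [max(0.0, v) for v in linear(h, self.down_w, self.down_b)]
        delta = linear(z, self.up_w, self.up_b)
        return [hi + di for hi, di in zip(h, delta)]


# A hidden state as it might come out of a frozen pre-trained layer.
hidden_state = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 3.0]
adapter = BottleneckAdapter(hidden_size=8, bottleneck_size=2)
adapted = adapter.forward(hidden_state)
```

Because the up-projection starts at zero, the untrained adapter passes the hidden state through unchanged, so training begins from the behaviour of the pre-trained model.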
Of all the adapter-based approaches, XLM-RoBERTa (Large) was found to have the highest
accuracy of 88.5%. The study also demonstrates that, compared to fine-tuning the models, the
adapter models require training fewer parameters. In addition, the tests revealed that the
proposed models performed notably well against a cross-domain data set.
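The claim that adapters train fewer parameters can be made concrete with a rough back-of-the-envelope calculation. The figures below are illustrative assumptions, not values reported by the study: a hidden size of 1024 and 24 layers (the published XLM-RoBERTa Large configuration), an approximate base model size of 550 million parameters, one adapter per layer, and a hypothetical bottleneck size of 64.

```python
def adapter_params(hidden_size, bottleneck_size):
    """Weights and biases of one bottleneck adapter (down- plus up-projection)."""
    down = hidden_size * bottleneck_size + bottleneck_size
    up = bottleneck_size * hidden_size + hidden_size
    return down + up


def trainable_fraction(base_params, hidden_size, num_layers,
                       bottleneck_size, adapters_per_layer=1):
    """Return (added adapter parameters, share of all parameters trained)."""
    added = (num_layers * adapters_per_layer
             * adapter_params(hidden_size, bottleneck_size))
    return added, added / (base_params + added)


# Assumed, approximate figures for illustration only.
added, fraction = trainable_fraction(
    base_params=550_000_000,  # approximate size of the frozen base model
    hidden_size=1024,
    num_layers=24,
    bottleneck_size=64,       # hypothetical choice, not from the study
)
print(f"adapter parameters: {added:,}")
print(f"share of parameters trained: {fraction:.2%}")
```

Under these assumptions the adapters account for well under one percent of the total, whereas full fine-tuning updates every parameter of the base model.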
Item Type: | Article |
---|---|
Uncontrolled Keywords: | Adapter; Cross-domain analysis; Finetuning; HASOC; Multilingual; Machine learning models; Offensive texts; Transformer models |
Depositing User: | Mr Team Mosys |
Date Deposited: | 06 Oct 2022 05:53 |
Last Modified: | 06 Oct 2022 05:53 |
URI: | http://ir.psgcas.ac.in/id/eprint/1573 |