Hate Speech Detection of Arabic Shorttext


Abdullah Aref (1), Rana Husni Al Mahmoud (2), Khaled Taha (3) and Mahmoud Al-Sharif (3); (1) Princess Sumaya University for Technology, Jordan, (2) University of Jordan, Jordan and (3) Trafalgar AI, Jordan


Sentiment analysis aims to automatically extract the opinions expressed in a text and determine its sentiment. In this paper, we introduce the first publicly available Twitter dataset on Sunnah and Shia (SSTD), addressing religious hate speech, which is a sub-problem of general hate speech. We further provide a detailed review of the data collection process and our annotation guidelines, so that the dataset can be annotated reliably. We applied several stand-alone classification algorithms to the Twitter hate speech dataset, including Random Forest, Complement Naive Bayes, Decision Tree, and SVM, as well as two deep learning methods, CNN and RNN. We further study the influence of the word embedding dimensions of FastText and word2vec. In all our experiments, the classification algorithms are trained using a random split of the data (66% for training and 34% for testing). The training and test sets were obtained by stratified sampling of the original dataset. CNN-FastText achieves the highest F-measure (52.0%), followed by CNN-Word2vec (49.0%), showing that neural models with FastText word embeddings outperform classical feature-based models.
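The evaluation protocol described above (a stratified 66%/34% random split, scored by F-measure) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the function names `stratified_split` and `f_measure`, the fixed seed, and the choice of a single-class F-measure are all assumptions, since the abstract does not say how the F-measure was averaged across classes.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.66, seed=0):
    """Split example indices into train/test sets, class by class.

    Each class is shuffled and cut separately, so both sets keep
    roughly the same class proportions as the full dataset.
    The seed and function name are illustrative assumptions.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


def f_measure(true_labels, predicted_labels, positive):
    """F-measure (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Calling `stratified_split(labels)` on a list of per-tweet labels returns two disjoint index lists; a classifier would be fitted on the first and scored with `f_measure` on the second. Splitting per class keeps a minority class, such as hateful tweets, represented in both sets.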


Hate speech, Dataset, Text classification, Sentiment analysis.

Volume 10, Number 5