N-Gram-Based Serbian Text Classification


Petar Prvulović¹, Nemanja Radosavljević¹, Dušan Vujošević¹, Dhinaharan Nagamalai², Jelena Vasiljević¹; ¹Union University, Serbia; ²Wireilla, Australia


Natural language processing is an active area of research with applications in a variety of fields. Low-resource languages are a challenge because they lack curated datasets, stemmers and other resources used in text processing. A statistical approach is an alternative that can compensate for the lack of rule-based implementations. This paper presents a model for classifying unstructured text in the Serbian language. The model uses n-gram-based stemming to create document attribute vectors. Vectors are built from 3-, 4- and 5-grams. Vector reduction is tested with two selection criteria, n-gram entropy and number of occurrences, and two vector lengths, 1000 and 2000 n-grams. A support vector machine is used to classify the documents. The model is trained and tested on a dataset collected from a Serbian news portal, and it achieves a classification accuracy of over 80%. The presented model provides a good basis for a range of applications in business decision automation for low-resource languages.
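The feature-construction steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how n-grams are cut from words, how entropy is computed, or whether high- or low-entropy n-grams are kept. The sketch assumes sliding character n-grams inside each word, Shannon entropy of an n-gram's occurrence distribution across documents, and selection of the highest-scoring n-grams. All function names (`char_ngrams`, `ngram_entropy`, `build_vocabulary`, `vectorize`) are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Split text into words and return sliding character n-grams per word.

    Assumption: words of length <= n are kept whole as a single token.
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def ngram_entropy(doc_counts, gram):
    """Shannon entropy (bits) of one n-gram's occurrences across documents."""
    counts = [c[gram] for c in doc_counts if c[gram] > 0]
    total = sum(counts)
    return -sum((k / total) * math.log2(k / total) for k in counts)


def build_vocabulary(docs, n, size, criterion="count"):
    """Keep the `size` best n-grams by total occurrences or by entropy.

    Assumption: higher score is better for both criteria.
    """
    doc_counts = [Counter(char_ngrams(d, n)) for d in docs]
    totals = Counter()
    for c in doc_counts:
        totals.update(c)
    if criterion == "count":
        score = lambda g: totals[g]
    else:
        score = lambda g: ngram_entropy(doc_counts, g)
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(totals, key=lambda g: (-score(g), g))
    return ranked[:size]


def vectorize(doc, vocabulary, n):
    """Map a document to a vector of n-gram counts over the vocabulary."""
    counts = Counter(char_ngrams(doc, n))
    return [counts[g] for g in vocabulary]
```

With `size` set to 1000 or 2000 and `n` set to 3, 4 or 5, the resulting vectors could then be passed to any support vector machine implementation, for example `sklearn.svm.SVC` from scikit-learn; the abstract does not name the library the authors used.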


N-gram stemming, Serbian language, Unstructured natural text categorization

Volume 13, Number 16