Sentiment analysis is one of the most common tasks in NLP. The goal is to predict the sentiment (usually negative or positive) of a given text. For example, “This is a really good movie” has positive sentiment. There are many methods for this task. One family is based on dictionaries and rules, but it requires people to build those dictionaries and rules by hand. A more automatic approach is based on machine learning algorithms, which let us assess texts quickly and accurately. We could also use deep learning models such as ELMo or the attention-based BERT and RoBERTa, but in that case it is hard to interpret why the model made a particular decision. So we choose a more interpretable method, n-skip-m-grams, which combines dictionaries (these can be generated automatically) with ML methods.
There are many datasets for sentiment analysis, especially in English. One of the most famous is the IMDB movie review dataset. In this article, however, we use a Russian-language dataset: a new dataset from a Kaggle challenge. It consists of two JSON files, train and test. We use the train JSON for both training and validation. It contains 8263 Russian-language news items: 1434 negative, 2795 positive, and 4034 neutral.
As predefined n-grams we use files from the Aigents system. From the official platform overview:
1. Aigents Social and Media Intelligence Platform for Business joins heterogeneous social and online media sources, blockchains and payment systems and couples them with artificial intelligence to find and track changes in the field of information to let its force be with you.
2. Aigents Personal Artificial Intelligence is serving as a magic glass ball in the world of social and online networks to recognize one’s preferences, find what they need and help managing connections.
In the project repository, we can find files with negative and positive n-grams for Russian and English texts. The Russian files contain 3341 positive n-grams (1 to 5 words per n-gram) and 9745 negative n-grams.
To use ML methods we need to transform text data into numerical data (vectors). For this, we use the n-skip-m-grams method.
The idea is to split a text into groups of words. A group can contain a single word (a 1-gram), but we also group words into pairs and longer sequences, because a combination of words can carry a different sentiment than the individual words. For example, “cold” has a rather negative sentiment, but “cold drink” is positive. Related words are not always adjacent, however, and this is where the skip-m part of the method comes in: we take one word of an n-gram (where n > 1), then skip from 1 to m words, and take the next word of the n-gram.
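A minimal sketch of this extraction step is shown here. The function name `skip_ngrams` is ours, and we assume that the gap between two consecutive words of an n-gram may be anywhere from 0 to m tokens, so ordinary contiguous n-grams are produced as well.

```python
def skip_ngrams(tokens, n, m):
    """Return all n-grams (as tuples) in which consecutive words are
    separated by at most m skipped tokens. Assumption: a gap of 0 is
    allowed, so plain contiguous n-grams are included too."""
    grams = []

    def extend(indices):
        # A complete n-gram: store the words at the chosen positions.
        if len(indices) == n:
            grams.append(tuple(tokens[i] for i in indices))
            return
        last = indices[-1]
        # The next word may follow directly or after skipping up to m tokens.
        for nxt in range(last + 1, min(last + m + 2, len(tokens))):
            extend(indices + [nxt])

    for start in range(len(tokens)):
        extend([start])
    return grams


tokens = ["cold", "refreshing", "drink"]
contiguous = skip_ngrams(tokens, n=2, m=0)
with_skips = skip_ngrams(tokens, n=2, m=1)
print(contiguous)
print(with_skips)
```

With m = 0 the pair ("cold", "drink") is missed because another word sits between the two; with m = 1 it is recovered.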
After that we can proceed in different ways. First, we can build a dictionary of n-grams and then encode each text against it, i.e. turn the text into a vector whose length equals the number of n-grams in the dictionary, where position i holds 1 if the i-th n-gram of the dictionary occurs in the text (one-hot encoding), or the number of times it occurs (count encoding). The dictionary itself can be defined in different ways: we can use a predefined dictionary, take the most frequent n-grams, or apply a more sophisticated selection.
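The encoding can be sketched as follows. The helper names `encode` and `most_frequent_dictionary` are hypothetical, and n-grams are assumed to be represented as tuples of words.

```python
from collections import Counter


def encode(text_ngrams, dictionary, binary=True):
    """Encode one text as a vector aligned with `dictionary`.
    binary=True  -> 1 if the n-gram occurs in the text, else 0
    binary=False -> number of occurrences of the n-gram in the text"""
    counts = Counter(text_ngrams)
    if binary:
        return [1 if counts[g] > 0 else 0 for g in dictionary]
    return [counts[g] for g in dictionary]


def most_frequent_dictionary(corpus_ngrams, size):
    """Build a dictionary from the `size` most frequent n-grams of a corpus,
    given as a list of per-text n-gram lists."""
    total = Counter()
    for grams in corpus_ngrams:
        total.update(grams)
    return [g for g, _ in total.most_common(size)]


dictionary = [("good",), ("bad",), ("cold", "drink")]
text = [("good",), ("movie",), ("good",)]
print(encode(text, dictionary))                # binary presence vector
print(encode(text, dictionary, binary=False))  # count vector
```

A predefined dictionary such as the Aigents n-gram lists can be passed in place of `dictionary` without changing the encoding step.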
We can then feed these vectors to different models to predict sentiment.
In NLP, data preprocessing is very important, especially in Russian, where a word can take many different endings. The words “красивый” and “красивые” are different forms of the same word, so we need to convert each form to a single base form (the lemma). In this work, we use the pymorphy2 library for lemmatisation. Texts also contain many words that carry no sentiment (prepositions, for example), the so-called stopwords. At the preprocessing stage we throw these words out. We load the Russian stopword list from the NLTK library.
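A rough sketch of this preprocessing step follows. To keep it runnable without extra downloads, the lemmatizer and the stopword list are passed in as arguments and replaced by toy stand-ins; the comments show how pymorphy2 and NLTK would be plugged in. The function name `preprocess`, the regex tokenizer, and the toy data are our assumptions, not the exact code of the project.

```python
import re


def preprocess(text, lemmatize, stopwords):
    """Lowercase, tokenize, lemmatize, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    result = []
    for token in tokens:
        lemma = lemmatize(token)
        # Drop the word if either its surface form or its lemma is a stopword.
        if token in stopwords or lemma in stopwords:
            continue
        result.append(lemma)
    return result


# With the real libraries one would use, for example:
#   import pymorphy2
#   morph = pymorphy2.MorphAnalyzer()
#   lemmatize = lambda w: morph.parse(w)[0].normal_form
#   from nltk.corpus import stopwords as nltk_stopwords  # needs nltk.download("stopwords")
#   stopwords = set(nltk_stopwords.words("russian"))

# Toy stand-ins (assumptions for illustration only).
TOY_LEMMAS = {"красивые": "красивый", "фильмы": "фильм"}
TOY_STOPWORDS = {"в", "и", "на"}


def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)


print(preprocess("Красивые фильмы в кино", toy_lemmatize, TOY_STOPWORDS))
```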
The simplest approach counts the positive and the negative n-grams in a text and compares the two counts: the text is assigned the sentiment whose n-grams are more numerous.
So it is equivalent to logistic regression with weights of equal magnitude (positive for positive n-grams, negative for negative ones).
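The counting rule, and its reading as a linear model with weights +1 and -1, can be sketched like this. The function names are hypothetical, and returning "neutral" on a tie is our assumption about how equal counts might be handled.

```python
def count_classifier(text_ngrams, positive, negative):
    """Label a text by comparing how many of its n-grams are in the
    positive and in the negative dictionary."""
    pos = sum(1 for g in text_ngrams if g in positive)
    neg = sum(1 for g in text_ngrams if g in negative)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # assumption: ties are treated as neutral


def linear_score(text_ngrams, positive, negative):
    """The same rule as a linear model: weight +1 for every positive
    n-gram, -1 for every negative one, 0 for everything else."""
    return sum((g in positive) - (g in negative) for g in text_ngrams)


positive = {("good",), ("cold", "drink")}
negative = {("bad",), ("cold",)}
text = [("cold",), ("drink",), ("cold", "drink"), ("good",)]
print(count_classifier(text, positive, negative))
print(linear_score(text, positive, negative))
```

The sign of `linear_score` always agrees with the label returned by `count_classifier`, which is why the counting rule behaves like a linear classifier whose weights all have the same magnitude.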
But some words are more strongly related to sentiment than others, so the absolute values of their weights should be larger. We therefore also report the results of the LogisticRegression model from sklearn.
In logistic regression, different features have different weights. We select the 20 most positive n-grams (those with the largest weights) and the 20 most negative ones.
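As an illustration, the following sketch fits sklearn's `LogisticRegression` on a tiny made-up binary problem and ranks the dictionary entries by their learned weights. The helper `top_weighted_ngrams` and the toy data are assumptions. With three classes, as in our dataset, `coef_` has one row per class, and the same ranking would be applied to the row of the class of interest.

```python
from sklearn.linear_model import LogisticRegression


def top_weighted_ngrams(model, dictionary, k=20):
    """Return the k n-grams with the largest weights and the k n-grams
    with the smallest weights of a fitted binary model."""
    weights = model.coef_[0]  # one weight per dictionary entry
    ranked = sorted(zip(dictionary, weights), key=lambda pair: pair[1])
    most_negative = ranked[:k]
    most_positive = ranked[::-1][:k]
    return most_positive, most_negative


# Toy data: each row is a binary vector over the dictionary; label 1 = positive.
dictionary = ["good", "great", "bad", "awful"]
X = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
y = [1, 1, 1, 0, 0, 0]

model = LogisticRegression().fit(X, y)
most_positive, most_negative = top_weighted_ngrams(model, dictionary, k=2)
print(most_positive)
print(most_negative)
```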
We also try different values of the parameters n and m, and evaluate each combination with the logistic regression model.
In this work we showed that:
1. The predefined n-gram dictionary from Aigents is enough to get results. It can be very useful when we have little data.
2. The best result is obtained with n = 4 and m = 3 (score 0.897), i.e. skipping makes sense.
3. The simple discriminator gives a good result, but logistic regression gives a better one. Different n-grams have different influence, which means it makes sense to attach weights to the tokens in positive/negative dictionaries.