Learn sentiment evaluation on n-skip-m-grams.

Photo by Roman Kraft on Unsplash

Introduction

Dataset

We also use predefined n-gram lists from the Aigents system. The official platform overview describes it as follows:

1. Aigents Social and Media Intelligence Platform for Business joins heterogeneous social and online media sources, blockchains and payment systems and couples them with artificial intelligence to find and track changes in the field of information to let its force be with you.
2. Aigents Personal Artificial Intelligence is serving as a magic glass ball in the world of social and online networks to recognize one’s preferences, find what they need and help managing connections.

The project repository contains files with negative and positive n-grams for Russian and English texts. The Russian files contain 3341 positive n-grams (1–5 words per n-gram) and 9745 negative n-grams.

Method

Example of skip-3–2gram.
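The text does not spell out the exact definition of skip-grams used here, so the following is only a minimal sketch under one common definition: a gram of `gram_len` tokens taken in order from the text, with at most `max_skip` tokens skipped in total between them. The function name and both parameter names are illustrative assumptions, and how they map onto the n and m of this article is left open.

```python
from itertools import combinations


def skip_grams(tokens, gram_len, max_skip):
    """Return all grams of `gram_len` tokens with at most `max_skip` skipped tokens.

    Assumed definition (not taken from the article): tokens keep their
    original order, and the total number of skipped tokens inside one
    gram is at most `max_skip`. With max_skip=0 this yields plain n-grams.
    """
    grams = []
    for start in range(len(tokens)):
        # The remaining gram_len - 1 tokens are picked from this window.
        window = range(start + 1, min(start + gram_len + max_skip, len(tokens)))
        for rest in combinations(window, gram_len - 1):
            grams.append(tuple(tokens[j] for j in (start, *rest)))
    return grams


tokens = "a b c d".split()
plain_bigrams = skip_grams(tokens, gram_len=2, max_skip=0)
skip_bigrams = skip_grams(tokens, gram_len=2, max_skip=1)
```

With `max_skip=1` the bigrams of `a b c d` also include pairs such as `("a", "c")`, which a plain bigram extractor would miss.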

After that we can apply different methods. First, we can collect the n-grams into a custom dictionary and then encode each text against it, one-hot style: every text becomes a vector whose length equals the number of n-grams in the dictionary, where position i holds 1 if the i-th n-gram of the dictionary occurs in the text (or, alternatively, the number of times it occurs in the text). The dictionary can be defined in different ways: we can use a predefined dictionary, take the most frequent n-grams, or use more elaborate selection methods.
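The encoding described above can be sketched in a few lines. This is an illustration, not the article's code: the function name `encode`, the `binary` flag and the toy dictionary are assumptions made for the example.

```python
from collections import Counter


def encode(grams, dictionary, binary=True):
    """Encode the grams of one text as a vector aligned with `dictionary`.

    Position i is 1 if the i-th dictionary entry occurs in the text
    (binary=True), or the number of its occurrences (binary=False).
    """
    counts = Counter(grams)
    if binary:
        return [1 if entry in counts else 0 for entry in dictionary]
    return [counts[entry] for entry in dictionary]


# Toy dictionary and toy text, for illustration only.
dictionary = [("good",), ("not", "good"), ("bad",)]
grams = [("good",), ("good",), ("bad",)]

binary_vector = encode(grams, dictionary)
count_vector = encode(grams, dictionary, binary=False)
```

The binary variant only records presence, while the count variant keeps how often each dictionary entry occurs; which one works better is an empirical question.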

We can then feed these vectors to different models to predict sentiment.

Preprocessing

Evaluation

Result of the simple classifier.

So it is equivalent to logistic regression with equal weights.
But some n-grams are more strongly related to sentiment than others, so the absolute value of their weights should be larger. We therefore also report the result of the LogisticRegression model from sklearn.
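The equivalence with an equal-weight logistic regression can be illustrated with a small sketch. The rule of the simple classifier is not shown in the text, so this assumes it counts dictionary hits: +1 for every positive n-gram found and -1 for every negative one. All names and the toy dictionaries below are hypothetical.

```python
import math


def count_score(grams, positive, negative):
    """Number of positive dictionary hits minus number of negative hits."""
    return sum(g in positive for g in grams) - sum(g in negative for g in grams)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(grams, positive, negative):
    """Classify by the sign of the count score.

    Passing the score through a sigmoid and thresholding at 0.5 is the same
    decision as checking score > 0, which is a logistic regression whose
    weights are all +1 (positive entries) or -1 (negative entries).
    Ties (score == 0) go to "negative" here, an arbitrary choice.
    """
    score = count_score(grams, positive, negative)
    return "positive" if sigmoid(score) > 0.5 else "negative"


# Toy dictionaries, for illustration only.
positive = {"good", "very good"}
negative = {"bad"}
label = predict(["good", "very good", "bad"], positive, negative)
```

A trained logistic regression replaces the fixed +1 and -1 with learned weights, which lets strongly polar n-grams count for more than weakly polar ones.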

Result of the logistic regression model.

In logistic regression, different features have different weights. We select the 20 most positive n-grams (largest weights) and the 20 most negative n-grams (most negative weights).
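Picking the most influential n-grams amounts to sorting the features by weight. The sketch below works on plain lists; the function name and the toy weights are made up for the example. For a fitted binary scikit-learn `LogisticRegression`, the weights would come from `model.coef_[0]`, in the same order as the dictionary used for encoding.

```python
def most_influential(ngrams, weights, k=20):
    """Return the k most positive and the k most negative n-grams with weights."""
    ranked = sorted(zip(weights, ngrams), key=lambda pair: pair[0])
    most_negative = [(gram, weight) for weight, gram in ranked[:k]]
    most_positive = [(gram, weight) for weight, gram in ranked[::-1][:k]]
    return most_positive, most_negative


# Toy weights, not results from the article.
ngrams = ["good", "bad", "ok", "awful", "great"]
weights = [1.5, -1.2, 0.1, -2.0, 2.3]
top_positive, top_negative = most_influential(ngrams, weights, k=2)
```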

Most influential n-grams.

We also try different values of the parameters n and m, evaluating each setting with the logistic regression model.
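Such a parameter sweep could look roughly like the sketch below, which uses scikit-learn's `CountVectorizer` with a custom analyzer, `LogisticRegression` and `cross_val_score`. It is not the article's code: the skip-gram definition (at most `max_skip` skipped tokens per gram), the whitespace tokenization, the cross-validation setup and all function names are assumptions, and the correspondence of `gram_len` and `max_skip` to the article's n and m is left open.

```python
from itertools import combinations

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def make_analyzer(gram_len, max_skip):
    """Build an analyzer that turns a text into skip-gram strings."""
    def analyzer(text):
        tokens = text.lower().split()  # naive whitespace tokenization
        grams = []
        for start in range(len(tokens)):
            window = range(start + 1, min(start + gram_len + max_skip, len(tokens)))
            for rest in combinations(window, gram_len - 1):
                grams.append(" ".join(tokens[j] for j in (start, *rest)))
        return grams
    return analyzer


def grid_scores(texts, labels, gram_lens, max_skips, folds=3):
    """Mean cross-validated accuracy for every (gram_len, max_skip) pair."""
    scores = {}
    for gram_len in gram_lens:
        for max_skip in max_skips:
            # The vectorizer is inside the pipeline, so the vocabulary
            # is rebuilt on the training part of every fold.
            pipeline = make_pipeline(
                CountVectorizer(analyzer=make_analyzer(gram_len, max_skip)),
                LogisticRegression(max_iter=1000),
            )
            fold_scores = cross_val_score(pipeline, texts, labels, cv=folds)
            scores[(gram_len, max_skip)] = float(fold_scores.mean())
    return scores
```

Calling `grid_scores(texts, labels, gram_lens=[1, 2, 3], max_skips=[0, 1, 2])` returns a dictionary keyed by parameter pair, from which the best setting can be picked with `max(scores, key=scores.get)`.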

Results for different values of n and m.

Conclusion

1. The predefined dictionary of n-grams from Aigents yields workable results. It can be very useful when little data is available.

2. The best result is obtained with n = 4, m = 3 (score 0.897), i.e. skipping makes sense.

3. The simple classifier gives a good result, but logistic regression gives a better one. Different n-grams have different influence, so it makes sense for a positive/negative dictionary to provide weights for its entries.