Sentence Emotion Detection
Given a dataset of Indonesian short-story sentences, this project aims to create a model that determines the emotion of a sentence using a classification algorithm.
Methodology
The data was first pre-processed using regex, word stemming, and stop-word removal. Then, Bag-of-Words and TF-IDF representations were created as input for the model. Two models were trained with the same algorithm, one using Bag-of-Words and the other using TF-IDF.
- Dataset: 1,000 rows (sentences) with labelled emotions
- Model: Multinomial Naive Bayes using Bag-of-Words and TF-IDF
- Data splitting: 80% training data and 20% test data
Data Exploration
It is essential to take a look at the dataset and examine the label proportions within it, as shown below.
There are 6 emotions in the dataset:
- Happy (senang)
- Sad (sedih)
- Surprised (terkejut)
- Angry (marah)
- Scared (takut)
- Disgusted (jijik)
Data Pre-processing
The sentences are preprocessed by:
- Removing digits
- Word stemming
- Removing stop words
Using a regex function to remove digits:
Before: 'Jatuhnya Jayakarta ke tangan Kompeni Belanda pada tahun 1619 membuat banyak ulama marah.'
After: 'Jatuhnya Jayakarta ke tangan Kompeni Belanda pada tahun membuat banyak ulama marah.'
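A minimal sketch of this step in Python, assuming the standard `re` module (the helper name `remove_digits` is illustrative, not from the original code):

```python
import re

def remove_digits(sentence: str) -> str:
    """Strip digits, then collapse the leftover double spaces."""
    no_digits = re.sub(r'\d+', '', sentence)
    return re.sub(r'\s+', ' ', no_digits).strip()

print(remove_digits('Jatuhnya Jayakarta ke tangan Kompeni Belanda '
                    'pada tahun 1619 membuat banyak ulama marah.'))
# Jatuhnya Jayakarta ke tangan Kompeni Belanda pada tahun membuat banyak ulama marah.
```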
Using the Indonesian corpus in the Sastrawi library to obtain word stems:
Before: 'Jatuhnya Jayakarta ke tangan Kompeni Belanda pada tahun membuat banyak ulama marah.'
After: 'jatuh jayakarta ke tangan kompeni belanda pada tahun buat banyak ulama marah'
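A sketch using the PySastrawi package, whose `StemmerFactory` provides the Indonesian stemmer; note that `stem()` also lowercases the text and strips punctuation, which matches the example above:

```python
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

stemmer = StemmerFactory().create_stemmer()

# stem() lowercases and drops punctuation in addition to stemming
print(stemmer.stem('Jatuhnya Jayakarta ke tangan Kompeni Belanda '
                   'pada tahun membuat banyak ulama marah.'))
# jatuh jayakarta ke tangan kompeni belanda pada tahun buat banyak ulama marah
```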
Using the Indonesian corpus in the Sastrawi library to remove stop words:
Indonesian stop words include: 'yang', 'untuk', 'pada', 'ke', 'para', 'namun'
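A sketch of stop-word removal with the same library; the exact output depends on Sastrawi's built-in stop-word list:

```python
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

remover = StopWordRemoverFactory().create_stop_word_remover()

# 'ke' and 'pada' are in Sastrawi's default Indonesian stop-word list
print(remover.remove('jatuh jayakarta ke tangan kompeni belanda '
                     'pada tahun buat banyak ulama marah'))
# jatuh jayakarta tangan kompeni belanda tahun buat banyak ulama marah
```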
Feature Extraction
- Bag-of-Words: each sentence represented as a vector of raw word counts
- TF-IDF: word counts weighted by inverse document frequency, see the sketch after this list
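Both representations can be built with scikit-learn's vectorizers, as sketched below on two toy preprocessed sentences (the original implementation may construct them differently):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-ins for the preprocessed sentences
sentences = [
    'jatuh jayakarta tangan kompeni belanda tahun buat banyak ulama marah',
    'raja senang dengar kabar itu',
]

# Bag-of-Words: each sentence becomes a vector of raw word counts
X_bow = CountVectorizer().fit_transform(sentences)

# TF-IDF: counts reweighted by inverse document frequency, so words
# that appear in many sentences contribute less
X_tfidf = TfidfVectorizer().fit_transform(sentences)
```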
Model Training
The model used is Multinomial Naive Bayes, with Laplace smoothing applied.
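A minimal training sketch continuing from the feature matrix above, assuming scikit-learn's `MultinomialNB`; `alpha=1.0` is the Laplace (add-one) smoothing term, and the toy labels stand in for the dataset's emotion column:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Toy labels matching the two toy sentences above
y = ['marah', 'senang']

# 80% training data, 20% test data, as in the methodology
X_train, X_test, y_train, y_test = train_test_split(
    X_bow, y, test_size=0.2, random_state=42)

# alpha=1.0 applies Laplace (add-one) smoothing to the word counts;
# repeating with X_tfidf gives the second model
model = MultinomialNB(alpha=1.0)
model.fit(X_train, y_train)

print('accuracy:', accuracy_score(y_test, model.predict(X_test)))
```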
Based on the Multinomial NB text classification, it is concluded that:
- Overall, the model using Bag-of-Words achieved higher accuracy
- Laplace smoothing improves the model accuracy by as much as 10-11%
- Due to the unequal label distribution, the model predicts the most frequent labels more accurately than the least frequent ones
Check out the source code on GitHub