This chapter describes the key concepts and research methodologies used to extract service delivery sentiment from social media. It explains the overall process: the methods used to extract comments from social media, data pre-processing, classification of the data, training and testing of the model, and the method used to visualize the model. The chapter also addresses each of the research questions.
3.2 Research Methodology
Methodology implies more than the methods used to collect data.
It also considers the concepts that underlie those methods. Data collection was informed by a literature review, in which different materials were studied on how to extract data from social media, especially Instagram and Facebook. The Ikulu Mawasiliano Instagram account and Zitto Kabwe's Instagram and Facebook pages were used to collect the data.
A literature review is a research tool that enables the evaluator to make the best use of previous work in the field under investigation.
It helps the researcher learn from the experiences, findings and mistakes of previous related work (Goode 1952). The literature review
3.3 Data Collection
In this project, data were collected from the Instagram account of Ikulu Mawasiliano (Kurugenzi ya Mawasiliano ya Rais Ikulu), the official government account used to publish announcements and publications, and from the official Facebook and Instagram pages of Zitto Kabwe, a member of the Tanzanian parliament representing Kigoma.
Figure 3. Example of people commenting on the Ikulu Mawasiliano page.
During data collection, different scraping tools were used to mine data from the specific social media pages. These include a Chrome scraping extension, which is added to the Chrome browser as an extension. This scraping tool helps to mine all the comments from a specific post.
Figure 4. Chrome scraper extension
The other tools used to mine data were Octoparse and ParseHub.
Figure 5. ParseHub tool as used to mine data
Instagram and Facebook, and specifically the Ikulu Mawasiliano and Zitto Kabwe accounts, were selected for data collection for two reasons. First, an investigation of social media showed that these accounts post content directly related to social service delivery, and people comment on those posts to share their opinions and views. The pages are also active and post frequently. Second, the scope had to be limited: there are many accounts, and the research cannot extract data from all of them.
3.4 Data Pre-processing
Pre-processing is the stage where unneeded data are removed, for instance additional information such as emoji, links, and other unneeded characters that people include when they share their opinions. To prepare the collected data for machine learning tasks, text pre-processing includes stop word removal, tokenization, lemmatization, stemming, and feature engineering. Instance selection also copes with the infeasibility of learning from very large datasets (Kotsiantis, 2007), and it attempts to maintain the quality of mining with a minimum sample size.
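The cleaning step described above can be sketched as follows. This is a minimal illustration only: the function name clean_comment and the two regular expressions are assumptions introduced here, not the exact rules used in the study.

```python
import re

# Assumed patterns (not taken from the study): what counts as a link or as noise.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Emoji, punctuation, symbols, digits and underscores are treated as noise.
NOISE_PATTERN = re.compile(r"[^\w\s]|[\d_]")


def clean_comment(text: str) -> str:
    """Remove links, emoji and other unneeded characters, then normalize spacing."""
    text = URL_PATTERN.sub(" ", text)    # drop links first, before punctuation is stripped
    text = NOISE_PATTERN.sub(" ", text)  # drop emoji, punctuation, digits
    return " ".join(text.lower().split())


print(clean_comment("Huduma nzuri sana \U0001F600 https://example.com !!!"))
```

Links are removed before punctuation so that a URL is dropped as a whole rather than being broken into leftover word fragments.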
Non-English languages such as Arabic are highly derivative: tens or even hundreds of words can be formed from a single stem. According to Ahmed A. Elbery, working with Arabic documents without stemming may result in an enormous number of words being input into the classification phase.
Tokenization refers to the process of splitting text into units called tokens. The text is read and split into tokens (words), generally at blank spaces or at other delimiting characters.
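A minimal sketch of tokenization at blank spaces and punctuation is given below. The function name tokenize is hypothetical, and lowercasing the text is an added assumption rather than a step stated in this chapter.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Split text into lowercase word tokens at spaces and punctuation."""
    # \w+ matches runs of word characters, so spaces and punctuation act as delimiters.
    return re.findall(r"\w+", text.lower())


print(tokenize("Huduma ya maji, ni nzuri!"))
# ['huduma', 'ya', 'maji', 'ni', 'nzuri']
```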
Stop Words Removal
Another step performed in this research is the removal of all Arabic words that carry little meaning and occur frequently in the documents, such as or, whose, on, where, in, from, beyond and all. Removing stop words results in more effective processing and a more efficient term indexing procedure.
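Stop word removal can be illustrated with the short sketch below. The STOP_WORDS set is a hypothetical list built only from the example words given above; a real list would be much longer and specific to the language of the comments.

```python
from typing import Iterable, List

# Hypothetical stop-word list, built from the examples in the text.
STOP_WORDS = {"or", "whose", "on", "where", "in", "from", "beyond", "all"}


def remove_stop_words(tokens: Iterable[str]) -> List[str]:
    """Keep only the tokens that are not in the stop-word list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


print(remove_stop_words(["water", "from", "all", "wells"]))
# ['water', 'wells']
```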
3.5 Feature Engineering
This is the process of transforming the existing data features into new features that will be used to train a machine learning model. This process is important because the machine learning algorithm learns from the data it is given.
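As one possible illustration, tokenized comments can be turned into numeric features with a bag-of-words count vector. This chapter does not state which representation is used, so the choice of word counts and the names build_vocabulary and vectorize are assumptions made for this sketch.

```python
from collections import Counter
from typing import Dict, List


def build_vocabulary(docs: List[List[str]]) -> Dict[str, int]:
    """Map every distinct token to a column index (alphabetical order)."""
    vocab = sorted({token for doc in docs for token in doc})
    return {token: index for index, token in enumerate(vocab)}


def vectorize(doc: List[str], vocab: Dict[str, int]) -> List[int]:
    """Turn one tokenized comment into a vector of word counts."""
    vector = [0] * len(vocab)
    for token, count in Counter(doc).items():
        if token in vocab:  # tokens outside the vocabulary are ignored
            vector[vocab[token]] = count
    return vector


docs = [["huduma", "nzuri"], ["huduma", "mbaya", "mbaya"]]
vocab = build_vocabulary(docs)
print([vectorize(doc, vocab) for doc in docs])
# [[1, 0, 1], [1, 2, 0]]
```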
3.6 Confusion Matrix
The model is evaluated using a confusion matrix, after data cleaning and pre-processing. A confusion matrix measures the performance of a machine learning classifier whose output can be two or more classes. It is a table of the combinations of actual and predicted values.
                   Predicted Positive    Predicted Negative
Actual Positive    TP                    FN
Actual Negative    FP                    TN
Table 1. Confusion matrix
TP stands for true positive: the prediction was positive and the actual value is positive
TN stands for true negative: the prediction was negative and the actual value is negative
FP stands for false positive: the prediction was positive but the actual value is negative
FN stands for false negative: the prediction was negative but the actual value is positive
The confusion matrix is used to measure the accuracy of the model on a given dataset. The accuracy of the model is the correctness of the classifier, obtained by comparing the predicted values with the actual values in the dataset.
Accuracy = (Number of correct predictions)/(Total number of predictions)
Accuracy = (TP+TN)/(TP+TN+FN+FP)
Accuracy measures how close the predictions are to the actual data; the closer they are, the higher the accuracy.
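The four counts and the accuracy formula can be computed as in the sketch below. The label strings and the sample lists are made-up illustration data, not results from this research, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def confusion_counts(actual: Sequence[str], predicted: Sequence[str],
                     positive: str = "positive") -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for a two-class problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p != positive and a != positive:
            tn += 1  # predicted negative, actually negative
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FN + FP)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Made-up labels for illustration only.
actual = ["positive", "negative", "positive", "negative", "positive"]
predicted = ["positive", "negative", "negative", "positive", "positive"]
counts = confusion_counts(actual, predicted)
print(counts, accuracy(*counts))
# (2, 1, 1, 1) 0.6
```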
Recall is the percentage of all relevant results that the algorithm classifies correctly. A model that produces no false negatives has a recall of 1.
Recall = (True Positive)/(True Positive + False Negative)
Recall = TP/(TP+FN)
Precision is the ability of the model to identify only the relevant data.
Precision = (True Positive)/(True Positive + False Positive)
Precision = TP/(TP+FP)
The F-score measures precision and recall at the same time, using their harmonic mean:
F-score = (2 * Precision * Recall)/(Precision + Recall)
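Precision, recall and their harmonic mean can be computed from the confusion matrix counts as sketched below. The counts used in the example are hypothetical numbers chosen for illustration, and returning 0.0 for an empty denominator is a convention assumed here.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for illustration only.
tp, fp, fn = 6, 2, 4
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, round(f_score(p, r), 4))
# 0.75 0.6 0.6667
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F-score by being strong on precision alone or on recall alone.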
3.7 Classification Techniques
This research will use the following classification techniques.
Naïve Bayes is a supervised machine learning algorithm that applies Bayes' theorem with the assumption of independence between every pair of features.
In real-life problems, however, there are multiple X variables, as shown below.
P(Y | x_1, x_2, x_3, ..., x_n) = (P(x_1, x_2, x_3, ..., x_n | Y) * P(Y)) / P(x_1, x_2, x_3, ..., x_n)
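The rule above can be turned into a small text classifier as sketched below. This is an illustrative sketch under stated assumptions, not the implementation used in this research: it assumes a multinomial word-count model with add-one (Laplace) smoothing, the class name NaiveBayesSketch is hypothetical, and the training comments are made-up examples.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


class NaiveBayesSketch:
    """Multinomial naive Bayes over tokens, with add-one smoothing (assumed)."""

    def fit(self, docs: List[List[str]], labels: List[str]) -> "NaiveBayesSketch":
        self.class_counts = Counter(labels)       # number of documents per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {token for tokens in docs for token in tokens}
        self.total_docs = len(labels)
        return self

    def log_scores(self, tokens: List[str]) -> Dict[str, float]:
        """log P(Y) + sum of log P(x_i | Y) for each class Y.

        The denominator P(x_1, ..., x_n) is the same for every class,
        so it is left out when classes are only being compared.
        """
        scores = {}
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                # Independence assumption: each word contributes its own factor.
                likelihood = (self.word_counts[label][token] + 1) / (
                    total_words + len(self.vocab))
                score += math.log(likelihood)
            scores[label] = score
        return scores

    def predict(self, tokens: List[str]) -> str:
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Made-up training comments for illustration only.
train_docs = [["huduma", "nzuri"], ["maji", "safi", "nzuri"],
              ["huduma", "mbaya"], ["umeme", "hakuna", "mbaya"]]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayesSketch().fit(train_docs, train_labels)
print(model.predict(["huduma", "nzuri"]), model.predict(["maji", "mbaya"]))
# positive negative
```

Logarithms are used so that multiplying many small probabilities does not underflow; the class with the largest log score is the same as the class with the largest posterior probability.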
Why naïve Bayes
It is effective in high-dimensional feature spaces, even when the number of samples is smaller than the number of dimensions
It is memory efficient, because it stores only per-class feature statistics rather than the training points themselves
It requires the predictors to be independent, whereas in many real-life cases the predictors are dependent; this can limit the performance of the classifier
Support Vector Machine (SVM)
SVM is a supervised machine learning algorithm that finds a hyperplane in an N-dimensional feature space that separates the data points into classes.
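The role of the hyperplane can be illustrated with the sketch below, which classifies a point by the side of the hyperplane w·x + b = 0 on which it falls and computes the hinge loss that a linear SVM minimizes during training. The weights and bias are hypothetical values picked by hand for illustration, not a trained model, and the training procedure itself is not shown.

```python
from typing import Sequence


def decision_value(weights: Sequence[float], bias: float,
                   features: Sequence[float]) -> float:
    """Signed score w.x + b; its sign tells the side of the hyperplane."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(weights: Sequence[float], bias: float,
             features: Sequence[float]) -> int:
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(weights, bias, features) >= 0 else -1


def hinge_loss(weights: Sequence[float], bias: float,
               features: Sequence[float], label: int) -> float:
    """Zero when the point is on the correct side with margin at least 1."""
    return max(0.0, 1.0 - label * decision_value(weights, bias, features))


# Hypothetical hyperplane, chosen by hand for illustration only.
weights, bias = [1.0, -1.0], 0.0
print(classify(weights, bias, [2.0, 0.5]), classify(weights, bias, [0.5, 1.0]))
# 1 -1
print(hinge_loss(weights, bias, [2.0, 0.5], 1), hinge_loss(weights, bias, [0.5, 1.0], -1))
# 0.0 0.5
```

The second point is classified correctly but lies inside the margin, so it still incurs a hinge loss of 0.5; this is how the margin of separation enters the training objective.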
Advantages of SVM
It works well when there is a clear margin of separation between classes
It is effective in high-dimensional spaces, including when the number of samples is less than the number of dimensions
It is memory efficient
Disadvantages of SVM
It does not perform well when the dataset has a high amount of noise
It does not provide direct probability estimates, and it does not perform efficiently when a large amount of data is used