Twitter Sentiment Analysis, Part 3: Creating a Prediction Function and Testing It

Hello and welcome to the third part of this series on Twitter Sentiment Analysis using NLTK. In the previous parts we covered the basics of NLTK and built a dataset from positive and negative movie reviews. In this part we will create a function that predicts the sentiment of a sentence, and later we will use it on tweets. So let's understand how it works.

Our approach is to split a sentence into words, count how often those words occur in the positive and the negative dataset, and then compare the two counts.

If the sentence contains more positive words, it is classified as positive; if it contains more negative words, it is classified as negative; and if the positive and negative counts are equal, it is classified as none.

We will also save the difference between the positive and negative counts as the confidence. We can use it later to add another filter to the analysis.

As always, the full code is included here. If you want to understand a specific function or line, just navigate to that part of the explanation.

from nltk.corpus import stopwords
from nltk import word_tokenize
import pickle

stop_words = set(stopwords.words("english"))

pickle_in = open("/Users/pushkarsingh/Desktop/twitter/pos_adj.pickle", "rb")
pos_dict = pickle.load(pickle_in)

pickle_in = open("/Users/pushkarsingh/Desktop/twitter/neg_adj.pickle", "rb")
neg_dict = pickle.load(pickle_in)

def predict(example):
    pos_count = 0
    neg_count = 0
    ex_words = word_tokenize(example)

    for ex_word in ex_words:
        word = ex_word.lower()
        if word not in stop_words:
            if word in pos_dict:
                pos_count += pos_dict[word]
            if word in neg_dict:
                neg_count += neg_dict[word]

    if pos_count > neg_count:
        conf = pos_count - neg_count
        checker = "pos"
    elif pos_count < neg_count:
        conf = neg_count - pos_count
        checker = "neg"
    else:
        checker = "None"
        conf = 0

    return checker, conf

example_1="The movie is just a waste of time, 
it's complete junk, Totally waste of money."

example_2="The food of this restaurant is very good, 
I will recommend this place to everyone"

example_3="This is a low-quality product, 
even the reviews of this product is very poor."

print(predict(example_1))
print(predict(example_2))
print(predict(example_3))

Output:

('neg', 10)
('pos', 14)
('neg', 29)

Explanation

Here we are importing the dependencies.

from nltk.corpus import stopwords
from nltk import word_tokenize
import pickle

In this step, we load the English stopwords into a set.

Note that I am using a custom stopwords file, which you should already have if you have been following the lessons from part 1. You can also download the files from this GitHub repository and save them in the stopwords folder. For more info on how and where to save them, refer to part 1.

stop_words = set(stopwords.words("english"))
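
If you are curious which tokens get filtered out, a quick membership check (shown here with NLTK's standard English list; a custom stopwords file behaves the same way) makes it visible:

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
print("is" in stop_words)     # True  -> skipped during counting
print("waste" in stop_words)  # False -> scored against the dictionaries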

Here we are loading the pickles we saved in the previous lessons. These are dictionaries mapping positive and negative words to their occurrence counts, which we save in the variables pos_dict and neg_dict.

pickle_in = open("/Users/pushkarsingh/Desktop/twitter/pos_adj.pickle", "rb")
pos_dict = pickle.load(pickle_in)

pickle_in = open("/Users/pushkarsingh/Desktop/twitter/neg_adj.pickle", "rb")
neg_dict = pickle.load(pickle_in)
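
If you want to sanity-check what was loaded, you can print a few entries. The exact words and counts depend on the dataset you built in part 2, so your output will differ:

print(len(pos_dict), len(neg_dict))   # dictionary sizes depend on your dataset
for word in list(pos_dict)[:3]:       # peek at a few positive entries
    print(word, pos_dict[word])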

Here we are creating the function that we will use to predict the sentiment of tweets. We call this function predict, and inside it we initialize two counter variables that we will use later.

def predict(example):
    pos_count = 0
    neg_count = 0

In this step we are splitting the sentence into words and saving them in ex_words.

    ex_words = word_tokenize(example)
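
Note that word_tokenize also splits punctuation into separate tokens, so every word comes out as a clean token that we can match against the dictionaries:

from nltk import word_tokenize

print(word_tokenize("The movie is just a waste of time!"))
# ['The', 'movie', 'is', 'just', 'a', 'waste', 'of', 'time', '!']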

In this step we iterate over each word in ex_words, lowercase it, and check whether it is a stop word.

If it is not, we look the word up in pos_dict; if it is found there, we add its occurrence count to pos_count. Similarly, we look the word up in neg_dict and add its count to neg_count.

    for ex_word in ex_words:
        word = ex_word.lower()
        if word not in stop_words:
            if word in pos_dict:
                pos_count += pos_dict[word]
            if word in neg_dict:
                neg_count += neg_dict[word]
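
To see the counting in action, here is the same loop run against two tiny hand-made dictionaries. The real pos_dict and neg_dict come from the pickles; demo_pos and demo_neg below are stand-ins for illustration only:

from nltk import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

# Stand-in dictionaries; the real ones are loaded from the part-2 pickles.
demo_pos = {"good": 10, "recommend": 4}
demo_neg = {"waste": 7, "junk": 5}

pos_count = neg_count = 0
for ex_word in word_tokenize("The movie is junk, a waste of money"):
    word = ex_word.lower()
    if word not in stop_words:
        if word in demo_pos:
            pos_count += demo_pos[word]
        if word in demo_neg:
            neg_count += demo_neg[word]

print(pos_count, neg_count)  # 0 12 -> the sentence leans negative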

If pos_count is greater than neg_count, we can conclude that the sentence contains more positive words, so the sentence is positive; the same logic applies to negative.

We also calculate the difference between the two counters and save it in the confidence variable conf. The higher the conf value, the more reliable the prediction, so we can use it to add another layer of filtering.

If pos_count and neg_count are equal, we conclude that the sentence is either neutral or contains no words that match our positive and negative dictionaries, so we toss it out and mark it as 'None'.

    if pos_count > neg_count:
        conf = pos_count - neg_count
        checker = "pos"
    elif pos_count < neg_count:
        conf = neg_count - pos_count
        checker = "neg"
    else:
        checker = "None"
        conf = 0

In the end, we return two variables, checker and conf. checker is pos, neg, or None, while conf is the difference between pos_count and neg_count; when checker is None, conf is zero.

    return checker, conf
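
Since predict returns a tuple, you can unpack the label and the confidence directly when you call it:

label, conf = predict("The food of this restaurant is very good")
print(label, conf)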

Here we have three examples. Let's check the prediction values.

example_1="The movie is just a waste of time, 
it's complete junk, Totally waste of money."

example_2="The food of this restaurant is very good, 
I will recommend this place to everyone"

example_3="This is a low-quality product, 
even the reviews of this product is very poor."

print(predict(example_1))
print(predict(example_2))
print(predict(example_3))

Output:

('neg', 10)
('pos', 14)
('neg', 29)

Here we can clearly see that all three labels are correct and the confidence values are reasonably high.
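
As mentioned earlier, conf can act as an extra filter. Here is a minimal sketch of that idea; predict_filtered and the MIN_CONF cutoff are just a name and a value I am introducing for illustration, so tune the threshold against your own data:

MIN_CONF = 5  # arbitrary cutoff, tune it against your own data

def predict_filtered(example, min_conf=MIN_CONF):
    # Return the prediction only when it is confident enough.
    label, conf = predict(example)
    if label == "None" or conf < min_conf:
        return None  # too close to call, skip this sentence
    return label, conf

print(predict_filtered(example_1))  # ('neg', 10) given the output above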

That's all from my side. In the next part we will learn how to load tweets from the Twitter API: we will see how to get the keys and tokens and use them in our program to fetch tweets.

If you have any doubts, concerns, or suggestions up to this point, feel free to comment below.

Thanks for reading 😀

