Twitter Sentiment Analysis part 2: Preprocessing the Data and Pickle

Hello and welcome to part 2 of this series. In part 1 we learned the basics of NLTK: tokenizing, stop words, part-of-speech tagging, and so on.

In this part our work is easy. All we have to do is load the positive and negative movie reviews, split them into words, and then process them in a way that improves efficiency. After the processing is complete, we pickle the result so that we don't need to process the data again and again.

For the dataset, click on the GitHub repo and download the positive and negative review files.

As always, I am adding the full code here first; if you want to understand a specific function or line, just navigate to the corresponding block in the explanation below.

import nltk
from nltk.corpus import stopwords, wordnet
from nltk import word_tokenize
import pickle

stop_words= set(stopwords.words("english"))

neg_rev=open("/Users/pushkarsingh/Desktop/negative.txt","rb").read()
pos_rev=open("/Users/pushkarsingh/Desktop/positive.txt","rb").read()

pos=[]
neg=[]

for rev in pos_rev.splitlines():
    pos.append(rev)

for rev in neg_rev.splitlines():
    neg.append(rev)

pos_words=[]
neg_words=[]    

for pos_line in pos:
    pos_words.append(word_tokenize(str(pos_line)))

for neg_line in neg:
    neg_words.append(word_tokenize(str(neg_line)))


pos_words_new=[]
neg_words_new=[]    

for line in pos_words:
    for words in line:
        pos_words_new.append(words)

for line in neg_words:
    for words in line:
        neg_words_new.append(words)

pos_words_new_stopwords=[]
neg_words_new_stopwords=[]
        
for words in pos_words_new:
    if words not in stop_words:
        pos_words_new_stopwords.append(words)

for words in neg_words_new:
    if words not in stop_words:
        neg_words_new_stopwords.append(words)


tagged_pos=[]
tagged_neg=[]
pos_adj=[]
neg_adj=[] 

tagged_pos.append(nltk.pos_tag(pos_words_new_stopwords))
for i in range(len(tagged_pos[0])):
    if tagged_pos[0][i][1]=="JJ":
        try:
            pos_adj.append((tagged_pos[0][i][0]))
        except Exception as e:
            print(str(e))
            
tagged_neg.append(nltk.pos_tag(neg_words_new_stopwords))
for i in range(len(tagged_neg[0])):
    if tagged_neg[0][i][1]=="JJ":
        neg_adj.append((tagged_neg[0][i][0]))
      
for i in pos_adj[:]:   # iterate over a copy so that removing items does not skip elements
    if i in neg_adj:
        pos_adj.remove(i)
        neg_adj.remove(i)
        
pos_syn=[]
neg_syn=[]

for words in pos_adj:
    for syn in wordnet.synsets(words):
        for syn_word in syn.lemmas():
            pos_syn.append(syn_word.name())

for words in neg_adj:
    for syn in wordnet.synsets(words):
        for syn_word in syn.lemmas():
            neg_syn.append(syn_word.name())

pos_syn= list(set(pos_syn))
neg_syn= list(set(neg_syn))

for words in pos_adj:
    pos_syn.append(words)

for words in neg_adj:
    neg_syn.append(words)
        
pos_adj_FreqDist=dict(nltk.FreqDist(pos_syn))
neg_adj_FreqDist=dict(nltk.FreqDist(neg_syn))


pos_dict={}
neg_dict={}
count=0

for key1, value1 in pos_adj_FreqDist.items():
    for key2, value2 in neg_adj_FreqDist.items():
        if key1==key2:
            count+=1
            if(value1>value2):
                value1=value1-value2
                value2=0
                pos_dict.update({key1:value1})
            elif (value2>value1):
                value2=value2-value1
                value1=0
                neg_dict.update({key2:value2})

tagged_neg_dict=[]
tagged_pos_dict=[]
tagged_neg_dict_list=[]
tagged_pos_dict_list=[]


tagged_pos_dict.append(nltk.pos_tag(pos_dict.keys()))
for i in range(len(tagged_pos_dict[0])):
    if tagged_pos_dict[0][i][1]=="JJ":
        tagged_pos_dict_list.append(tagged_pos_dict[0][i][0])
        
tagged_neg_dict.append(nltk.pos_tag(neg_dict.keys()))
for i in range(len(tagged_neg_dict[0])):
    if tagged_neg_dict[0][i][1]=="JJ":
        tagged_neg_dict_list.append((tagged_neg_dict[0][i][0]))


pos_dict_updated={}
neg_dict_updated={}

for key1, value1 in pos_dict.items():
    for i in range(len(tagged_pos_dict_list)):
        if key1==tagged_pos_dict_list[i]:
            pos_dict_updated.update({key1:value1})

for key1, value1 in neg_dict.items():
    for i in range(len(tagged_neg_dict_list)):
        if key1==tagged_neg_dict_list[i]:
            neg_dict_updated.update({key1:value1})

pickle_in=open("/Users/pushkarsingh/Desktop/twitter/pos-yy_adj.pickle","wb")
pickle.dump(pos_dict_updated,pickle_in)
pickle_in.close()

pickle_in=open("/Users/pushkarsingh/Desktop/twitter/neg-yy_adj.pickle","wb")
pickle.dump(neg_dict_updated,pickle_in)
pickle_in.close()        

Here we are importing the necessary packages.

We already discussed installing NLTK in part 1. The pickle module is part of Python's standard library, so there is nothing extra to install.

import nltk
from nltk.corpus import stopwords, wordnet
from nltk import word_tokenize
import pickle

Here we are importing the stop words. As we discussed in part 1, stop words are words that carry no significant meaning on their own, such as "the", "a", "an", "is", "of", and so on.

These stop words don't change the sentiment of a sentence and only take up space, so we toss them out.

stop_words= set(stopwords.words("english"))
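
If you are curious what this set actually contains, you can print a few entries; the exact list varies with your NLTK version:

print(len(stop_words))           # number of stop words (varies with NLTK version)
print(sorted(stop_words)[:10])   # e.g. ['a', 'about', 'above', 'after', ...]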

Here we are loading the datasets which consist of positive and negative reviews.

neg_rev=open("/Users/pushkarsingh/Desktop/negative.txt","rb").read()
pos_rev=open("/Users/pushkarsingh/Desktop/positive.txt","rb").read()

First of all, we split the data into lines using splitlines() and save the lines in pos and neg. Since the files were opened in binary mode, each line is still a bytes object at this point.

pos=[]
neg=[]

for rev in pos_rev.splitlines():
    pos.append(rev)

for rev in neg_rev.splitlines():
    neg.append(rev)
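
A small side note, for illustration only: because the files were opened in binary mode ("rb"), each line returned by splitlines() is a bytes object, which is why str() is applied to the lines in the next step.

sample = b"great movie\nterrible acting"   # illustration only, not part of the dataset
print(sample.splitlines())                 # [b'great movie', b'terrible acting'] -- each line is bytes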

Here we convert the lines into words using NLTK's word_tokenize() function, as discussed in part 1, and save these words in pos_words and neg_words. str() is applied to each line because the lines are still bytes, and the result is a list of lists: one list of words per line.

pos_words=[]
neg_words=[]    

for pos_line in pos:
    pos_words.append(word_tokenize(str(pos_line)))

for neg_line in neg:
    neg_words.append(word_tokenize(str(neg_line)))

Since the words are arranged in a list of lists, we first iterate over each line, then over each word in that line, and collect all the words into the flat lists pos_words_new and neg_words_new.

pos_words_new=[]
neg_words_new=[]    

for line in pos_words:
    for words in line:
        pos_words_new.append(words)

for line in neg_words:
    for words in line:
        neg_words_new.append(words)

Now we have all the words in the datasets, but the count is huge and many of them are stop words.

So in this step we remove the stop words and save the refined lists in pos_words_new_stopwords and neg_words_new_stopwords.

pos_words_new_stopwords=[]
neg_words_new_stopwords=[]
        
for words in pos_words_new:
    if words not in stop_words:
        pos_words_new_stopwords.append(words)

for words in neg_words_new:
    if words not in stop_words:
        neg_words_new_stopwords.append(words)
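
As a side note, the same filtering can be written a bit more compactly with list comprehensions; this is just an equivalent alternative, not a change in the pipeline:

pos_words_new_stopwords = [w for w in pos_words_new if w not in stop_words]
neg_words_new_stopwords = [w for w in neg_words_new if w not in stop_words]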

At this point we have refined lists of words from the positive and negative reviews, but they still contain things like actor names, car names, and all sorts of other words that are useless for us. So in this step we tag the words with their parts of speech.

We already discussed part-of-speech tagging in part 1.

We are only interested in adjectives (tag "JJ"), because they are the words that best signal the positivity or negativity of a sentence, such as awesome, worst, good, and bad. We save them in pos_adj and neg_adj.

tagged_pos=[]
tagged_neg=[]
pos_adj=[]
neg_adj=[] 

tagged_pos.append(nltk.pos_tag(pos_words_new_stopwords))
for i in range(len(tagged_pos[0])):
    if tagged_pos[0][i][1]=="JJ":
        try:
            pos_adj.append((tagged_pos[0][i][0]))
        except Exception as e:
            print(str(e))
            
tagged_neg.append(nltk.pos_tag(neg_words_new_stopwords))
for i in range(len(tagged_neg[0])):
    if tagged_neg[0][i][1]=="JJ":
        neg_adj.append((tagged_neg[0][i][0]))
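
To see what nltk.pos_tag() actually returns, here is a tiny standalone example; the exact tags can vary with the tagger model, but adjectives come back tagged "JJ":

print(nltk.pos_tag(["awesome", "movie"]))   # typically [('awesome', 'JJ'), ('movie', 'NN')]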

Now we have a more refined set of words, but there are still many words that appear in both the positive and the negative set, and the number of occurrences plays an important role here.

For example:

1. This movie is good.
2. This movie is not good.

The first sentence is positive and the second is negative, but since we only keep the words, "good" ends up in both sets. If we kept just a single instance of each word per set, "good" would cancel out and become useless. To counter this, for every occurrence of a word in the positive set that also appears in the negative set, we remove one occurrence from each set.

For example, if there are 30 occurrences of "good" in the positive set and 5 in the negative set, we end up with 25 "good" in the positive set and none in the negative set. This trims away some more words and speeds things up.

for i in pos_adj[:]:   # iterate over a copy so that removing items does not skip elements
    if i in neg_adj:
        pos_adj.remove(i)
        neg_adj.remove(i)

In this step we look up the WordNet synonyms of each adjective in the positive and negative lists and append them to the new lists pos_syn and neg_syn.

One problem is that this appends the synonyms once for every instance of the same word, but we take care of that in the next step.

The synonyms make the data more robust, since they give us many more words for the positive and negative classes. The analysis may still go wrong for one or two tweets, but since we will run it on more than 1000 tweets, the overall sentiment will still come out right.

pos_syn=[]
neg_syn=[]

for words in pos_adj:
    for syn in wordnet.synsets(words):
        for syn_word in syn.lemmas():
            pos_syn.append(syn_word.name())

for words in neg_adj:
    for syn in wordnet.synsets(words):
        for syn_word in syn.lemmas():
            neg_syn.append(syn_word.name())
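
To get a feel for what wordnet.synsets() and lemmas() return, here is a small standalone example; the exact synsets depend on your WordNet data:

for syn in wordnet.synsets("good")[:2]:
    print(syn.name(), [lemma.name() for lemma in syn.lemmas()])
# e.g. good.n.01 ['good']
#      good.n.02 ['good', 'goodness']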

In this step we use the set() function, which reduces any number of instances of a word to a single one. For example, if the data contains 401 words, of which 400 are "good" and 1 is "bad", then after this operation it contains only the two words "good" and "bad".

Note that we only apply set() to the synonym lists.

We then cast the set back into a list and append the original adjectives to it.

pos_syn= list(set(pos_syn))
neg_syn= list(set(neg_syn))

for words in pos_adj:
    pos_syn.append(words)

for words in neg_adj:
    neg_syn.append(words)
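
A toy example of what set() is doing here, using the 401-word case from above:

words = ["good"] * 400 + ["bad"]       # 400 copies of "good" and one "bad"
print(list(set(words)))                # ['good', 'bad'] (order is not guaranteed)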

In this step we use the FreqDist() function from NLTK. It returns each word together with its number of occurrences in the list.

By casting the result to dict, we get a dictionary in which the key is the word and the value is the number of occurrences.

We need the number of occurrences to increase the weight of a word in the analysis.

pos_adj_FreqDist=dict(nltk.FreqDist(pos_syn))
neg_adj_FreqDist=dict(nltk.FreqDist(neg_syn))
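
Here is a quick standalone example of FreqDist() on a toy list, just to show the shape of the output:

sample = ["good", "good", "bad", "great", "good"]
print(dict(nltk.FreqDist(sample)))     # {'good': 3, 'bad': 1, 'great': 1}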

Here we search for words that appear in both dictionaries. When one is found, we subtract the smaller count from the larger and keep the word only on the side where it occurs more often, just as we did earlier with the adjective lists.

pos_dict={}
neg_dict={}
count=0

for key1, value1 in pos_adj_FreqDist.items():
    for key2, value2 in neg_adj_FreqDist.items():
        if key1==key2:
            count+=1
            if(value1>value2):
                value1=value1-value2
                value2=0
                pos_dict.update({key1:value1})
            elif (value2>value1):
                value2=value2-value1
                value1=0
                neg_dict.update({key2:value2})

Up to this point we have dictionaries in which the key is a word and the value is its number of occurrences, and no word appears in both dictionaries.

However, some of these words may still be names, places, or other things that contribute nothing to the analysis. So we tag each word with its part of speech again and extract the adjectives as we did earlier, just to double-check.

In the first line we assign a part-of-speech tag to the keys of the positive dictionary and save the result in a list.

This gives us a list of (word, tag) tuples, like [("good", "JJ"), ("America", "NNP")].

Next we iterate over each pair, look for the "JJ" (adjective) tag, and if it is found, save the word in a new list.

tagged_neg_dict=[]
tagged_pos_dict=[]
tagged_neg_dict_list=[]
tagged_pos_dict_list=[]


tagged_pos_dict.append(nltk.pos_tag(pos_dict.keys()))
for i in range(len(tagged_pos_dict[0])):
    if tagged_pos_dict[0][i][1]=="JJ":
        tagged_pos_dict_list.append(tagged_pos_dict[0][i][0])

tagged_neg_dict.append(nltk.pos_tag(neg_dict.keys()))
for i in range(len(tagged_neg_dict[0])):
    if tagged_neg_dict[0][i][1]=="JJ":
        tagged_neg_dict_list.append((tagged_neg_dict[0][i][0]))

Now we have the lists of double-checked adjectives. In this step we pair each adjective with its number of occurrences.

Since pos_dict and neg_dict already contain all the words and their counts, we look up each adjective there and, if it is found, copy its value into a new, final dictionary.

pos_dict_updated={}
neg_dict_updated={}

for key1, value1 in pos_dict.items():
    for i in range(len(tagged_pos_dict_list)):
        if key1==tagged_pos_dict_list[i]:
            pos_dict_updated.update({key1:value1})

for key1, value1 in neg_dict.items():
    for i in range(len(tagged_neg_dict_list)):
        if key1==tagged_neg_dict_list[i]:
            neg_dict_updated.update({key1:value1})

Finally, we have dictionaries in which each key is an adjective and each value is its number of occurrences. In this step we pickle these dictionaries so that we don't have to build them again and again.

pickle_in=open("/Users/pushkarsingh/Desktop/twitter/pos-yy_adj.pickle","wb")
pickle.dump(pos_dict_updated,pickle_in)
pickle_in.close()

pickle_in=open("/Users/pushkarsingh/Desktop/twitter/neg-yy_adj.pickle","wb")
pickle.dump(neg_dict_updated,pickle_in)
pickle_in.close() 
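
In the later parts these pickles can simply be loaded back instead of re-running the whole preprocessing. A minimal sketch of loading them, using the same paths as above (the variable names pos_dict_loaded and neg_dict_loaded are just for illustration):

with open("/Users/pushkarsingh/Desktop/twitter/pos-yy_adj.pickle","rb") as f:
    pos_dict_loaded = pickle.load(f)

with open("/Users/pushkarsingh/Desktop/twitter/neg-yy_adj.pickle","rb") as f:
    neg_dict_loaded = pickle.load(f)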

And this concludes this part.

In the next part we will create a function to predict whether a sentence is more positive or negative, and in later parts we will load tweets from Twitter and show the analysis on a live graph.

As always if you have any doubt, suggestion or concern then please comment below.

Thanks a lot. 😀

