Twitter Sentiment Analysis part 1: Installing NLTK and Learning Basics

Hello and welcome to this series on NLTK, the Natural Language Toolkit. With the advances in Artificial Intelligence, Natural Language Processing is more popular than ever: whether it is semantic analysis, a chatbot, or a virtual assistant like Google Assistant, Siri or Alexa, NLP has a vast range of use cases.

In this tutorial series, we will learn about various features of NLTK. There is a lot to cover, so we will start with the basics here. We will also build a project on Twitter Sentiment Analysis, in which we will load tweets on a particular subject, analyse whether each tweet tends towards positive or negative, and then view the whole analysis on a live graph.

First of all, we need NLTK, so let's start by installing it.

There are two things we require to use this toolkit: NLTK itself, and the NLTK data files.

To install NLTK we can simply use pip.

pip install nltk

Once NLTK is installed, just open IDLE (or any Python shell), type the following code and run it:

import nltk
nltk.download()

Upon successful execution, a dialogue box like the one below will open; just select "Download" and it will download the whole toolkit.

(NLTK downloader dialogue box)

Once NLTK is installed and NLTK_data is downloaded, we can proceed to learn the basics.
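If you prefer not to download the whole collection, you can also fetch only the resources this tutorial relies on. A minimal sketch, using the standard NLTK resource names for the tokenizer models, stopword lists, POS tagger and WordNet:

import nltk

# Download just the packages used in this tutorial
nltk.download("punkt")                        # models for word_tokenize / sent_tokenize
nltk.download("stopwords")                    # the stopword lists
nltk.download("averaged_perceptron_tagger")   # the default POS tagger
nltk.download("wordnet")                      # the WordNet corpus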

1. Tokenizing

If you are using NLTK, you will surely use tokenizing at some point.

Suppose we have a line, "I am feeling great today." To analyse this line we have to separate each word, understand the meaning of each one, and then somehow infer the meaning of the whole line.

So, in a nutshell, we need something that can split a sentence into words.

For this we have word_tokenize and sent_tokenize.

While word_tokenize splits a sentence into words, sent_tokenize splits a paragraph into sentences.

Let’s understand better with the help of an example.

Word Tokenize

from nltk import word_tokenize
example= "Natural Programming Language has a great future."
print(word_tokenize(example))
Output: 

['Natural', 'Programming', 'Language', 'has', 'a', 'great', 'future', '.']

Here we have the list of the words contained in the sentence.

Sent Tokenize

from nltk import sent_tokenize
example= "Winter is coming. Thanos will die. 
print(sent_tokenize(example))
Output:

['Winter is coming.', 'Thanos will die.']

Here we have the list of the sentences contained in the paragraph.

Both functions have their own use cases. We will use word_tokenize in the Twitter analysis, where we split each tweet into words, check how many words tend towards negative and how many towards positive, and then conclude whether the tweet is more positive or more negative overall.
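As a rough sketch of that idea (the tiny positive and negative word lists here are placeholders for illustration only, not the lexicon we will actually use later):

from nltk import word_tokenize

# Hypothetical, hand-picked word lists just for illustration
positive_words = {"great", "good", "love", "awesome"}
negative_words = {"bad", "terrible", "hate", "awful"}

tweet = "I love this movie, the ending was awesome but the villain was terrible"
words = [w.lower() for w in word_tokenize(tweet)]

pos_count = sum(1 for w in words if w in positive_words)
neg_count = sum(1 for w in words if w in negative_words)

print("positive" if pos_count >= neg_count else "negative")  # 2 positive vs 1 negative -> positive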

2. Stopwords

Stopwords are words which have no significant effect on the meaning of a sentence. As we can see in the word_tokenize example above, we have 'has', 'a' and '.'; these words do not contribute to the meaning of the sentence, and without them it becomes "Natural Programming Language great future", which is still understandable.

Thus stopwords are a collection of such words, including is, the, am, are, a, an, etc.

You can also add or delete words in this list. To do this, just search your machine for the "nltk_data" folder.

Then navigate to corpora > stopwords > english

and open the file in a text editor; there we can add and delete words. I have added some extra symbols which I found were missing from this list. You can find my file in this GitHub repo and replace yours with it.
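If you would rather not edit the file on disk, you can also extend the stopword set in code. A small sketch (the extra tokens added here are just example additions):

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

# Add punctuation and any custom tokens we want to ignore (example additions)
stop_words.update([".", ",", "!", "?", "rt"])

print(len(stop_words))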

Now let's come back to stopwords and look at how we can use them in our program:

from nltk.corpus import stopwords
from nltk import word_tokenize

stop_words= set(stopwords.words("english"))

example="I am a big fan of Avengers."

words= word_tokenize(example)
for word in words:
	if word not in stop_words:
		print(word)


Output: 

I 
big
fan
Avengers

Here, as you can see, words like "am", "a" and "of" are tossed out. Note that "I" is still there because the stopword list is all lowercase ("i"), so capitalised words slip through unless we lowercase them first.
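A small tweak handles that. Here is a sketch of the same filter with lowercasing applied before the check:

from nltk.corpus import stopwords
from nltk import word_tokenize

stop_words = set(stopwords.words("english"))

example = "I am a big fan of Avengers."

# Lowercase each token before checking it against the stopword list
filtered = [w for w in word_tokenize(example) if w.lower() not in stop_words]
print(filtered)  # ['big', 'fan', 'Avengers', '.'] with the default stopword list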

3. Part of speech tagging

We use part of speech tagging to tag each word as a verb, adjective, noun, adverb, preposition, etc. This is very useful when we have to find a particular set of words; for example, if we need to find the names of people, places or things, we can search for all the nouns in a paragraph.

Let's understand it with the help of an example:

import nltk
from nltk import word_tokenize, sent_tokenize

example = "Thank you very much. Mr. Speaker, Mr. President, distinguished members of Congress, honored guests and fellow citizens. May I congratulate all of you who are members of this historic 100th Congress of the United States of America. In this 200th anniversary year of our Constitution, you and I stand on the shoulders of giants–men whose words and deeds put wind in the sails of freedom."

tagged = []
words = []
for line in sent_tokenize(example):
    for word in word_tokenize(line):
        words.append(word)
tagged.append(nltk.pos_tag(words))
print(tagged)
Output:

[[('Thank', 'NNP'), ('you', 'PRP'), ('very', 'RB'), ('much', 'RB'), ('.', '.'), ('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Mr.', 'NNP'), ('President', 'NNP'), (',', ','), ('distinguished', 'VBD'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('honored', 'VBD'), ('guests', 'NNS'), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), ('.', '.'), ('May', 'NNP'), ('I', 'PRP'), ('congratulate', 'VBP'), ('all', 'DT'), ('of', 'IN'), ('you', 'PRP'), ('who', 'WP'), ('are', 'VBP'), ('members', 'NNS'), ('of', 'IN'), ('this', 'DT'), ('historic', 'JJ'), ('100th', 'JJ'), ('Congress', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('of', 'IN'), ('America', 'NNP'), ('.', '.'), ('In', 'IN'), ('this', 'DT'), ('200th', 'CD'), ('anniversary', 'JJ'), ('year', 'NN'), ('of', 'IN'), ('our', 'PRP$'), ('Constitution', 'NNP'), (',', ','), ('you', 'PRP'), ('and', 'CC'), ('I', 'PRP'), ('stand', 'VBP'), ('on', 'IN'), ('the', 'DT'), ('shoulders', 'NNS'), ('of', 'IN'), ('giants', 'NNS'), ('–', ':'), ('men', 'NNS'), ('whose', 'WP$'), ('words', 'NNS'), ('and', 'CC'), ('deeds', 'NNS'), ('put', 'VBD'), ('wind', 'NN'), ('in', 'IN'), ('the', 'DT'), ('sails', 'NNS'), ('of', 'IN'), ('freedom', 'NN'), ('.', '.')]]

Here is the tagged output for the example program. JJ stands for adjective, like "historic" and "fellow"; PRP stands for personal pronoun, like "you" and "I".

Here is the full Part of Speech tag list:

CC	coordinating conjunction
CD	cardinal digit
DT	determiner
EX	existential there (like: "there is" ... think of it like "there exists")
FW	foreign word
IN	preposition/subordinating conjunction
JJ	adjective	'big'
JJR	adjective, comparative	'bigger'
JJS	adjective, superlative	'biggest'
LS	list marker	1)
MD	modal	could, will
NN	noun, singular 'desk'
NNS	noun plural	'desks'
NNP	proper noun, singular	'Harrison'
NNPS	proper noun, plural	'Americans'
PDT	predeterminer	'all the kids'
POS	possessive ending	parent's
PRP	personal pronoun	I, he, she
PRP$	possessive pronoun	my, his, hers
RB	adverb	very, silently,
RBR	adverb, comparative	better
RBS	adverb, superlative	best
RP	particle	give up
TO	to	go 'to' the store.
UH	interjection	oh, wow
VB	verb, base form	take
VBD	verb, past tense	took
VBG	verb, gerund/present participle	taking
VBN	verb, past participle	taken
VBP	verb, singular present, non-3rd person	take
VBZ	verb, 3rd person sing. present	takes
WDT	wh-determiner	which
WP	wh-pronoun	who, what
WP$	possessive wh-pronoun	whose
WRB	wh-adverb	where, when

Thus we can separate out the adjectives, nouns or adverbs, whichever we need for our use case.
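For instance, here is a small sketch that keeps only the nouns from a tagged sentence (any tag starting with "NN"):

import nltk
from nltk import word_tokenize

sentence = "Mr. Speaker, Mr. President, distinguished members of Congress, honored guests and fellow citizens."
tagged = nltk.pos_tag(word_tokenize(sentence))

# Keep only the noun tags: NN, NNS, NNP, NNPS
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)  # e.g. ['Mr.', 'Speaker', 'Mr.', 'President', 'members', 'Congress', 'guests', 'citizens']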

4. Stemming

python, pythoned, pythoner and pythoning all share the same root, "python", but these 4 words would otherwise be treated as 4 different tokens taking 4 different memory spaces. So why not stem them down to the root word, so that they are stored and counted as just one?

So, for this purpose, we use the stemming process.

In NLTK there is a famous stemmer class called PorterStemmer. Let's use it in our program to understand this better.

from nltk.stem import PorterStemmer

ps = PorterStemmer()  # PorterStemmer is a class, so create an instance first
example_words = ["python", "pythoning", "pythoned", "pythoner"]
for w in example_words:
    print(ps.stem(w))
Output:

python
python
python
python

As we can see all the words stem down to the root word.
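In practice we usually stem the tokens coming out of word_tokenize. A quick sketch on a made-up sentence:

from nltk import word_tokenize
from nltk.stem import PorterStemmer

ps = PorterStemmer()
sentence = "Pythoners who are pythoning love pythonic programs."
stems = [ps.stem(w) for w in word_tokenize(sentence)]
print(stems)  # all the python-derived words collapse to 'python'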

5. Frequency Distribution

Sometimes we want to know how many times a word occurs in an article, or what the 15 most common words in an article are. For this we use the FreqDist() function from NLTK.

Let’s see how to use it.

import nltk
from nltk import word_tokenize, sent_tokenize

example = "Thank you very much. Mr. Speaker, Mr. President, distinguished members of Congress, honored guests and fellow citizens. May I congratulate all of you who are members of this historic 100th Congress of the United States of America. In this 200th anniversary year of our Constitution, you and I stand on the shoulders of giants–men whose words and deeds put wind in the sails of freedom."

words = []
for line in sent_tokenize(example):
    for word in word_tokenize(line):
        words.append(word)
words_dist = nltk.FreqDist(words)

print(words_dist.most_common(5))
Output:

[('of', 8), ('.', 4), (',', 4), ('you', 3), ('and', 3)]

Here are the 5 most common words in this text. As we can see, all five are stopwords or punctuation, which is exactly why we need to toss the stopwords out.
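Combining this with the stopword filtering from earlier gives much more informative counts. A sketch, reusing the words list built from the speech excerpt above:

import nltk
from nltk.corpus import stopwords

# Assumes `words` already holds the tokens from the example above
stop_words = set(stopwords.words("english"))
filtered = [w.lower() for w in words if w.lower() not in stop_words and w.isalpha()]

print(nltk.FreqDist(filtered).most_common(5))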

6. Wordnet

WordNet is a huge lexical database that provides synsets, meanings, definitions, examples, synonyms, antonyms, etc.

Let’s understand with the help of an example.

from nltk.corpus import wordnet

#Synsets
words = wordnet.synsets("big")
print(words)


'''
output: 

[Synset('large.a.01'), Synset('big.s.02'), Synset('bad.s.02'), 
Synset('big.s.04'), Synset('big.s.05'), Synset('big.s.06'), 
Synset('boastful.s.01'), Synset('big.s.08'), Synset('adult.s.01'), 
Synset('big.s.10'), Synset('big.s.11'), Synset('big.s.12'), 
Synset('big.s.13'), Synset('big.r.01'), Synset('boastfully.r.01'), 
Synset('big.r.03'), Synset('big.r.04')]
'''

#Definition
print(words[0].definition()) #Synset('large.a.01')


'''
output: 

above average in size or number or quantity or magnitude or extent
'''


#Examples
print(words[0].examples()) #Synset('large.a.01')


'''
output: 

['a large city', 'set out for the big city', 'a large sum', 
'a big (or large) barn', 'a large family', 'big businesses', 
'a big expenditure', 'a large number of newspapers', 
'a big group of scientists', 'large areas of the world']
'''


#Synonyms and Antonyms
synonyms=[]
antonyms=[]
for syn in wordnet.synsets("big"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())
print("synonyms:   {}".format(set(synonyms)))
print("\n")
print("Antonyms:   {}".format(set(antonyms)))

'''
output:

synonyms:   

{'magnanimous', 'heavy', 'braggy', 'fully_grown', 'with_child',
 'self-aggrandising', 'adult', 'large', 'prominent', 'bragging', 
'self-aggrandizing', 'expectant', 'grownup', 'freehanded', 'big',
 'great', 'bountiful', 'cock-a-hoop', 'braggart', 'crowing',
 'boastfully', 'boastful', 'swelled', 'handsome', 'bad', 'enceinte',
 'bounteous', 'vauntingly', 'openhanded', 'vainglorious', 'gravid',
 'full-grown', 'giving', 'bighearted', 'grown', 'liberal'}

Antonyms: 
{'little', 'small'}
'''

Here we learn that WordNet lets us do many useful things, including finding synonyms and antonyms.

And this concludes the lesson. We have learned the basics of NLTK, and this knowledge is enough to start working on larger text data. I suggest you build a small project using it to understand the concepts better.

This is all from my side. In the next few lessons, we will build the project on Twitter Sentiment Analysis.

If you have any doubt, concern or suggestion, feel free to drop a comment below.

Thanks for reading 😀

