Hands-On NLP: Natural Language Processing

Akshith Kumar
7 min read · May 26, 2021

In detail: Bag of Words, TF-IDF, RNNs, GRUs & LSTMs.

Photo by Lazar Gugleta on Unsplash

Yes! Alexa is built on NLP modelling.

But how does NLP work, and how is it designed for modelling?

In this article, I will prepare data for NLP modelling and explain everything in detail, both in theory and in code.

What is Natural Language Processing?

Natural Language Processing is about building models that let us interact with machines in human languages. It is a subfield of linguistics, computer science & AI.

NLP is essentially about designing systems that respond to human commands. Since those commands arrive as text, the model needs to be trained on text. To do that, NLP provides a few preprocessing steps that put the text into proper shape for modelling.

Once the text is ready, we can train the model with a few techniques for better predictions: RNNs, GRUs & LSTMs. Besides these, I will show a few other techniques such as Bag of Words, TF-IDF and Word2Vec with ML algorithms.

Steps to follow in text preparation for modelling

  1. Tokenization

Tokenization splits text into sentences & words. We can use it to break an essay into sentences and sentences into words, as shown below.

import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # download the tokenizer data on first use
text = 'I like to watch cricket news. But, I love to watch movies than cricket. Can we go to library by tomorrow.'
sentence = sent_tokenize(text)
sentence
Output :
['I like to watch cricket news.',
'But, I love to watch movies than cricket.',
'Can we go to library by tomorrow.']
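
To split a sentence into words rather than text into sentences, word_tokenize works the same way (a small illustration using the first sentence above):

from nltk.tokenize import word_tokenize

words = word_tokenize('I like to watch cricket news.')
print(words)  # ['I', 'like', 'to', 'watch', 'cricket', 'news', '.']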

2. Stemming

Stemming cuts words down to their root form; it is nothing but chopping off word endings.

Three types of stemmers:

  • PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

text = 'NLP is good and easy to learn with no efforts'
words = word_tokenize(text)
words

ps = PorterStemmer()
for word in words:
    print(ps.stem(word))
Output :
words- ['NLP', 'is', 'good', 'and', 'easy', 'to', 'learn', 'with', 'no', 'efforts']
nlp
is
good
and
easi
to
learn
with
no
effort
  • LancasterStemmer
from nltk.stem import LancasterStemmer

lst = LancasterStemmer()
words_to_stem = ['call', 'calling', 'caller', 'called']
for word in words_to_stem:
    print(word + ":" + lst.stem(word))
Output :
call:cal
calling:cal
caller:cal
called:cal
  • SnowballStemmer
from nltk.stem import SnowballStemmer

sbs = SnowballStemmer('english')  # specify the language to stem
words_to_stem = ['call', 'calling', 'caller', 'called']
for word in words_to_stem:
    print(word + ":" + sbs.stem(word))
Output :
call:call
calling:call
caller:caller
called:call

3. Lemmatization

Lemmatization is similar to stemming, but it returns meaningful words. As you can see in the outputs above, some of the stems are not real words; with lemmatization we get meaningful (dictionary) words back from the text.

from nltk.stem import WordNetLemmatizer

word_len = WordNetLemmatizer()
words_to_stem = ['call', 'calling', 'caller', 'called']
for word in words_to_stem:
    print(word + ":" + word_len.lemmatize(word))
Output :
call:call
calling:calling
caller:caller
called:called

4. Stopwords

Stopwords are the words that appear most frequently in text but are usually not necessary for building an NLP model. Keeping them can add noise to the model's predictions, so we remove them with a small piece of NLP code.

import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')  # download the stopword lists on first use
stopwords.words('english')  # we can use any supported language

In the code above, I used English to work with English text; you can specify any supported language you want to deal with.
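
As a quick illustration (a minimal sketch, assuming the stopword list has been downloaded), here is how stopwords can be filtered out of a tokenized sentence:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
text = 'I like to watch cricket news'
# keep only the tokens that are not stopwords
filtered = [w for w in word_tokenize(text) if w.lower() not in stop_words]
print(filtered)  # ['like', 'watch', 'cricket', 'news']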

5. POS Tags

POS tagging assigns a part-of-speech tag to each word in a sentence, and it works for any supported language. POS tags are not strictly required for building the NLP model, but they are useful to know.

sample = 'Have u find the person your searching for past one week'
sample_tokens = word_tokenize(sample)
for word in sample_tokens:
    print(nltk.pos_tag([word]))
Output :
[('Have', 'VB')]
[('u', 'NN')]
[('find', 'VB')]
[('the', 'DT')]
[('person', 'NN')]
[('your', 'PRP$')]
[('searching', 'VBG')]
[('for', 'IN')]
[('past', 'NN')]
[('one', 'CD')]
[('week', 'NN')]

6. Frequency distribution & n-grams

Frequency distribution counts how often each word occurs in the text, in other words the most common, repeated words. An n-gram is a contiguous sequence of n items (symbols, words or letters) in text or speech; n-grams are also called shingles.

from nltk.corpus import gutenberg,webtext
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import ngrams
from nltk.probability import FreqDist
stop_words = stopwords.words('english')
# keep non-stopword tokens longer than 3 characters
filtered_words = [w for w in webtext.words('firefox.txt') if w not in stop_words and len(w) > 3]
five_grams = ngrams(filtered_words, 5)  # contiguous sequences of 5 words
fdist = FreqDist(five_grams)            # frequency distribution of the 5-grams
fdist.most_common(20)
Output :
[(('launching', 'browser', 'window', 'binding', 'browser'), 6),
(('Error', 'launching', 'browser', 'window', 'binding'), 5),
(('allow', 'sites', 'removed', 'cookies', 'future'), 2),
(('sites', 'removed', 'cookies', 'future', 'cookies'), 2)]

Done with text preparation.

Hey! Machines can only understand numbers. So, what do we do with strings?

Then we can do text encodings with the help of :

  1. Bag of words (Count Vectorizer)
  2. TF-IDF (Tfidf Vectorizer)
  3. Word2Vec

Bag of Words is a feature-extraction method for text data: it represents each document by the counts of the words it contains.

TF-IDF is a statistical method used to judge how relevant a word is within a collection of documents. It gives more weight to uncommon words.
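
The article only shows the Count Vectorizer in code later on, but as a rough sketch (with a toy corpus made up for illustration), TF-IDF features can be extracted with scikit-learn's TfidfVectorizer in the same way:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['i like cricket news', 'i love movies more than cricket']  # toy corpus for illustration
tfidf = TfidfVectorizer(max_features=3000, ngram_range=(1, 3))
x = tfidf.fit_transform(corpus).toarray()  # one row per document, one column per term
print(x.shape)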

Word2Vec is a prominent NLP technique that is trained on a large body of text and represents each word as a vector of numbers. It captures semantic information & the relationships between different words, with each word represented as a vector in a multi-dimensional space.
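
Word2Vec is not implemented later in this article, but a rough sketch with gensim (an assumption on my part; gensim 4.x uses vector_size, older versions use size) looks like this:

from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences, just for illustration
sentences = [['i', 'like', 'cricket'], ['i', 'love', 'movies', 'more', 'than', 'cricket']]
w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1)
print(w2v.wv['cricket'].shape)         # 50-dimensional vector for the word 'cricket'
print(w2v.wv.most_similar('cricket'))  # words with the most similar vectors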

Note: Bag of Words and TF-IDF approaches do not store semantic information.

Most importantly, these techniques are used with classical machine-learning algorithms. When it comes to neural networks, we need to learn a bit more about text encodings.

Now, let's see it in code.

Import packages and load the data

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

df = pd.read_csv('emotion-labels-train.csv')
df.head()

# encode the emotion labels as integers
label = {'joy': 0, 'fear': 1, 'anger': 2, 'sadness': 3}
df.label = [label[i] for i in df.label]
print(df.label)

import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

wl = WordNetLemmatizer()
ps = PorterStemmer()
corpus = []
for i in range(0, len(df)):
    # keep only letters, lowercase and split into words
    process = re.sub('[^a-zA-Z]', ' ', df['text'][i])
    process = process.lower()
    process = process.split()
    # remove stopwords and lemmatize the remaining words
    process = [wl.lemmatize(word) for word in process if word not in stopwords.words('english')]
    process = ' '.join(process)
    corpus.append(process)

Using Bag of words

# Bag of Words / Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=3000, ngram_range=(1, 3))
x = cv.fit_transform(corpus).toarray()

Using a decision tree classifier on these features, I got an accuracy of 80%.
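
The classifier code is not shown here, but a minimal sketch of that step (the exact parameters behind the 80% result are not given, so this is only an illustration using the x and df from above) might look like:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

y = df.label  # encoded emotion labels from earlier
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)
print(accuracy_score(y_test, dt.predict(x_test)))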

Let's now check with neural networks and try to build a model with a few techniques.

It is important to mention that text preparation is the same for both ML and neural-network models; the only difference is which text-encoding technique is used.

In simple words,

For ML, use Bag of Words, TF-IDF, Word2Vec, etc.

For neural networks, use one-hot encodings, pad sequences, word embeddings, etc.

I hope this clears up the doubts around NLP modelling.

Happy Faces!!

Photo by Ben White on Unsplash

What is One-Hot Encoding?

It assigns a different integer index to each word in the text.
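
A minimal sketch with the Keras one_hot helper (the vocabulary size of 50 is just an assumed value for illustration):

from tensorflow.keras.preprocessing.text import one_hot

vocab_size = 50  # assumed vocabulary size, for illustration only
sentences = ['I like cricket', 'I love movies']
# each word is mapped to an integer index between 1 and vocab_size
encoded = [one_hot(s, vocab_size) for s in sentences]
print(encoded)  # e.g. [[12, 7, 33], [12, 41, 9]] -- actual numbers depend on hashing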

What are Pad Sequences?

Pad sequences makes all the sentences in the document or text data the same length by padding them with zeros. The zeros can be added before ('pre') or after ('post') each sequence.
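
A small sketch (continuing the hypothetical encoded sentences from above):

from tensorflow.keras.preprocessing.sequence import pad_sequences

encoded = [[12, 7, 33], [12, 41]]  # integer-encoded sentences of different lengths
padded = pad_sequences(encoded, maxlen=5, padding='pre')  # use padding='post' for trailing zeros
print(padded)
# [[ 0  0 12  7 33]
#  [ 0  0  0 12 41]]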

Embeddings :

They map words into a dense vector space of a chosen dimensionality, so that related words end up with similar vectors.

Embedding(vocab_size, embedding_dim, input_length = max_length)
  • The embedding dimension is the number of dimensions in the vector representation of each word.
# Building a simple RNN model
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

embedding_vector = 40
model = Sequential([
    Embedding(vocab_size, embedding_vector, input_length=sent_len),
    tf.keras.layers.SimpleRNN(64, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(4, activation='softmax')  # 4 emotion classes
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Let's try with LSTMs and a Bidirectional RNN
from tensorflow.keras.layers import LSTM, Bidirectional, Dense

model = Sequential([
    Embedding(vocab_size, embedding_vector, input_length=sent_len),
    LSTM(512, return_sequences=True),
    LSTM(512, return_sequences=True),
    Bidirectional(LSTM(256, activation='tanh')),
    tf.keras.layers.Dropout(0.2),
    Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    Dense(128, activation='relu'),
    Dense(4, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

I used only 10 epochs and got 68% accuracy. Try changing the learning rate, tuning the parameters and increasing the number of epochs for better accuracy.
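
The introduction also mentions GRUs; the article does not include a GRU model, but a comparable sketch (reusing the same vocab_size, embedding_vector and sent_len assumed above) would only swap the recurrent layers:

# GRU variant, not in the original article; shown only for comparison
from tensorflow.keras.layers import GRU

model = Sequential([
    Embedding(vocab_size, embedding_vector, input_length=sent_len),
    GRU(256, return_sequences=True),
    GRU(128),
    tf.keras.layers.Dropout(0.3),
    Dense(64, activation='relu'),
    Dense(4, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])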

Conclusion

Here is a conclusion summarising the whole modelling process.

Points to remember :

  • Preparing text for modelling using tokenizers, stemming, lemmatization and stopwords.
  • Machines can understand only numbers, so we use:
  • For machine-learning problems: Bag of Words, TF-IDF, Word2Vec, etc.
  • For deep-learning problems: one-hot encodings, pad sequences, word embeddings
from tensorflow.keras.preprocessing.text import Tokenizer
  • The Tokenizer takes a “num_words” argument; after creating it we call “fit_on_texts” to build the vocabulary from the sentences.
  • “num_words” specifies the maximum number of words to keep, picking the n most common words.
  • Using “word_index” we can see the index number assigned to each word.
  • The “texts_to_sequences” method converts the text into integer sequences, which is important in NLP.
  • Through pad sequences, all sequences are padded to the same length before being fed into the model (see the sketch below).
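
A minimal sketch tying these steps together (the sentences, num_words and maxlen below are illustrative values, not from the original article):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ['I like to watch cricket news', 'I love to watch movies']

tokenizer = Tokenizer(num_words=100)  # keep at most the 100 most common words
tokenizer.fit_on_texts(sentences)     # build the vocabulary
print(tokenizer.word_index)           # word -> index mapping

sequences = tokenizer.texts_to_sequences(sentences)  # text -> integer sequences
padded = pad_sequences(sequences, maxlen=7, padding='post')
print(padded)                         # every row now has length 7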

All the code and datasets used in this article can be accessed from my GitHub.

Thanks for reading and your interest.

If you like my article, click the clap icon…

Keep in Touch :
