Last time we saw some interesting libraries and applications of NLP (click here to go to part-1). Let's extend that here and dive straight into building a step-by-step NLP pipeline:
NOTE: I have attached the link to my IPython Notebook here; I'd recommend having a look at it alongside this article for better understanding: 👇https://github.com/ApoorvTyagi/NLP_Activities/blob/master/NLP-Starter.ipynb
1. Sentence Tokenization
Suppose we are given a paragraph. The first step in understanding it is to break it into its individual sentences, and each sentence in turn into its individual words; this is exactly what tokenization does (sentence tokenization splits the text into sentences, word tokenization splits a sentence into words). In NLP we can use NLTK or SpaCy for tokenization, and we can also use the deep learning framework Keras. Tokenization is used for spelling correction, processing searches, identifying parts of speech, document classification, etc.
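Here is a minimal sketch of how this might look with NLTK's tokenizers (the sample paragraph is made up for illustration, and the Punkt model needs a one-time download):

```python
import nltk
nltk.download('punkt')  # one-time download of the Punkt sentence tokenizer models

from nltk.tokenize import sent_tokenize, word_tokenize

paragraph = "Backgammon is one of the oldest known board games. It is played by two players."

sentences = sent_tokenize(paragraph)   # split the paragraph into sentences
words = word_tokenize(sentences[0])    # split the first sentence into words

print(sentences)
print(words)
```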
2. Stopword Removal
In a sentence there are often words like 'is', 'an', 'a', 'the', 'in', etc. which are unnecessary for us, as they rarely add any value to information retrieval, so it's better to remove these words and save our computation. NLTK and many other libraries like SpaCy and Gensim support stopword removal. To remove stop words from a sentence, you first tokenize it and then drop each token that exists in the list of stopwords provided in the corpus module of the library.
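A small sketch of stopword removal with NLTK (the sample sentence is made up; the stopword list needs a one-time download):

```python
import nltk
nltk.download('punkt')
nltk.download('stopwords')  # one-time download of the stopword lists

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sentence = "This is a sample sentence showing off stop word filtration"
stop_words = set(stopwords.words('english'))

tokens = word_tokenize(sentence)
filtered = [w for w in tokens if w.lower() not in stop_words]  # drop tokens found in the stopword list

print(filtered)
```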
3. Stemming
Stemming, as the word suggests, is the process of slicing off the affixes from the end and the beginning of a word in order to get the root word. Most words take on a different surface form when we attach an affix to them, for example:
Play, Played, Plays, Playing; all these words are different forms of the same word 'Play'. So in order to compare words directly, we first apply stemming to them.
NLTK provides several stemmers, including the Porter stemmer and the Snowball stemmer; in our implementation we have used the Porter stemmer.
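A quick sketch of the Porter stemmer applied to the example words above:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["play", "played", "plays", "playing"]:
    print(word, "->", stemmer.stem(word))  # all four forms reduce to the stem "play"
```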
4. Lemmatization
Things which we can't achieve with stemming can be done using lemmatization. This is the process of figuring out the most basic form (the lemma) of each word in a sentence. For example, verbs in the past tense are changed into the present (e.g. "went" is changed to "go") and synonyms are unified (e.g. "best" is changed to "good"), hence standardizing words with similar meaning to their root. Lemmatization also takes the context of the word into consideration in order to solve other problems like disambiguation, which means it can discriminate between identical words that have different meanings depending on the specific context.
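A small sketch with NLTK's WordNet lemmatizer (the exact outputs depend on the part-of-speech tag you pass; the one-time WordNet download is assumed):

```python
import nltk
nltk.download('wordnet')  # one-time download of the WordNet data used by the lemmatizer

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("went", pos="v"))    # 'go'   -- past tense mapped back to the base verb
print(lemmatizer.lemmatize("better", pos="a"))  # 'good' -- the adjective is resolved via WordNet
print(lemmatizer.lemmatize("studies", pos="n")) # 'study'
```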
5. Similarity
Similarity is the process of comparing two sentences at a time and determining how similar the two are. We do this using the Vector Space Model.
The idea is to treat each sentence as a vector. The coordinates of the vector are obtained by taking the term frequencies, as each term represents a dimension.
Two terms would mean 2-dimensions (X and Y).
Three terms would mean 3-dimensions (X, Y and Z) and so on.
So, for example, let's say we have two sentences (after tokenization, stopword removal, stemming and lemmatization):
- Sachin play cricket
- Sachin study
Now, the terms here are (Sachin, play, cricket, study). Let's sort them just to normalize them.
The terms after sorting, are (cricket, play, Sachin, study).
Now we can plot the two sentences in a 4-dimensional graph, by taking the coordinates from the term frequencies.
Meaning, sentence 1 "Sachin play Cricket" will have the coordinates (1, 1, 1, 0)
And sentence 2 "Sachin study" will have the coordinates (1, 0, 0, 1)
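A small sketch of building these term-frequency vectors (the helper function term_frequency_vector is my own name, used purely for illustration):

```python
from collections import Counter

def term_frequency_vector(sentence, terms):
    """Return the term-frequency coordinates of a sentence over a fixed, sorted term list."""
    counts = Counter(sentence.lower().split())
    return [counts[t] for t in terms]

terms = sorted(["sachin", "play", "cricket", "study"])     # ['cricket', 'play', 'sachin', 'study']

v1 = term_frequency_vector("Sachin play cricket", terms)   # [1, 1, 1, 0]
v2 = term_frequency_vector("Sachin study", terms)          # [1, 0, 0, 1]
print(v1, v2)
```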
We can calculate similarity using:
(a) Euclidean Similarity
Given two points in an N-dimensional space, the distance between them is the square root of the sum of the squared differences of their coordinates.
In Euclidean similarity, the smaller the value the better, as it means the two points are closer together.
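A tiny sketch of the Euclidean distance between the two example vectors:

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

v1 = [1, 1, 1, 0]  # "Sachin play cricket"
v2 = [1, 0, 0, 1]  # "Sachin study"
print(euclidean_distance(v1, v2))  # sqrt(0 + 1 + 1 + 1) ≈ 1.732
```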
(b) Cosine Similarity
In cosine similarity we look at the angle between the two vectors: we take the dot product of the vectors and divide it by the product of their magnitudes, which gives the cosine of the angle between them.
Concretely, we multiply the term frequencies dimension by dimension and sum them up to get the dot product.
In the case of cosine similarity, the larger the value the better: the cosine function decreases as the angle grows, so a larger cosine means a smaller angle and hence more similar vectors.
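And a matching sketch for cosine similarity on the same vectors:

```python
import math

def cosine_similarity(a, b):
    """Dot product of the vectors divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [1, 1, 1, 0]  # "Sachin play cricket"
v2 = [1, 0, 0, 1]  # "Sachin study"
print(cosine_similarity(v1, v2))  # 1 / (sqrt(3) * sqrt(2)) ≈ 0.408
```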
6. Naive Bayes Classifier
Naive Bayes is a statistical multiclass classification technique based on Bayes' Theorem. It is one of the simplest supervised learning algorithms, yet it is fast, accurate and reliable, with high accuracy and speed even on large datasets.
It works by comparing the probability that a document D belongs to a class Cx, based on the likelihood of Cx being the correct class and the likelihood that the terms in D belong to Cx. It does not factor in inter-term relationships, as it "naively" assumes each term contributes to the probability of the document's final class individually and independently.
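A minimal sketch of a Naive Bayes text classifier, here using scikit-learn's MultinomialNB rather than a from-scratch implementation (the tiny training set below is made up purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set: documents and their classes
docs = ["Sachin plays cricket", "Virat scored a century",
        "The election results are out", "The new policy was announced"]
labels = ["sports", "sports", "politics", "politics"]

vectorizer = CountVectorizer()        # turn documents into term-frequency vectors
X = vectorizer.fit_transform(docs)

clf = MultinomialNB()                 # multinomial Naive Bayes over the term counts
clf.fit(X, labels)

test = vectorizer.transform(["Sachin scored a century"])
print(clf.predict(test))              # expected: ['sports']
```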
7. Hierarchical Clustering
Clustering is a way to identify documents similar to each other. It's usually an unsupervised learning problem, while classification is a supervised learning problem.
Basically, in an unsupervised problem you don't tell the machine what it's looking for - it finds that out itself. In a supervised problem, you tell the machine what to find, given what information.
Clustering is grouping up some data points and claiming that these data points are similar to each other under certain parameters.
The idea of hierarchical clustering is simple: we build a "dendrogram", which is basically just a tree that describes the order of clustering based on the chosen linkage heuristic:
- Single Linkage
- Complete Linkage
In our case we will use single linkage, and here's the algorithm (a sketch in code follows the steps):
- Find the distance between every pair of points [takes O(n^2)]
- Then, join the pair whose distance is smallest - this forms the first cluster
- Then recalculate the distances between all pairs, except that whenever you're considering a cluster's distance to a point, you take the smallest value among the distances between the external point and all internal points of the cluster.
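A minimal sketch of single-linkage clustering, here leaning on SciPy's linkage routine instead of hand-coding the recalculation step (the input vectors are made up, standing in for preprocessed document vectors):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Made-up document vectors (e.g. term-frequency vectors after preprocessing)
points = np.array([[1, 1, 1, 0],
                   [1, 0, 0, 1],
                   [1, 1, 0, 0],
                   [0, 0, 1, 1]])

distances = pdist(points, metric='euclidean')  # distance between every pair of points
Z = linkage(distances, method='single')        # single linkage: always merge the closest pair first
print(Z)                                       # each row records one merge step of the dendrogram

# dendrogram(Z)  # with matplotlib available, this would draw the tree
```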
Happy NLPing 😉