Hi there,
You have been working in data mining and analytics for some time now and want to know how to automatically get meaningful information out of a bag of words in a document written in natural English. If that is something that excites you, read on.
Text mining is the process of extracting meaningful information from text written in natural language, automatically and programmatically, so as to automate information retrieval and generate analytics from this unstructured data. With new ways being devised daily to find relevant information for decision-making, text mining is growing in prominence. To do text mining, you must be able to write code that can "read and understand" human language, whatever the source: a website, a social network profile or a word document.
That gives us two problem areas to solve.
1. Read data
2. Understand data
For the first part, reading data is easy; pretty much every programming language gives you a way to read data from a source. For the sake of this article I will concentrate on Python and its features.
To read data in Python, you open a file and read from it. Suppose we have a text file "ML.txt", stored on the machine, with some information on machine learning. To read it, you open the file in Python and read its contents:
"""Sample code for the same goes here."""
""" This file contains text from a website"""
with open("ML.txt") as f: data = f.read();
Now the data variable in Python holds the contents of the text file ML.txt, so the first step, reading data, is solved. To understand the data, we need to split the bag of words into separate words, remove the unnecessary ones, and then check the most frequently occurring words and derive information from them.
Let's note down, as a list, what needs to be done (a toy run of these steps follows the list).
1. Tokenize - break the stream of text, or bag of words, into separate words.
2. Remove stop words - remove unnecessary words from the list of words obtained in the step above.
3. Remove special characters, which do not carry any meaning.
4. Compute the frequency distribution of the remaining words to find the most frequent ones.
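Here is a toy run of steps 1-3 on a hard-coded sentence, just to show what each step produces. It is a minimal sketch: the sentence and variable names are invented for illustration, and it assumes the NLTK stopwords corpus has already been fetched once with nltk.download('stopwords').

import re
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer

text = "Machine learning, put simply, is fun!"

# Step 3 first: strip special characters so the tokenizer sees clean text
clean = re.sub(r'[^0-9a-zA-Z]', ' ', text)

# Step 1: tokenize the cleaned text into separate words
tokens = WordPunctTokenizer().tokenize(clean)
# ['Machine', 'learning', 'put', 'simply', 'is', 'fun']

# Step 2: remove stop words ('is' and similar filler words drop out)
kept = [t for t in tokens if t.lower() not in stopwords.words('english')]
print(kept)  # ['Machine', 'learning', 'put', 'simply', 'fun']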
So the Python code for the full pipeline goes like this.
import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer

# This file contains text from a website
with open("ML.txt") as f:
    data = f.read()

# Removing special characters (keep only letters and digits)
reg_pattern = r'[^0-9a-zA-Z]'
new_data = re.sub(reg_pattern, ' ', data)

# Tokenize
word_tokens = WordPunctTokenizer().tokenize(new_data)

# Removing stop words (needs a one-time nltk.download('stopwords'))
nostop_word_tokens = [x for x in word_tokens if x.lower() not in stopwords.words('english')]

# Get frequency distribution
fq = nltk.FreqDist(nostop_word_tokens)
print(fq.most_common(5))
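Here fq.most_common(5) returns the five most frequent words as (word, count) tuples, in descending order of count; on a file about machine learning you would expect words like "machine" and "learning" themselves to rank near the top.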
If we show what we discussed as a flow, the block diagram looks like this:

Read data -> Remove special characters -> Tokenize -> Remove stop words -> Frequency distribution
So now you know how basic text mining and natural language processing are done. We will see more on NLP in upcoming articles. Till then, have a nice day!