Text Mining Process

Text mining is the processing and analysis of unstructured text data.

Before text mining can begin, the text data must be extracted from external sources. This can be done in many ways; two options are listed here.

Crawl a set of web sites starting from a list of URLs, download each web page, and parse its Document Object Model into text, thereby creating a set of documents.
Collect Tweets through the Twitter API.
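
To make the first option a little more concrete, here is a minimal sketch of the "parse the page into text" part, using only Python's standard html.parser module. The class name TextExtractor, the helper html_to_text and the sample page are my own illustrative assumptions; a real crawler would also need to download the pages and would probably use a fuller parsing library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(page_source):
    """Turn one page's HTML source into a single plain-text document."""
    parser = TextExtractor()
    parser.feed(page_source)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page source standing in for a downloaded web page.
sample_page = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Review</h1><p>Love this dress!</p>"
    "<script>var x = 1;</script></body></html>"
)
document = html_to_text(sample_page)
print(document)
```

Each downloaded page would be passed through html_to_text to become one document in the corpus.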

The text mining process comprises the following steps:

Text Pre-Processing
Transformation of Text
Selection of Features
Data Mining
Evaluation
Applications

This blog covers only the first step (more blogs to follow…).

The text corpus chosen is a Women’s Clothing E-Commerce dataset comprising reviews written by customers. It has nine features (attributes), which makes it a good dataset for learning to parse text along multiple dimensions. In this blog, only the “Review_Text” attribute (the reviews/opinions given by the users) is used.

The data set can be downloaded from the following URL:

https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews/downloads/womens-ecommerce-clothing-reviews.zip/1

The first step is text pre-processing, which involves text clean-up, tokenization and POS tagging.

But before that let me show you how to access a particular attribute (column) in a csv file.

import pandas as pd

data = pd.read_csv('WC_Ecommerce_Reviews.csv')  # reads the input csv file
data_text = data[['Review_Text']]  # picks the desired column
print(len(data_text))  # prints the total no. of rows

# drops the remaining columns from data, in place
to_drop = ['sl', 'CID', 'Age', 'Title', 'Rating', 'Rec', 'Pcount', 'Div', 'Department', 'Class']
data.drop(to_drop, inplace=True, axis=1)

Now, coming back to the first step:

Text clean-up:

Data Decoding: the standard UTF-8 encoding is used for decoding. This keeps the entire dataset in one standard encoding, so that multi-byte symbols are decoded into the correct characters rather than garbled ones.

Code Snippet:

data = pd.read_csv('WC_Ecommerce_Reviews.csv', encoding='utf-8')
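
To see why the encoding matters, here is a small standalone sketch that uses only built-in string methods. The sample bytes are made up for illustration and are not taken from the dataset.

```python
# Hypothetical raw bytes standing in for a line read from a file.
# "\u2013" is an en dash, which takes three bytes in UTF-8.
raw_bytes = "so pretty \u2013 and flattering".encode("utf-8")

# Decoding with the correct codec recovers the original text.
decoded = raw_bytes.decode("utf-8")

# Decoding with the wrong codec raises no error here, but garbles the dash.
garbled = raw_bytes.decode("latin-1")

# errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
broken_bytes = b"pretty \xff dress"
replaced = broken_bytes.decode("utf-8", errors="replace")

print(decoded == garbled)  # the two decodings differ
```

Reading the whole file with one known encoding avoids the silent garbling seen in the latin-1 case.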

To Lower Case: the text is converted into lower case.

Code Snippet:

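import fileinput

for line in fileinput.input('WC_Ecommerce_Reviews.csv', inplace=True):
    # keep the header row as-is so column names such as Review_Text still match
    print(line if fileinput.isfirstline() else line.lower(), end='')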

For example, if the text is “This dress is &lt perfection! so &amp pretty and flattering”,

then the output would be “this dress is &lt perfection! so &amp pretty and flattering”.

Escaping HTML: a dataset taken from a website might contain HTML entities such as &lt, &amp and &gt. In that case the dataset has to be cleansed by converting these back into plain characters. (The unescape function of Python’s html module is used.)

Code Snippet:

import html

new_data = data_text.copy()
# convert the HTML entities in every review back to plain characters
new_data['Review_Text'] = new_data['Review_Text'].fillna('').apply(html.unescape)

For example, if the text is “this dress is &lt perfection! so &amp pretty and flattering”,

then the output would be “this dress is < perfection! so & pretty and flattering”. The stray < and & symbols are removed together with the other punctuation marks in the next step.

Removal of Punctuation Marks:

Code Snippet:

new_data['Review_Text'] = new_data['Review_Text'].str.replace(r'[^\w\s]', '', regex=True)

For example, suppose the text is “this dress is perfection! so pretty and flattering”.

Then the output would be: “this dress is perfection so pretty and flattering”.
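
The clean-up steps so far can also be tried on a single string without pandas. The sketch below chains them with the standard html and re modules. The function name clean_review is my own, and the final whitespace-collapsing step is an extra assumption that is not part of the snippets above.

```python
import html
import re


def clean_review(text):
    """Lower-case, unescape HTML entities, then strip punctuation."""
    text = text.lower()
    text = html.unescape(text)           # "&amp;" becomes "&"
    text = re.sub(r"[^\w\s]", "", text)  # drop anything that is not a word character or whitespace
    text = re.sub(r"\s+", " ", text)     # assumption: collapse the gaps left behind
    return text.strip()


sample = "This dress is &lt; perfection! so &amp; pretty and flattering"
cleaned = clean_review(sample)
print(cleaned)
```

The same function could be applied to a whole column with pandas' apply, if you prefer one pass over three.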

Tokenization: splitting the text into its constituent parts. The process breaks the text into words, or tokens, which are delimited by punctuation marks, line breaks or whitespace.

Code Snippet:

import nltk

# word_tokenize needs NLTK's tokenizer data, downloaded once via nltk.download()
tokens = []
for review in new_data['Review_Text']:
    tokens.append(nltk.word_tokenize(review))

For example, suppose the text is “this dress is perfection so pretty and flattering”.

Then the output would be:

Tokens = [‘this’, ‘dress’, ‘is’, ‘perfection’, ‘so’, ‘pretty’, ‘and’, ‘flattering’]
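
If you just want to see the idea without installing NLTK, a rough stand-in is a regular expression that picks out runs of word characters. This is not what nltk.word_tokenize does internally (NLTK handles contractions and punctuation more carefully); the function simple_tokenize and the second sample review are my own assumptions.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters (letters, digits, underscore)."""
    return re.findall(r"\w+", text)


reviews = [
    "this dress is perfection so pretty and flattering",
    "love it\nfits well",  # hypothetical review containing a line break
]
tokens = [simple_tokenize(review) for review in reviews]
print(tokens[0])
```

Whitespace and line breaks both act as separators here, so each review ends up as its own list of tokens.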

Removal of Stop Words: stop words are common words that carry little meaning on their own. Removing them usually does not affect the sentiment of the text. Examples include ‘a’, ‘an’, ‘the’, ‘if’ and so on.

Code Snippet:

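from nltk.corpus import stopwords

# requires the stop-word lists: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

Tokens_Removed = []
# only the first three reviews are processed here
for review_tokens in tokens[:3]:
    for token in review_tokens:
        if token not in stop_words:
            Tokens_Removed.append(token)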

for example:

Tokens = [‘this’, ‘dress’, ‘is’, ‘perfection’, ‘so’, ‘pretty’, ‘and’, ‘flattering’]

The output would be [‘dress’, ‘perfection’, ‘pretty’, ‘flattering’]
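
The snippet above pours the tokens of several reviews into one flat list. If you would rather keep one token list per review, a sketch along these lines may help. The tiny STOP_WORDS set is hand-picked for illustration only (NLTK's English list is much longer), and the function name and second sample review are assumptions of mine.

```python
# A tiny hand-picked stop-word set for illustration only.
STOP_WORDS = {"a", "an", "the", "if", "this", "is", "so", "and"}


def remove_stop_words(tokenized_reviews, stop_words=STOP_WORDS):
    """Return a new list of token lists with stop words filtered out."""
    return [
        [token for token in review if token not in stop_words]
        for review in tokenized_reviews
    ]


tokens = [
    ["this", "dress", "is", "perfection", "so", "pretty", "and", "flattering"],
    ["the", "fit", "is", "perfect"],  # hypothetical second review
]
filtered = remove_stop_words(tokens)
print(filtered)
```

Keeping the reviews separate makes it easier to trace a token back to the review it came from in later steps.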

Stemming (optional): stemming reduces related word forms to a common stem. For example, ‘running’ is reduced to the stem ‘run’. The code below shows how to stem the tokens:

Code Snippet:

from nltk.stem import PorterStemmer

stemming = PorterStemmer()

Stemming_Removal = [stemming.stem(word) for word in Tokens_Removed]

For example, for the input [‘dress’, ‘perfection’, ‘pretty’, ‘flattering’], the output would be [‘dress’, ‘perfect’, ‘pretti’, ‘flatter’].

POS (Part of Speech) Tagging: also called grammatical tagging, this maps every token to its corresponding part of speech. The tags show the grammatical role each word plays, which helps in working out what a sentence is about.

Code Snippet:

tagged_set=nltk.pos_tag(Tokens_Removed)

print(tagged_set)

Example: [(‘Absolutely’, ‘RB’), (‘wonderful’, ‘JJ’), (‘silky’, ‘JJ’), (‘sexy’, ‘NN’), (‘comfortable’, ‘JJ’), (‘Love’, ‘NNP’), (‘dress’, ‘NN’), (‘pretty’, ‘RB’), (‘happened’, ‘VBD’), (‘find’, ‘JJ’), (‘store’, ‘NN’), (‘glad’, ‘NN’)]
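
Once the tokens are tagged, the (word, tag) pairs are easy to work with using plain Python. The sketch below reuses the example output above, counts the tags, and picks out the words tagged as adjectives. What to do with the tags is my own illustration, not a step from the original process.

```python
from collections import Counter

# The tagged tokens from the example output, copied as-is.
tagged_set = [
    ("Absolutely", "RB"), ("wonderful", "JJ"), ("silky", "JJ"),
    ("sexy", "NN"), ("comfortable", "JJ"), ("Love", "NNP"),
    ("dress", "NN"), ("pretty", "RB"), ("happened", "VBD"),
    ("find", "JJ"), ("store", "NN"), ("glad", "NN"),
]

# How often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged_set)

# Words tagged as adjectives (JJ, JJR or JJS).
adjectives = [word for word, tag in tagged_set if tag.startswith("JJ")]

print(tag_counts)
print(adjectives)
```

Notice that ‘find’ comes out as an adjective and ‘glad’ as a noun. A likely reason is that the stop words were removed before tagging, so the tagger sees less context than it would in the full sentence.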

POS Tag List

Ref: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go ‘to’ the store.
UH interjection errrrrrrrm
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3rd person take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb when

That’s all for today!!

In my next blog, the second step of the text mining process, Transformation of Text, will be discussed!!
