Text Mining Process

Text mining is the process of analyzing unstructured text data to extract useful information from it.

Before text mining can begin, the text data must be extracted from external resources. This can be done in many ways; two options are listed here.

  • Crawl a set of websites starting from a list of URLs, download each web page, and parse its Document Object Model (DOM) into text, thereby creating a set of documents.
  • Collect tweets through the Twitter API.
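The first option can be sketched with nothing but the Python standard library. This is a minimal illustration, not a full crawler: the class name TextExtractor, the helper html_to_text and the sample page are all made up for this example, and the download step is left out so that the snippet runs without network access.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML page, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(page_html):
    """Returns the text content of one page as a single string."""
    extractor = TextExtractor()
    extractor.feed(page_html)
    extractor.close()
    return " ".join(extractor.chunks)


# Hypothetical page content; in a real crawler this string would come from
# downloading each URL (for example with urllib.request.urlopen).
sample_page = """
<html><head><title>Review</title><style>p { color: red; }</style></head>
<body><h1>Great dress</h1><p>This dress is <b>perfection</b>!</p></body></html>
"""
document = html_to_text(sample_page)
print(document)
```

Each downloaded page would yield one such string, and the collection of strings forms the set of documents.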

The text mining process comprises the following steps:

  1. Text Pre-Processing
  2. Transformation of Text
  3. Selection of Features
  4. Data Mining
  5. Evaluation
  6. Applications

In this blog, only the first step is covered. (more blogs to follow…)

The text corpus chosen is a Women’s Clothing E-Commerce dataset comprising reviews given by customers. It has nine features or attributes, which makes it a good dataset for learning how to analyze text along multiple dimensions. In this blog, only the “Review_Text” attribute (the reviews or opinions given by the users) is used.

The data set can be downloaded from the following URL:

https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews/downloads/womens-ecommerce-clothing-reviews.zip/1

The first step is Text Pre-Processing, which involves text clean-up, tokenization and POS tagging.

But before that let me show you how to access a particular attribute (column) in a csv file.

import pandas as pd

data = pd.read_csv('WC_Ecommerce_Reviews.csv')  # reads the input csv file
data_text = data[['Review_Text']]  # picks the desired column
print(len(data_text))  # prints the total number of rows

Or, drop the columns that are not needed:

to_drop = ['sl', 'CID', 'Age', 'Title', 'Rating', 'Rec', 'Pcount', 'Div', 'Department', 'Class']
data.drop(to_drop, inplace=True, axis=1)  # removes the unwanted columns in place
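To check that selecting a column and dropping the others lead to the same result, here is a small sketch on an in-memory CSV. The rows and the reduced column set are made up for illustration and only stand in for the real file.

```python
import io

import pandas as pd

# A tiny stand-in for WC_Ecommerce_Reviews.csv (hypothetical rows, fewer columns).
csv_text = io.StringIO(
    "sl,Age,Rating,Review_Text\n"
    "0,33,4,Absolutely wonderful\n"
    "1,34,5,Love this dress\n"
)
data = pd.read_csv(csv_text)

# Option 1: select the wanted column; double brackets keep the result a DataFrame.
selected = data[["Review_Text"]]

# Option 2: drop the unwanted columns; without inplace=True a new frame is returned.
dropped = data.drop(columns=["sl", "Age", "Rating"])

print(selected.equals(dropped))
print(list(selected.columns), len(selected))
```

Selecting is usually the shorter choice when only one column is needed, while dropping is handy when most columns are kept.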

Now coming back to the 1st step:

  1. Text clean-up:

Data Decoding: the file is decoded as standard UTF-8. This keeps the entire dataset in one standard encoding, so that accented letters and other special symbols are read as the intended characters instead of garbled bytes.

Code Snippet:

data = pd.read_csv('WC_Ecommerce_Reviews.csv', encoding='utf-8')

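To Lower Case: the text is converted to lower case, so that, for instance, “Dress” and “dress” are treated as the same word.

Code Snippet:

import fileinput

# inplace=True sends print() output back into the file; end='' avoids doubling the line breaks.
# The header row is lower-cased too, so restore the column names after reloading the file.
for line in fileinput.input("WC_Ecommerce_Reviews.csv", inplace=True):
    print(line.lower(), end='')

For example, if the text is “This dress is &lt perfection! so &amp pretty and flattering”,

then the output would be “this dress is &lt perfection! so &amp pretty and flattering”.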

Escaping HTML: a dataset collected from a website might contain HTML entities such as &lt;, &amp; and &gt;. In that case the dataset has to be cleansed by converting these entities back into the characters they stand for. (The unescape function of Python’s html module is used.)

Code Snippet:

import html

new_data = data_text.copy()
# fillna('') guards against missing reviews, which html.unescape cannot handle
new_data['Review_Text'] = new_data['Review_Text'].fillna('').apply(html.unescape)

For example, if the text is “this dress is &lt perfection! so &amp pretty and flattering”,

then the output would be “this dress is < perfection! so & pretty and flattering”. The leftover “<” and “&” characters are stripped when punctuation marks are removed.
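To see exactly what html.unescape returns for the sample review, here is a minimal check that uses only the standard library. Note that the function converts entities into the characters they represent; it does not delete them.

```python
import html

raw = "this dress is &lt perfection! so &amp pretty and flattering"
clean = html.unescape(raw)
print(clean)  # this dress is < perfection! so & pretty and flattering

# Entities written with their closing semicolon are handled the same way.
with_semicolons = html.unescape("size &gt; 10 &amp; &quot;petite&quot;")
print(with_semicolons)  # size > 10 & "petite"
```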

Removal of Punctuation Marks: every character that is neither a word character nor whitespace is removed.

Code Snippet:

new_data['Review_Text'] = new_data['Review_Text'].str.replace(r'[^\w\s]', '', regex=True)

For example, if the text is “this dress is perfection! so pretty and flattering”,

then the output would be “this dress is perfection so pretty and flattering”.
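The clean-up steps so far (lower-casing, HTML unescaping and punctuation removal) can also be chained directly on the review column. The following is a sketch under a few assumptions: the two sample rows are made up, the missing review is handled with fillna, and the final whitespace collapse is an extra step that is not part of the steps described above.

```python
import html

import pandas as pd

# Hypothetical sample rows standing in for the Review_Text column.
reviews = pd.DataFrame({
    "Review_Text": [
        "This dress is &lt perfection! so &amp pretty and flattering",
        None,  # the real dataset may contain missing reviews
    ]
})

cleaned = (
    reviews["Review_Text"]
    .fillna("")                               # missing reviews become empty strings
    .str.lower()                              # convert to lower case
    .apply(html.unescape)                     # decode HTML entities
    .str.replace(r"[^\w\s]", "", regex=True)  # remove punctuation marks
    .str.replace(r"\s+", " ", regex=True)     # collapse repeated whitespace (extra step)
    .str.strip()
)
print(cleaned.tolist())
```

Working on the column rather than on the raw file leaves the CSV header and the other columns untouched.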

Tokenization: involves splitting the text into its constituent parts. The process breaks the text into words, or tokens, which are separated by punctuation, line breaks or whitespace.

Code Snippet:

import nltk  # word_tokenize needs the Punkt tokenizer data: nltk.download('punkt')

tokens = []
for review in new_data['Review_Text']:
    tokens.append(nltk.word_tokenize(review))

For example, if the text is “this dress is perfection so pretty and flattering”,

then the tokens for that review would be:

['this', 'dress', 'is', 'perfection', 'so', 'pretty', 'and', 'flattering']
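For text that has already been cleaned, the idea behind tokenization can be illustrated without NLTK. The function simple_tokenize below is a hypothetical stand-in based on a regular expression; it is not how nltk.word_tokenize works internally, which treats contractions and punctuation with more care.

```python
import re


def simple_tokenize(text):
    """Splits text into runs of letters, digits and underscores."""
    return re.findall(r"\w+", text)


reviews = [
    "this dress is perfection so pretty and flattering",
    "love it",
]
# One token list per review, giving a list of lists.
tokens = [simple_tokenize(review) for review in reviews]
print(tokens[0])
print(tokens[1])
```

Keeping one token list per review makes it possible to trace every token back to the review it came from.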

Removal of Stop Words: stop words are very common words that carry little meaning on their own. Removing them usually does not affect the sentiment of the text. Examples include ‘a’, ‘an’, ‘the’, ‘if’ and so on.

Code Snippet:

from nltk.corpus import stopwords  # needs the stop word corpus: nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
Tokens_Removed = []
for review_tokens in tokens[:3]:  # only the first three reviews are processed here
    for token in review_tokens:
        if token not in stop_words:
            Tokens_Removed.append(token)

For example, for the input

['this', 'dress', 'is', 'perfection', 'so', 'pretty', 'and', 'flattering']

the output would be ['dress', 'perfection', 'pretty', 'flattering'].
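The same filtering can be written as a list comprehension that keeps the tokens grouped by review instead of merging them into one flat list. The stop word set here is a small hand-picked assumption for illustration; the list returned by stopwords.words('english') is much longer.

```python
# A small hand-picked stop word set for illustration only.
stop_words = {"a", "an", "the", "if", "this", "is", "so", "and", "it"}

tokens = [
    ["this", "dress", "is", "perfection", "so", "pretty", "and", "flattering"],
    ["love", "it"],
]

# Filter each review separately so the result stays one list per review.
filtered = [[word for word in review if word not in stop_words] for review in tokens]
print(filtered)
```

Whether a flat list or a list per review is the better fit depends on the later steps: a flat list is enough for overall word counts, while per-review lists are needed for anything computed review by review.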

Stemming (optional): stemming reduces related word forms to a common stem. For example, ‘running’ is reduced to the stem ‘run’. The code below shows how to stem the tokens:

Code Snippet:

from nltk.stem import PorterStemmer

stemming = PorterStemmer()
Stemming_Removal = [stemming.stem(word) for word in Tokens_Removed]

For example, for the input ['dress', 'perfection', 'pretty', 'flattering'], the output would be ['dress', 'perfect', 'pretti', 'flatter'].
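A quick way to see what the stemmer does is to feed it several forms of the same word. This sketch assumes NLTK is installed; PorterStemmer itself needs no extra data download.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Different inflections of the same word collapse to one stem.
words = ["run", "runs", "running"]
stems = [stemmer.stem(word) for word in words]
print(stems)

# A stem is not always a dictionary word.
pretty_stem = stemmer.stem("pretty")
print(pretty_stem)
```

Because stems such as ‘pretti’ are not real words, stemming suits counting and matching better than output meant for human readers.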

POS (Part of Speech) Tagging: this maps every token to its corresponding part of speech (noun, verb, adjective and so on), which shows the grammatical role each word plays in the sentence. It is also called grammatical tagging.

Code Snippet:

tagged_set=nltk.pos_tag(Tokens_Removed)

print(tagged_set)

Example: [('Absolutely', 'RB'), ('wonderful', 'JJ'), ('silky', 'JJ'), ('sexy', 'NN'), ('comfortable', 'JJ'), ('Love', 'NNP'), ('dress', 'NN'), ('pretty', 'RB'), ('happened', 'VBD'), ('find', 'JJ'), ('store', 'NN'), ('glad', 'NN')]
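Once the tokens are tagged, the tags can be used to pick out particular kinds of words. A minimal sketch, using a hard-coded subset of the example output above so that it runs without downloading the NLTK tagger model:

```python
# A few (token, tag) pairs in the shape returned by nltk.pos_tag,
# copied from the example output above.
tagged_set = [
    ("Absolutely", "RB"), ("wonderful", "JJ"), ("silky", "JJ"),
    ("comfortable", "JJ"), ("dress", "NN"), ("store", "NN"),
]

# Tags starting with "JJ" cover JJ, JJR and JJS; "NN" covers all noun tags.
adjectives = [word for word, tag in tagged_set if tag.startswith("JJ")]
nouns = [word for word, tag in tagged_set if tag.startswith("NN")]

print(adjectives)
print(nouns)
```

In review data, adjectives often carry the opinion and nouns the thing being reviewed, which is one reason the tags are worth keeping.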

POS Tag List

Ref: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

  • CC coordinating conjunction
  • CD cardinal digit
  • DT determiner
  • EX existential there (like: “there is” … think of it like “there exists”)
  • FW foreign word
  • IN preposition/subordinating conjunction
  • JJ adjective ‘big’
  • JJR adjective, comparative ‘bigger’
  • JJS adjective, superlative ‘biggest’
  • LS list marker 1)
  • MD modal could, will
  • NN noun, singular ‘desk’
  • NNS noun plural ‘desks’
  • NNP proper noun, singular ‘Harrison’
  • NNPS proper noun, plural ‘Americans’
  • PDT predeterminer ‘all the kids’
  • POS possessive ending parent’s
  • PRP personal pronoun I, he, she
  • PRP$ possessive pronoun my, his, hers
  • RB adverb very, silently
  • RBR adverb, comparative better
  • RBS adverb, superlative best
  • RP particle give up
  • TO to go ‘to’ the store.
  • UH interjection errrrrrrrm
  • VB verb, base form take
  • VBD verb, past tense took
  • VBG verb, gerund/present participle taking
  • VBN verb, past participle taken
  • VBP verb, sing. present, non-3rd person take
  • VBZ verb, 3rd person sing. present takes
  • WDT wh-determiner which
  • WP wh-pronoun who, what
  • WP$ possessive wh-pronoun whose
  • WRB wh-adverb where, when

That’s all for today!!

In my next blog, the second step of the text mining process, Transformation of Text, will be discussed!
