AUTOMATED DOCUMENT CLASSIFICATION FOR NEWS ARTICLE IN BAHASA INDONESIA BASED ON TERM FREQUENCY INVERSE DOCUMENT FREQUENCY (TF-IDF) APPROACH
Meta
Author
Ari Aulia Hakim
Alva Erwin, Advisor
Kho I Eng, Co-Advisor
Year
2014
Research Question
How accurate is the TF-IDF algorithm in classifying documents?
How much does human interference affect the TF-IDF algorithm in classifying documents?
Information
What is
Information explosion
An era in which the amount of information published increases rapidly, so that the massive amount of information cannot be maintained easily (Kadiri & Adetoro, 2012)
Unstructured data
Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may also contain data such as dates, numbers, and facts.
It usually refers to information that does not reside in a traditional row-column database.
Text Mining
Text mining is a field of study that helps people manage data or information in the form of text; it can extract important information from text automatically.
Among text mining's subfields, text categorization can simplify the task of defining an article's topic and can group articles by topic automatically.
Text classification
Text categorization can improve queries over the data stored in a database (Basnur & Sensuse, 2010). The need for automated classification machines will grow in the future, since the amount of data increases in a short time (Basnur & Sensuse, 2010).
Arni Darliani Asy'arie (2009) developed classification software for news articles in Bahasa Indonesia using the Naïve Bayes Classifier algorithm.
Topics
Beauty
Business
Food and Drink
Economy
Education
Entertainment
Football
Health
Life Style
Automotive
Politic
Property
Sport
Technology
Travel
Methodology
This research will create an article classifier that can classify articles into predefined categories
Related
Sriram (Sriram & Fuhry, 2010) conducted text classification on English-language tweets from Twitter. He used the Naïve Bayes algorithm to create a classifier that categorizes tweets into predefined categories (news, opinion, deals, events, and private messages).
Term Frequency–Inverse Document Frequency
To categorize articles, all of the words in an article have to be converted into numbers, because a computer cannot recognize words: it does not know the meaning of a word or how well a word can describe a category
TF-IDF is one of the most recognized algorithms in text mining research (Xia & Chai, 2011).
The number given to a word depends on how well the word represents or describes one or several categories; a bigger weight means the related word represents those categories better. All of the weights are stored in the words' weight dictionary.
TF-IDF is an algorithm that combines the calculation of term frequency (tf) and inverse document frequency (idf) by multiplying them
Based on Gebre's study (Gebre & Zampier, 2012), term frequency is the number of times a word can be found in an essay or document, and idf is computed as the log of the inverse probability of the word being found in any essay.
tfidf(t, d, D) = tf(t, d) × idf(t, D)
TF-IDF can produce a good weight for words that exist in the lexicon (Liu & Yang, 2012) because it considers both term frequency and inverse document frequency; the algorithm can still produce a good words' weight dictionary for several different categories even when a term's frequency is high but the term is found in only a small portion of the documents.
It is the most recognized word-weighting algorithm (Liu & Yang, 2012).
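A minimal sketch of the formula above in Python, assuming a raw-count tf and an unsmoothed idf (the thesis does not spell out these choices):

```python
import math

def tf(term, doc):
    # term frequency: the number of times the term appears in the document
    return doc.split().count(term)

def idf(term, docs):
    # log of the inverse probability of the term being found in any document
    containing = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / containing) if containing else 0.0

def tfidf(term, doc, docs):
    # tfidf(t, d, D) = tf(t, d) * idf(t, D)
    return tf(term, doc) * idf(term, docs)
```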
Normalized Term Frequency–Inverse Document Frequency
Documents today come in different sizes, and some differ significantly.
To prevent contrasting results and extra complexity, the tf-idf result has to be normalized so that the total sum of a word's weights across all of the predefined categories is 1 (Vembunarayanan, 2013).
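A minimal sketch of the normalization, assuming a word's weights are kept as a category-to-weight mapping (the names here are illustrative):

```python
def normalize(weights):
    # scale a word's per-category weights so they sum to 1
    total = sum(weights.values())
    return {cat: w / total for cat, w in weights.items()} if total else weights

# normalize({"Sport": 2.0, "Politic": 1.0, "Health": 1.0})
# -> {"Sport": 0.5, "Politic": 0.25, "Health": 0.25}
```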
Stop-words removal and Tokenization
Stop-words removal is one of the methods that has been used since the early days of information retrieval research (Al-Shalabi, 2004).
It is mainly used to eliminate several things from a sentence or text document: the terms that appear the most and the least, and unimportant words that do not have a specific meaning, for example in English "the", "a", or "an".
The words mentioned above do not carry a specific meaning and cannot represent the text file in which they are found.
Tokenization is somewhat similar to stop-words removal, but this process only removes the non-string tokens found in a document. Both steps are sketched below.
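A minimal sketch of both steps, using a tiny illustrative sample of Indonesian stop words (the thesis's actual list is not given):

```python
import re

# illustrative sample only; a real Indonesian stop-word list is much longer
STOP_WORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu", "untuk", "pada"}

def tokenize(text):
    # keep alphabetic tokens only, dropping numbers and punctuation
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]
```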
Title Based Extraction
Text mining research usually needs a big text document as input; the bigger the input file, the longer the processing time and the greater the complexity.
According to Gupta and Lehal (2010), title-based extraction can help summarize an article based on the title of the document.
To reduce the processing time and the complexity, title-based extraction can be implemented in the research.
So, in this step, all of the sentences that contain one or more words from the title will be stored for further processing, as sketched below.
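A minimal sketch of this step, assuming naive sentence splitting on end punctuation (the thesis does not describe its splitter):

```python
import re

def title_based_extraction(title, article):
    # keep only the sentences that share at least one word with the title
    words = lambda s: {w.lower() for w in re.findall(r"[A-Za-z]+", s)}
    title_words = words(title)
    sentences = re.split(r"(?<=[.!?])\s+", article)
    return [s for s in sentences if title_words & words(s)]
```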
Methodology
The research is divided into two parts
Preprocessing phase
Document Preparation
To make a text classifier that implements TF-IDF, the lexicon or word list must be prepared first.
To create a lexicon, the classifier needs text documents as its base. In this phase the articles have already been labeled or categorized; a bigger lexicon will produce more precise results.
500 articles in Bahasa Indonesia were collected for each category and placed in 15 different folders (7,500 articles in total), as sketched below.
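A sketch of loading such a training set, assuming one folder per category containing plain-text article files (the exact layout is an assumption):

```python
import os

def load_training_set(root):
    # root contains 15 category folders, each holding ~500 labeled articles
    training = {}
    for category in sorted(os.listdir(root)):
        folder = os.path.join(root, category)
        if not os.path.isdir(folder):
            continue
        docs = []
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                docs.append(f.read())
        training[category] = docs
    return training
```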
Lexicon Construction
All of the documents prepared in the previous step will be processed, and every unique word in the documents will be listed in the lexicon.
Several steps have to be performed: tokenization, stop-words removal, and then supervised word removal.
Most text mining research involves words or sentences; to process them further, they have to be split into single words. This process is called segmentation (Tokenizing, 2011).
Word Weighting
Each word in the lexicon will have a weight for each category, based on how well the word can represent that category
TF-IDF algorithm
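One plausible reading of this step, sketched below: treat each category's articles as one large document and give every lexicon word a tf-idf weight per category. Whether idf is computed over categories or over individual articles is an assumption; the thesis does not say:

```python
import math
from collections import Counter

def build_weight_dictionary(training):
    # training: {category: [article_text, ...]} from the preparation step
    counts = {c: Counter(" ".join(docs).lower().split())
              for c, docs in training.items()}
    lexicon = set()
    for counter in counts.values():
        lexicon.update(counter)
    n = len(counts)
    weights = {}
    for word in lexicon:
        # df: the number of categories in which the word appears at all
        df = sum(1 for c in counts if counts[c][word] > 0)
        idf = math.log(n / df)
        # weight per category: tf in that category times idf
        weights[word] = {c: counts[c][word] * idf for c in counts}
    return weights
```

The resulting per-word weights could then be normalized as described in the Normalized TF-IDF section.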
Processing phase
Input file
To measure the accuracy of the classifier, it has to be tested with unlabeled or uncategorized text documents. The input files will be categorized by the TF-IDF classifier.
12,000 articles (800 for each category) were crawled using a Python script (sketched at the end of this subsection).
Before performing the test, all of the data was validated and grouped under the best-matching category. This validation involved 53 people and took about 89 hours (the total time people spent categorizing the articles).
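A hedged sketch of such a crawler; the actual script, source sites, and URL list are not given in the source, and the paragraph-tag extraction here is an assumption:

```python
import requests
from bs4 import BeautifulSoup

def crawl(urls):
    # download each page and keep only the visible paragraph text
    articles = []
    for url in urls:
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        articles.append(" ".join(p.get_text() for p in soup.find_all("p")))
    return articles
```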
Processing and Output Creation
First, all of the sentences in the article that contain one or more words from the title will be stored, in order to find the important sentences for further processing.
Those sentences have to be tokenized and all of the stop-words found in the articles have to be removed; each remaining word will also be placed in an ArrayList.
After the article's words have been listed in the ArrayList, each word will be compared with the words in the lexicon; if the word exists in the lexicon, it will be represented by its weight in each category, in matrix form.
At the end, all of the matrices will be summed, and the script will find the index containing the maximum value of the computation; that index gives the resulting category, as sketched below.
Then, for the next file, the ArrayList that holds the words from the article has to be emptied so that it gives the correct result. This also saves processing time, since the script writes over the same memory rather than creating a new ArrayList object.
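A minimal sketch of the scoring loop described above. The thesis's implementation appears to use Java ArrayLists; this Python version assumes the words' weight dictionary built in the preprocessing phase:

```python
def classify(tokens, weights, categories):
    # sum each lexicon word's per-category weights, then pick the
    # category with the maximum total
    totals = dict.fromkeys(categories, 0.0)
    for word in tokens:
        if word in weights:          # words outside the lexicon are skipped
            for c in categories:
                totals[c] += weights[word][c]
    return max(totals, key=totals.get)
```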
Saad (2010)
He created 7 classifiers for Arabic documents with 7 different algorithms: Decision Trees (DT), K-Nearest Neighbors (KNN), Support Vector Machines (SVMs), Naïve Bayes (NB), and NB variants (Naïve Bayes Multinomial (NBM), Complement Naïve Bayes (CNB), and Discriminative Multinomial NB (DMNB)).
He used ten different categories: economy, history, education and family, religion, sport, health, astronomy, law, folktales, and cooking recipes.
He collected 22,429 articles related to those topics from different sources, such as BBC Arabic and CNN Arabic, to create a word dictionary.
In the dictionary creation phase, he performed several processes so that the words in the articles could be listed in the lexicon.
He excluded some contents of the articles, such as punctuation, numbers, words consisting of 2–3 letters, and stop words, because those tokens cannot give explicit information about the topic of an article and removing them reduces the processing time.
He converted every word into its base or root form. After that, he represented the words as numbers using each word's occurrence, term frequency (tf), and inverse document frequency (idf).
He computed the weight for each word using term frequency–inverse document frequency (tf-idf). The lexicon and the word weights were used in the implementation of the algorithms listed above.
After the dictionary was created, he implemented all of the algorithms in RapidMiner. SVM produced the best output with an accuracy of 94.11%, followed by DMNB in second place at 92.33%, with KNN in last place at 62.47%.
Goals
Problem
The amount of information spread in both digital and printed media has grown significantly
The most common information that can be found is unstructured data, a situation that may lead people into the information explosion era
Objective
Build automated classification software that classifies news articles in Bahasa Indonesia by implementing the TF-IDF algorithm.
Results
Lexicon and Words’ Weight Dictionary
Accuracy of TF-IDF Semantic Analysis in Classifying Documents