AUTOMATED DOCUMENT CLASSIFICATION FOR NEWS ARTICLE IN BAHASA INDONESIA BASED ON TERM FREQUENCY INVERSE DOCUMENT FREQUENCY (TF-IDF) APPROACH
Meta
Author
Ari Aulia Hakim
Alva Erwin, Advisor
Kho I Eng, Co-Advisor
Year
2014
Research Question
How accurate is the TF-IDF algorithm in classifying documents?
How much does human interference affect the TF-IDF algorithm in classifying documents?
Information
What is
Information explosion
An era in which the amount of information published increases rapidly, so that the massive amount of information cannot be maintained easily (Kadiri & Adetoro, 2012)
Unstructured data
Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may also contain data such as dates, numbers, and facts.
It usually refers to information that does not reside in a traditional row-column database.
Text Mining
Text mining is a field of study that helps people manage data or information in the form of text; it can extract important information from text automatically.
Among text mining's subfields, text categorization can simplify the task of defining an article's topic and can group articles by topic automatically.
Text classification
Text categorization can improve queries over the data stored in a database (Basnur & Sensuse, 2010). The need for automated classification machines will grow in the future, since the amount of data increases in a short time (Basnur & Sensuse, 2010).
Arni Darliani Asy'arie (2009) developed classification software for news articles in Bahasa Indonesia using the Naïve Bayes Classifier algorithm.
Topics
Beauty
Business
Food and Drink
Economy
Education
Entertainment
Football
Health
Life Style
Automotive
Politic
Property
Sport
Technology
Travel
Methodology
This research will create an article classifier that can classify articles into predefined categories
Related
Sriram (Sriram & Fuhry, 2010) conducted text classification on English-language tweets from Twitter. He used the Naïve Bayes algorithm to create a classifier that categorizes tweets into predefined categories (news, opinion, deals, events, and private messages).
Term Frequency–Inverse Document Frequency
To categorize articles, all of the words in an article have to be converted into numbers, because a computer cannot recognize words: it does not know the meaning of a word or how well a word can describe a category
TF-IDF is one of the most recognized algorithms in text mining research (Xia & Chai, 2011).
The number given to a word depends on how well the word represents or describes one or several categories; a bigger weight means the related word represents those categories better. All of the weights are stored in the words' weight dictionary.
TF-IDF is an algorithm that combines the calculation of term frequency (tf) and inverse document frequency (idf) by multiplying them
Based on Gebre's study (Gebre & Zampier, 2012), term frequency is the number of times a word can be found in an essay or document, and idf is computed as the log of the inverse probability of the word being found in any essay.
tfidf(t, d, D) = tf(t, d) × idf(t, D)
TF-IDF can produce a good weight for words that exist in the lexicon (Liu & Yang, 2012) because it considers both term frequency and inverse document frequency; the algorithm can still produce a good words' weight dictionary for several different categories even when a term's frequency is high but the term is found in only a small portion of the documents.
It is the most recognized word-weighting algorithm (Liu & Yang, 2012).
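A minimal sketch of the formula above in Python, assuming a raw-count tf and an unsmoothed idf (the thesis does not spell out these choices):

```python
import math

def tf(term, doc):
    # term frequency: the number of times the term appears in the document
    return doc.split().count(term)

def idf(term, docs):
    # log of the inverse probability of the term being found in any document
    containing = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / containing) if containing else 0.0

def tfidf(term, doc, docs):
    # tfidf(t, d, D) = tf(t, d) * idf(t, D)
    return tf(term, doc) * idf(term, docs)
```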
Normalized Term Frequency–Inverse Document Frequency
Documents today come in different sizes, and some differ significantly.
To prevent contrasting results and extra complexity, the tf-idf result has to be normalized so that the total sum of a word's weights across all of the predefined categories is 1 (Vembunarayanan, 2013).
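A minimal sketch of the normalization, assuming a word's weights are kept as a category-to-weight mapping (the names here are illustrative):

```python
def normalize(weights):
    # scale a word's per-category weights so they sum to 1
    total = sum(weights.values())
    return {cat: w / total for cat, w in weights.items()} if total else weights

# normalize({"Sport": 2.0, "Politic": 1.0, "Health": 1.0})
# -> {"Sport": 0.5, "Politic": 0.25, "Health": 0.25}
```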
Stop-words removal and Tokenization
Stop-words removal is one of the methods that has been used since the early days of information retrieval research (Al-Shalabi, 2004).
It is mainly used to eliminate several things from a sentence or text document: the terms that appear the most and the least, and unimportant words that do not have a specific meaning, for example in English "the", "a", or "an".
The words mentioned above do not carry a specific meaning and cannot represent the text file in which they are found.
Tokenization is somewhat similar to stop-words removal, but this process only removes the non-string tokens found in a document. Both steps are sketched below.
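A minimal sketch of both steps, using a tiny illustrative sample of Indonesian stop words (the thesis's actual list is not given):

```python
import re

# illustrative sample only; a real Indonesian stop-word list is much longer
STOP_WORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu", "untuk", "pada"}

def tokenize(text):
    # keep alphabetic tokens only, dropping numbers and punctuation
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]
```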
Title Based Extraction
Text mining research usually needs a big text document as input; the bigger the input file, the longer the processing time and the greater the complexity.
According to Gupta and Lehal (2010), title-based extraction can help summarize an article based on the title of the document.
To reduce the processing time and the complexity, title-based extraction can be implemented in the research.
So, in this step, all of the sentences that contain one or more words from the title will be stored for further processing, as sketched below.
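A minimal sketch of this step, assuming naive sentence splitting on end punctuation (the thesis does not describe its splitter):

```python
import re

def title_based_extraction(title, article):
    # keep only the sentences that share at least one word with the title
    words = lambda s: {w.lower() for w in re.findall(r"[A-Za-z]+", s)}
    title_words = words(title)
    sentences = re.split(r"(?<=[.!?])\s+", article)
    return [s for s in sentences if title_words & words(s)]
```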
Methodology
The research is divided into two parts
Preprocessing phase
Document Preparation
To make a text classifier that implements TF-IDF, the lexicon or word list must be prepared first.
To create a lexicon, the classifier needs text documents as its base. In this phase the articles have already been labeled or categorized; a bigger lexicon will produce more precise results.
500 articles in Bahasa Indonesia were collected for each category and placed in 15 different folders (7,500 articles in total), as sketched below.
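A sketch of loading such a training set, assuming one folder per category containing plain-text article files (the exact layout is an assumption):

```python
import os

def load_training_set(root):
    # root contains 15 category folders, each holding ~500 labeled articles
    training = {}
    for category in sorted(os.listdir(root)):
        folder = os.path.join(root, category)
        if not os.path.isdir(folder):
            continue
        docs = []
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                docs.append(f.read())
        training[category] = docs
    return training
```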
Lexicon Construction
All of the documents prepared in the previous step will be processed, and every unique word in the documents will be listed in the lexicon.
Several steps have to be performed: tokenization, stop-words removal, and then supervised word removal.
Most text mining research involves words or sentences; to process them further, they have to be split into single words. This process is called segmentation (Tokenizing, 2011).
Word Weighting
Each word in the lexicon will have a weight for each category, based on how well the word can represent that category
TF-IDF algorithm
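One plausible reading of this step, sketched below: treat each category's articles as one large document and give every lexicon word a tf-idf weight per category. Whether idf is computed over categories or over individual articles is an assumption; the thesis does not say:

```python
import math
from collections import Counter

def build_weight_dictionary(training):
    # training: {category: [article_text, ...]} from the preparation step
    counts = {c: Counter(" ".join(docs).lower().split())
              for c, docs in training.items()}
    lexicon = set()
    for counter in counts.values():
        lexicon.update(counter)
    n = len(counts)
    weights = {}
    for word in lexicon:
        # df: the number of categories in which the word appears at all
        df = sum(1 for c in counts if counts[c][word] > 0)
        idf = math.log(n / df)
        # weight per category: tf in that category times idf
        weights[word] = {c: counts[c][word] * idf for c in counts}
    return weights
```

The resulting per-word weights could then be normalized as described in the Normalized TF-IDF section.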
Processing phase
Input file
To measure the accuracy of the classifier, it has to be tested with unlabeled or uncategorized text documents. The input files will be categorized by the TF-IDF classifier.
12,000 articles (800 for each category) were crawled using a Python script (sketched at the end of this subsection).
Before performing the test, all of the data was validated and grouped under the best-matching category. This validation involved 53 people and took about 89 hours (the total time people spent categorizing the articles).
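A hedged sketch of such a crawler; the actual script, source sites, and URL list are not given in the source, and the paragraph-tag extraction here is an assumption:

```python
import requests
from bs4 import BeautifulSoup

def crawl(urls):
    # download each page and keep only the visible paragraph text
    articles = []
    for url in urls:
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        articles.append(" ".join(p.get_text() for p in soup.find_all("p")))
    return articles
```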
Processing and Output Creation
First, all of the sentences in the article that contain one or more words from the title will be stored, in order to find the important sentences for further processing.
Those sentences have to be tokenized and all of the stop-words found in the articles have to be removed; each remaining word will also be placed in an ArrayList.
After the article's words have been listed in the ArrayList, each word will be compared with the words in the lexicon; if the word exists in the lexicon, it will be represented by its weight in each category, in matrix form.
At the end, all of the matrices will be summed, and the script will find the index containing the maximum value of the computation; that index gives the resulting category, as sketched below.
Then, for the next file, the ArrayList that holds the words from the article has to be emptied so that it gives the correct result. This also saves processing time, since the script writes over the same memory rather than creating a new ArrayList object.
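A minimal sketch of the scoring loop described above. The thesis's implementation appears to use Java ArrayLists; this Python version assumes the words' weight dictionary built in the preprocessing phase:

```python
def classify(tokens, weights, categories):
    # sum each lexicon word's per-category weights, then pick the
    # category with the maximum total
    totals = dict.fromkeys(categories, 0.0)
    for word in tokens:
        if word in weights:          # words outside the lexicon are skipped
            for c in categories:
                totals[c] += weights[word][c]
    return max(totals, key=totals.get)
```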
Saad (2010)
He created 7 classifiers for Arabic documents with 7 different algorithms: Decision Trees (DT), K-Nearest Neighbors (KNN), Support Vector Machines (SVMs), Naïve Bayes (NB), and NB variants (Naïve Bayes Multinomial (NBM), Complement Naïve Bayes (CNB), and Discriminative Multinomial NB (DMNB)).
He used ten different categories: economy, history, education and family, religion, sport, health, astronomy, law, folktales, and cooking recipes.
He collected 22,429 articles related to those topics from different sources, such as BBC Arabic and CNN Arabic, to create a word dictionary.
In the dictionary creation phase, he performed several processes so that the words in the articles could be listed in the lexicon.
He excluded some contents of the articles, such as punctuation, numbers, words consisting of 2–3 letters, and stop words, because those tokens cannot give explicit information about the topic of an article and removing them reduces the processing time.
He converted every word into its base or root form. After that, he represented the words as numbers using each word's occurrence, term frequency (tf), and inverse document frequency (idf).
He computed the weight for each word using term frequency–inverse document frequency (tf-idf). The lexicon and the word weights were used in the implementation of the algorithms listed above.
After the dictionary was created, he implemented all of the algorithms in RapidMiner. SVM produced the best output with an accuracy of 94.11%, followed by DMNB in second place at 92.33%, with KNN in last place at 62.47%.
Goals
Problem
The amount of information spread in both digital and printed media has grown significantly
The most common information that can be found is unstructured data, a situation that may lead people into the information explosion era
Objective
Build automated classification software that classifies news articles in Bahasa Indonesia by implementing the TF-IDF algorithm.
Results
Lexicon and Words’ Weight Dictionary
Accuracy of TF-IDF Semantic Analysis in Classifying Documents