Designing an Indonesian Part of speech Tagset and Manually Tagged Indonesian Corpus
Meta
Author
Arawinda Dinakaramani
Fam Rashel
Andry Luthfi
Ruli Manurung
Year
2014
Result
An Indonesian POS tagset consisting of 23 tags, and an Indonesian corpus of over 250,000 lexical tokens that have been manually tagged using this tagset.
The results of this work are a POS tagset for Indonesian and a manually tagged Indonesian corpus using this tagset. These results can be used for further work on Indonesian NLP, such as developing statistical and rule-based automatic POS taggers.
Info
POS
The part of speech (POS) of a word, also known as its grammatical category, is an indicator of the syntactic and morphological behaviour of a particular lexical item.
Knowing whether a word is a noun, verb, adverb, or adjective can provide valuable linguistic insight, and thus POS tagging, the process of assigning a POS tag to a word, is a fundamental process that supports almost all NLP applications.
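As a small illustration of what POS tagging produces, the sketch below represents a tagged sentence as (token, tag) pairs and counts the tags. The example sentence, the Penn Treebank-style tag labels, and the helper name `tag_counts` are illustrative assumptions and are not taken from the corpus described here.

```python
from collections import Counter

# Hypothetical example sentence: "Budi membaca buku" ("Budi reads a book").
# The tag labels are Penn Treebank-style and purely illustrative.
tagged_sentence = [("Budi", "NNP"), ("membaca", "VB"), ("buku", "NN")]


def tag_counts(tagged_tokens):
    """Count how often each POS tag occurs in a list of (token, tag) pairs."""
    return Counter(tag for _, tag in tagged_tokens)


counts = tag_counts(tagged_sentence)
```

Keeping tokens and tags together as pairs makes it straightforward to compute tag frequencies or to feed the data to a statistical tagger later.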
TESTING AND REVISIONS OF TAGSET
Method
TAGSET
We first analyzed and compared POS tagsets from various previous works and consulted authoritative Indonesian grammar references [4][5].
In particular, the two POS tagsets that we base our design on are from Adriani et al. and Larasati et al. [2][6].
Larasati et al. do not define a POS tagset per se, but instead develop an Indonesian morphological analyzer that uses 19 lemma tags.
Tagset requirement
Linguistically valuable
The tagset should usefully characterize the morphological and syntactic behaviour of lexical tokens as they appear in the context of a text.
Simplicity
Given that we aim to manually tag a substantial corpus of roughly 250,000 words, we would like to keep the number of tags in the tagset small, so as to reduce the cognitive load on the annotators.
Automatically refinable
Our goal is to encapsulate the level of linguistic decision that requires human judgment, where any further refinement could plausibly be done by an automatic procedure sometime in the future.
For example, one could imagine a very broad single category of Noun, where every nominal word is assigned the same tag. However, distinguishing between, say, proper nouns and regular nouns would require human judgment.
Thus, we maintained separate tags for these two categories. On the other hand, one could imagine specifying separate tags for singular vs. plural nouns.
Given that plurality is clearly marked through the morphological process of reduplication, this distinction can be refined in the future through automatic analysis, e.g. using Indonesian morphological analyzers [8], and thus we do not maintain separate tags for it.
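The idea of automatic refinement can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes full reduplication is written with a hyphen (as in "buku-buku", "books"), and the function names and the refined label `NN-PL` are hypothetical. Not every reduplicated form is a plural (for example "kupu-kupu", "butterfly"), and affixed reduplication is not handled, so a real system would rely on a morphological analyzer rather than this surface check.

```python
def is_full_reduplication(token):
    """Return True for hyphenated full reduplication such as 'buku-buku'."""
    parts = token.lower().split("-")
    return len(parts) == 2 and parts[0] != "" and parts[0] == parts[1]


def refine_noun_tags(tagged_tokens):
    """Relabel reduplicated common nouns with a hypothetical plural tag.

    Only tokens already tagged 'NN' by a human annotator are considered;
    all other (token, tag) pairs are passed through unchanged.
    """
    refined = []
    for token, tag in tagged_tokens:
        if tag == "NN" and is_full_reduplication(token):
            refined.append((token, "NN-PL"))  # 'NN-PL' is an assumed label
        else:
            refined.append((token, tag))
    return refined


example = [("buku-buku", "NN"), ("buku", "NN"), ("Jakarta", "NNP")]
refined_example = refine_noun_tags(example)
```

The human-assigned tag stays the source of truth here; the automatic step only subdivides a category that annotators have already decided.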
Adriani et al. proposed two variants of tagsets, one with 37 tags and a reduced one with 25 tags, both of which are originally based on the Penn Treebank tagset [7], hence the use of the same abbreviations for the tag labels.
DATA DESCRIPTION
In developing our tagset and manually tagged corpus, we used the Indonesian sentences from the IDENTIC parallel corpus [9].
This corpus consists of 27,325 pairs of Indonesian and English sentences drawn from three sources: Penn Treebank sentences translated into Indonesian; newspaper articles on economy, international news, science, and sports from the PAN Localization project output; and movie subtitles.
Goals
Problem
There has been some previous work on Indonesian POS tagsets and automatic POS taggers [1][2][3], but to our knowledge the tagsets have only been tested on very small manually tagged corpora of around 15,000 words.
Objective
We present our efforts in designing a linguistically principled POS tagset for the Indonesian language, which involves a detailed study and analysis of existing tagsets and manual tagging of an Indonesian corpus of over a quarter of a million words.