Designing an Indonesian Part-of-Speech Tagset and Manually Tagged Indonesian Corpus
We present our efforts in designing a linguistically principled POS tagset for the Indonesian language, which involves a detailed study and analysis of existing tagsets and manual tagging of an Indonesian corpus of over a quarter of a million words.
There has been some previous work on Indonesian POS tagsets and automatic POS taggers, but to our knowledge the tagsets have only been tested on very small manually tagged corpora of around 15,000 words.
In developing our tagset and manually tagged corpus, we used the Indonesian sentences from the IDENTIC parallel corpus.
This corpus consists of 27,325 pairs of Indonesian and English sentences drawn from three sources: Penn Treebank sentences that were translated into Indonesian; newspaper articles on the economy, international news, science, and sports from the PAN Localization project output; and movie subtitles.
Our goal is to encapsulate the level of linguistic decision that requires human judgment, where any further refinement could plausibly be done by an automatic procedure sometime in the future.
For example, one could imagine a very broad single category of Noun, where every nominal word is assigned that same tag. However, distinguishing between, say, proper nouns and regular nouns would require human judgment.
Thus, we maintain separate tags for these two categories. On the other hand, one could imagine specifying separate tags for singular vs. plural nouns.
Given that plurality is clearly marked through the morphological process of reduplication, this distinction can be further refined in the future through automatic analysis, e.g. using Indonesian morphological analyzers, and thus we do not maintain separate tags for it.
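The reduplication-based refinement of the singular vs. plural distinction mentioned above could be approximated automatically along the following lines. This is only a minimal sketch, not the authors' procedure and not a real morphological analyzer: the tag label `NN`, the function names, and the hyphen-based pattern are assumptions made for illustration, and the heuristic only recognizes full reduplication written with a hyphen.

```python
import re

# Hypothetical tag label used for illustration; the actual tagset may differ.
NOUN_TAG = "NN"

# Full reduplication: the same base written twice, joined by a hyphen,
# e.g. "buku-buku" ("books"). Partial or affixed reduplication is not covered.
FULL_REDUPLICATION = re.compile(r"^(\w+)-\1$")


def is_fully_reduplicated(token: str) -> bool:
    """Return True if the token has the form base-base (case-insensitive)."""
    return FULL_REDUPLICATION.match(token.lower()) is not None


def refine_number(token: str, tag: str) -> str:
    """Add a number feature to a manually assigned noun tag.

    Returns "plural" for fully reduplicated nouns, "unmarked" for other
    nouns, and "n/a" for tokens that were not tagged as nouns.
    """
    if tag != NOUN_TAG:
        return "n/a"
    return "plural" if is_fully_reduplicated(token) else "unmarked"


if __name__ == "__main__":
    for token, tag in [("buku", "NN"), ("buku-buku", "NN"), ("pelan-pelan", "RB")]:
        print(token, tag, refine_number(token, tag))
```

Because the human-assigned tag already tells us the token is a noun, the automatic step only has to inspect the word form. A surface heuristic like this will still over-generate on lexicalized reduplications such as "kupu-kupu" ("butterfly"), which is one reason a proper morphological analyzer would be preferable in practice.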
Given that we aim to manually tag a substantial corpus of roughly 250,000 words, we would like to keep the number of tags in the tagset small, so as to reduce the cognitive load on the annotators.
The tagset should be able to characterize morphological and syntactic behaviour of lexical tokens appearing in the context of a text that would be useful
We first analyzed and compared POS tagsets from various previous works and consulted authoritative Indonesian grammar references.
In particular, the two POS tagsets on which we base our design are those of Adriani et al. and Larasati et al.
Adriani et al. proposed two tagset variants, one with 37 tags and a reduced one with 25 tags, both of which are based on the Penn Treebank tagset, hence the use of the same abbreviations for the tag labels.
Larasati et al. do not define a POS tagset per se, but instead develop an Indonesian morphological analyzer that uses 19 lemma tags.
TESTING AND REVISIONS OF TAGSET
The part of speech (POS) of a word, also known as its grammatical category, is an indicator of the syntactic and morphological behaviour of a particular lexical item.
Knowing whether a word is a noun, verb, adverb, or adjective can provide valuable linguistic insight, and thus POS tagging, the process of assigning a POS tag to a word, is a fundamental process that supports almost all NLP applications.
The results of this work are an Indonesian POS tagset consisting of 23 tags and an Indonesian corpus of over 250,000 lexical tokens that have been manually tagged using this tagset.
These results can be used for further work on Indonesian NLP, such as developing statistical and rule-based automatic POS taggers.
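As an illustration of how a manually tagged corpus can feed a statistical tagger, the sketch below trains a most-frequent-tag (unigram) baseline. It is a generic baseline and is not described in this work: the tag labels (`PRP`, `VB`, `NN`), the toy sentences, the function names, and the choice of `NN` as the fallback tag for unseen words are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it received most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_tokens(tokens, model, default_tag="NN"):
    """Tag tokens with the learned model; unseen words get default_tag."""
    return [(token, model.get(token.lower(), default_tag)) for token in tokens]


# Toy training data with hypothetical tag labels, standing in for the corpus.
toy_corpus = [
    [("saya", "PRP"), ("membaca", "VB"), ("buku", "NN")],
    [("dia", "PRP"), ("membeli", "VB"), ("buku", "NN")],
]

model = train_unigram_tagger(toy_corpus)
print(tag_tokens(["saya", "membeli", "buku", "baru"], model))
```

The unseen word "baru" falls back to the default tag here, which shows why such a baseline is normally only a starting point for contextual statistical or rule-based taggers.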