Designing an Indonesian Part of speech Tagset and Manually Tagged…

- - - - In particular, two POS tagsets that we base our design on are from Adriani et al. and Larasati et al. [2][6]
      - We first analyzed and compared POS tagsets from various previous works and consulted authoritative Indonesian grammar references [4][5]
      - Larasati et al. do not define a POS tagset per se, but instead develop an Indonesian morphological analyzer that uses 19 lemma tags.
      - Tagset requirement
        
        Linguistically valuable
        
        The tagset should capture useful morphological and syntactic behaviour of lexical tokens as they appear in the context of a text.
        
        Simplicity
        
        Given that we aim to manually tag a substantial corpus of roughly 250,000 words, we would like to keep the number of tags in the tagset small, so as to reduce the cognitive load on the annotators.
        
        Automatically refinable
        
        Our goal is to encapsulate the level of linguistic decision that requires human judgment, where any further refinement could plausibly be done by an automatic procedure sometime in the future.
        
        For example, one could imagine a very broad single category of Noun, where every nominal word is assigned that same tag. However, distinguishing between, say, proper nouns and regular nouns would require human judgment.
        
        Thus, we maintained separate tags for these two categories. On the other hand, one could imagine specifying separate tags for singular vs. plural nouns.
        
        Given that plurality is clearly marked through the morphological process of reduplication, this distinction can be refined in the future through automatic analysis, e.g. using Indonesian morphological analyzers [8], and thus we do not maintain separate tags for singular and plural nouns.
      - Adriani et al. proposed two variants of their tagset, one with 37 tags and a reduced one with 25 tags. Both are based on the Penn Treebank tagset [7], hence the use of the same abbreviations for the tag labels.
    - - In developing our tagset and manually tagged corpus, we used the Indonesian sentences from the IDENTIC parallel corpus [9].
      - This corpus consists of 27,325 pairs of Indonesian and English sentences drawn from three sources: Penn Treebank sentences translated into Indonesian; newspaper articles on the economy, international news, science, and sports from the PAN Localization project output; and movie subtitles.
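
The "Automatically refinable" requirement above argues that noun plurality need not be tagged by hand, because reduplication marks it and a later automatic pass could recover it. The following is a minimal sketch of such a pass, not the authors' procedure. The tag names `NN`, `NNP`, `NN-SG`, and `NN-PL` and the function `refine_noun_tags` are hypothetical, chosen only for illustration, and the sketch assumes plural nouns are written as hyphenated full reduplication (for example `buku-buku`, "books").

```python
import re

# Hypothetical tag names for illustration; they are not taken from the source tagset.
NOUN_TAG = "NN"
SINGULAR_TAG = "NN-SG"
PLURAL_TAG = "NN-PL"

# Hyphenated full reduplication: the same string on both sides of one hyphen.
FULL_REDUPLICATION = re.compile(r"^(\w+)-\1$")


def refine_noun_tags(tagged_tokens):
    """Split the broad noun tag into singular/plural using a reduplication heuristic.

    tagged_tokens is a list of (word, tag) pairs; all non-noun tags pass through unchanged.
    """
    refined = []
    for word, tag in tagged_tokens:
        if tag == NOUN_TAG:
            # Lower-case first so that "Buku-buku" is also recognized.
            is_reduplicated = FULL_REDUPLICATION.match(word.lower()) is not None
            tag = PLURAL_TAG if is_reduplicated else SINGULAR_TAG
        refined.append((word, tag))
    return refined


sample = [("buku-buku", "NN"), ("buku", "NN"), ("Jakarta", "NNP")]
print(refine_noun_tags(sample))
```

This surface heuristic is deliberately naive. Lexicalized reduplications such as `kupu-kupu` ("butterfly") are singular yet would be labeled plural, which is why the source points to a morphological analyzer [8] for this step rather than a pattern match.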