Part of Speech Tagging and Chunking with Conditional Random Fields
POS tagging is the task of assigning grammatical classes to the words in a natural language sentence.
Subsequent processing stages (such as parsing) become easier when the class of each word is available.
The sentence "He reckons the current account deficit will narrow to only # 1.8 billion in September ." can be divided into chunks as follows: [NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ] .
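The bracketed notation above can be turned into one tag per word, which is the form a sequence labeller works with. The sketch below is only an illustration: the B-/I-/O boundary scheme and the function name `bracket_to_tags` are assumptions, since the slides do not say which boundary values were used.

```python
def bracket_to_tags(text):
    """Convert '[NP He ] [VP reckons ] .' into (word, tag) pairs.

    Assumed scheme: B-<label> for the first word of a chunk,
    I-<label> for the remaining words, O for words outside any chunk.
    """
    pairs = []
    label = None    # label of the chunk currently open, if any
    first = False   # True until the first word of the open chunk is seen
    for token in text.split():
        if token.startswith("["):
            label, first = token[1:], True
        elif token == "]":
            label = None
        elif label is None:
            pairs.append((token, "O"))
        else:
            pairs.append((token, ("B-" if first else "I-") + label))
            first = False
    return pairs


if __name__ == "__main__":
    for word, tag in bracket_to_tags("[NP He ] [VP reckons ] [NP the current account deficit ] ."):
        print(word, tag)
```

The parser assumes chunks are not nested and that brackets are separated from words by spaces, as in the example sentence.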
Conditional Random Fields
To train the POS tagger, we use the Hindi morph analyzer to get the root word and the possible POS tags for every word in the corpus.
Along with the root word and the suggested POS tags, other information such as suffixes, a word-length indicator and the presence of special characters is added to the training data.
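A minimal sketch of how such a feature row might be assembled for one word. The root word and candidate tags are taken as inputs because they come from the morph analyzer; the function name `word_features`, the suffix lengths (1 to 3), the length threshold (more than 4 characters) and the flag names are all assumptions made for illustration, not values given in the slides.

```python
import unicodedata


def word_features(word, root, candidate_tags, max_suffix=3, long_threshold=4):
    """Build one row of feature columns for a word.

    root and candidate_tags are assumed to come from a morph analyzer.
    """
    # Suffixes of length 1..max_suffix; NULL when the word is too short.
    suffixes = [word[-n:] if len(word) >= n else "NULL"
                for n in range(1, max_suffix + 1)]
    length_flag = "LONG" if len(word) > long_threshold else "SHORT"
    # Letters (L), combining marks (M) and numbers (N) are treated as
    # ordinary; Devanagari vowel signs are marks, so str.isalnum() alone
    # would wrongly flag most Hindi words as containing special characters.
    has_special = any(unicodedata.category(ch)[0] not in "LMN" for ch in word)
    special_flag = "SPL" if has_special else "NOSPL"
    tags = "|".join(candidate_tags) or "NULL"
    return [word, root, tags, *suffixes, length_flag, special_flag]


if __name__ == "__main__":
    print(word_features("1.8", "1.8", ["QC"]))
```

Suffixes are taken over Unicode code points, so for Devanagari a "character" here may be a consonant or a vowel sign rather than a full syllable.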
The model is then trained with CRF++ (“Yet Another CRF toolkit”) on a detailed set of features and their combinations.
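CRF++ reads training data as one token per line with whitespace-separated columns, the answer tag in the last column, and a blank line between sentences. The sketch below writes data in that layout and builds a small feature template. The helper name `to_crfpp` and the template contents are illustrative assumptions; the authors' actual template is not given in the slides.

```python
def to_crfpp(sentences):
    """Serialise sentences into CRF++ column format.

    sentences: list of sentences, each a list of (feature_columns, tag),
    where feature_columns is a list of strings without whitespace.
    """
    lines = []
    for sentence in sentences:
        for columns, tag in sentence:
            lines.append("\t".join([*columns, tag]))  # tag goes last
        lines.append("")  # blank line ends the sentence
    return "\n".join(lines) + "\n"


# Illustrative template only: %x[row,col] picks the column `col` of the
# token `row` positions away from the current one.
TEMPLATE = "\n".join([
    "# Unigram features (assumed, not the authors' template)",
    "U00:%x[-1,0]",
    "U01:%x[0,0]",
    "U02:%x[1,0]",
    "U03:%x[0,1]",
    "U04:%x[0,0]/%x[0,1]",
    "# Bigram over adjacent output tags",
    "B",
]) + "\n"


if __name__ == "__main__":
    data = [[(["He", "PRP"], "B-NP"), (["reckons", "VBZ"], "B-VP")]]
    print(to_crfpp(data), end="")
    print(TEMPLATE, end="")
```

With the two strings saved to files (the names here are hypothetical), training and tagging would be run as `crf_learn template train.data model` and `crf_test -m model test.data`.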
Training for the chunker is done in two phases.
We first extract the chunk boundary and chunk label markers for each word in the corpus.
In the first phase, chunk tags (Chunk Boundary-Chunk Label, or B-L) are assigned to each word in the training data, and the model is trained to predict the corresponding B-L tag. Only the local context of words and their POS categories is used for training.
Next, the chunk label markers (L) are extracted from the B-L chunk tags and added to the training data alongside the words and the POS categories. In the second phase, we train the system on these features to predict the chunk boundary markers (B).
Finally, the chunk label markers (L) from the first phase and the chunk boundary markers (B) from the second phase are combined to obtain the chunk tag.
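The two-phase scheme can be sketched as follows. The trained models are represented by plain callables (`predict_bl` and `predict_boundary`, both hypothetical stand-ins), and the tag shape "B-NP" (boundary, hyphen, label) is an assumption based on the "Chunk Boundary-Chunk Label" description.

```python
def split_tag(chunk_tag):
    """Split 'B-NP' into ('B', 'NP'); a bare tag such as 'O' gives ('O', '')."""
    boundary, _, label = chunk_tag.partition("-")
    return boundary, label


def combine(boundary, label):
    """Join a boundary marker and a label back into one chunk tag."""
    return f"{boundary}-{label}" if label else boundary


def two_phase_chunk(words, pos_tags, predict_bl, predict_boundary):
    """Run both phases with stand-in predictors and merge their outputs."""
    # Phase 1: predict full B-L tags, but keep only the label part (L).
    labels = [split_tag(tag)[1] for tag in predict_bl(words, pos_tags)]
    # Phase 2: predict boundary markers (B) with the labels as an extra feature.
    boundaries = predict_boundary(words, pos_tags, labels)
    # Final chunk tag: B from phase 2 combined with L from phase 1.
    return [combine(b, l) for b, l in zip(boundaries, labels)]


if __name__ == "__main__":
    words = ["the", "current", "deficit"]
    pos = ["DT", "JJ", "NN"]
    phase1 = lambda w, p: ["B-NP", "B-NP", "I-NP"]   # dummy model output
    phase2 = lambda w, p, l: ["B", "I", "I"]         # dummy model output
    print(two_phase_chunk(words, pos, phase1, phase2))
```

In the dummy run, the first phase places a wrong boundary on the second word, and the second phase, which sees the labels, supplies the boundary that ends up in the final tag.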
Building a complete system for POS tagging and chunking for Hindi