Towards a Universal Grammar for NLP
General info
Goal
facilitate multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective
develop guidelines, called Universal Dependencies (UD), for cross-linguistically consistent treebank annotation across many languages
provide a lingua franca for grammatical annotation
mainly practical p.o.v.
not a linguistic theory, orientation towards surface syntax + some aspects of deep syntax
useful for the development of multilingual systems
not necessarily an optimal parsing representation - parsers might have to use different representations internally first
Motivation
such a system has not been created so far
there are either language-specific resources (not easily generalizable)
or general statistical models that can be applied to any language (do not capture the peculiarities of an individual language)
Major obstacle
annotation standards vary across languages due to descriptive grammatical traditions
Problems arise
applications have to have a specialised interface for each language
difficult to find out the reasons for performance differences -> difficult to evaluate system performance
cannot assume a consistent representation of linguistic categories and structures across languages -> makes statistical parsing hard
Also - need for treebanks since unsupervised parsing is not accurate
Annotated data for many languages are limited
Builds on earlier initiatives
Interset - multilingual tagset conversion
Google Universal POS Tags
HamleDT - Harmonized multi-language dependency treebank
Universal Dependency Treebanks
Universal Stanford Dependencies
Dependency-based representations
a simple and transparent encoding of predicate-argument structure
allow for efficient processing
Universal grammar
Roger Bacon: grammar is one and the same in all languages
Speculative grammars in Middle Ages (factor - the word)
Port-Royal grammar of Arnauld and Lancelot (factor - human mind)
theories of Noam Chomsky (factor - innate language faculty)
assumption that all human languages are species of a common genus, since they have all been shaped by a factor common to all humans
Design principles
basic structure: sentences are segmented into words, words are described by morphological properties and linked by syntactic relations
basic unit of analysis - syntactic word
-> clitics are treated separately, contractions are split
morphological description - 3 parts: lemma/base form, universal POS tag, morphological features (attribute - value pairs)
Tagset - revised Google Universal POS Tagset
morphological attributes and values - based on Interset
Dependency relations - 40 universal relations, based on the Universal Stanford Dependencies
Principle of maximising parallelism
Don’t annotate the same thing in different ways!
Don’t make different things look the same!
Don’t annotate things that are not there!
language-specific extensions
in the morphological features - can select a subset of universal features and add language-specific features
in syntactic relations - define language-specific subtypes of the universal relations
Word segmentation
A syntactic word can be assigned a single consistent morphological description with a unique lemma, part-of-speech tag and morphological feature set, as well as a single syntactic function in relation to other words of the sentence
multiword
certain fixed multiword expressions should be treated as single words in the analysis
but they are annotated using special dependency relations, instead of collapsing multiple tokens into one
Documentation should provide the principles of word segmentation for each language
Morphological annotation: 3 components
lemma
represents the semantic content of the word
determined by language-specific dictionaries
POS tag
represents the abstract lexical category of the word
universal inventory of 17 tags from the revised Google Universal POS tagset
The tagset must be used in all treebanks
not all tags have to be used in every language
extensions are not allowed
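A fixed tag inventory with no extensions can be checked mechanically. The following is a minimal sketch, assuming the 17-tag list of UD version 1 (version 2 renames CONJ to CCONJ); the function name `invalid_upos` is hypothetical and not part of any UD tool.

```python
# The 17 universal POS tags as listed in UD version 1 (assumption:
# UD version 2 uses CCONJ in place of CONJ).
UPOS_TAGS = frozenset({
    "ADJ", "ADP", "ADV", "AUX", "CONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
})


def invalid_upos(tags):
    """Return the tags that are not in the universal inventory, in input order."""
    return [tag for tag in tags if tag not in UPOS_TAGS]
```

A treebank that uses only a subset of the tags passes this check, while any language-specific tag such as `NN` is reported.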
set of features
represents lexical and grammatical properties associated with the lemma or the particular word form
format: Name=Value
a word can have any number of features separated by a vertical bar
there is a universal inventory, but it is not exhaustive
language-specific features can be added
layered features for the cases when the feature is marked more than once on the same word
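The Name=Value format with vertical-bar separators can be read with a few lines of code. This is a sketch under stated assumptions: an underscore stands for "no features" (the CoNLL-U convention), and a layered feature is written with the layer in square brackets, as in `Gender[psor]=Masc`. The helper names `parse_feats` and `split_layer` are hypothetical.

```python
def parse_feats(feats):
    """Parse 'Name=Value|Name=Value' into a dict.

    Assumption: '_' or an empty string means the word has no features.
    """
    if feats in ("", "_"):
        return {}
    result = {}
    for pair in feats.split("|"):
        name, sep, value = pair.partition("=")
        if not sep or not name or not value:
            raise ValueError(f"malformed feature: {pair!r}")
        result[name] = value
    return result


def split_layer(name):
    """Split a layered name such as 'Gender[psor]' into ('Gender', 'psor').

    Names without a layer are returned as (name, None).
    """
    if name.endswith("]") and "[" in name:
        base, layer = name[:-1].split("[", 1)
        return base, layer
    return name, None
```

For example, `parse_feats("Case=Nom|Number=Sing")` gives a dict with the keys `Case` and `Number`, and `split_layer` separates the layer so that two markings of the same feature on one word stay distinct.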
Syntactic annotation
consists of typed dependency relations between words, with a special relation root for words that do not depend on any other word
for each sentence, basic dependencies form a rooted tree to represent the backbone of the syntactic structure
enhanced dependencies in a general directed graph (e.g. secondary predication, control structures)
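The rooted-tree requirement on basic dependencies can be tested directly. The sketch below assumes words are numbered from 1 and that head index 0 marks the root, which is the usual CoNLL-style encoding and not something stated in these notes; `is_rooted_tree` is a hypothetical helper. Enhanced dependencies form a general graph and would not pass this check by design.

```python
def is_rooted_tree(heads):
    """Check that basic dependencies form a rooted tree.

    heads[i] is the head of word i + 1; 0 marks the root (assumed encoding).
    """
    n = len(heads)
    if n == 0 or any(h < 0 or h > n for h in heads):
        return False
    # Exactly one word may attach to the artificial root.
    if heads.count(0) != 1:
        return False
    # Every word must reach the root without revisiting a word (no cycles).
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True
```

The walk from each word is quadratic in the worst case, which is acceptable for sentence-length inputs and keeps the check easy to read.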
relations are meant to capture a set of broadly observed grammatical functions that work across languages
relations hold primarily between content words
function words attach as direct dependents of the most closely related content word (e.g. adpositions as dependents of nouns, auxiliary verbs as dependents of main predicates)
punctuation marks attach to the head of the clause or phrase to which they belong
maximizes parallelism between languages because content words vary less across languages than function words do
40 syntactic relations
the main principle - distinction between three types of dependents:
Nominal phrases, primarily denoting entities but also used for other things
Clauses, headed by a predicate (verb, adjective, adverb, or a predicate nominal)
Miscellaneous, other kinds of modifier words, which may allow modifiers but which do not expand into rich structures like nominal phrases and clauses
head - a clausal predicate or a nominal, root has no head
also distinguishes between core arguments (subjects, objects, clausal complements) and other dependents, but does not distinguish adjuncts from oblique arguments
root - used for independent words, usually the predicate of a main clause
also: relations that occur with any type of head, e.g. lexical relations (compounding, fixed multiword expressions, names) and coordination + other (list, foreign etc.)
possible to add language-specific subtypes
form uni:spec, where uni is one of the 40 universal relations, and spec is a descriptive label
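Because a subtype always keeps its universal relation as a prefix, a tool can fall back to the universal label when it does not know the subtype. A minimal sketch, with the hypothetical helper name `split_relation` and `acl:relcl` used only as an illustrative label:

```python
def split_relation(label):
    """Split 'uni:spec' into (uni, spec); spec is None for a plain universal label."""
    universal, _, subtype = label.partition(":")
    if not universal:
        raise ValueError(f"missing universal relation in {label!r}")
    return universal, subtype or None
```

Comparing only the first element of the result gives a cross-lingually comparable label, while the second preserves the language-specific detail.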
function words normally do not have dependents
multiple function words are related as siblings
they modify the syntactic category of their head
Exceptions
Multiword function words: will superficially look like a function word with dependents (e.g. in spite of, because of, by and large)
Coordinated function words: the first conjunct will then be the head of both the conjunction and the second conjunct (e.g. to and from, if and when)
Promotion by head elision: when the natural head of a function word is elided, the function word will be “promoted” to the function normally assumed by the content word head (e.g. Bill could not answer, but Ann could. I know how.)
Also: certain types of function words can take a restricted class of modifiers, mainly negation and light adverbials (e.g. not every, exactly two, right then)