Text Mining
Annotation Formats
Understanding documents
Documents rarely have a simple structure - they contain columns, pictures, advertisements, figures
They are meant to be human-readable, but are not easy for automated systems to process.
Boundary Notation
BIO: B-Begin, I-Inside, O-Outside
Limitations - cannot handle hierarchical or structured annotations e.g. nested entities, relations, events
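As a minimal sketch of how BIO tags encode entity boundaries, the Python below decodes a tag sequence into spans. The tag labels (PER, LOC), the sample sentence and the helper name `bio_to_spans` are illustrative assumptions, not part of any particular toolkit.

```python
def bio_to_spans(tokens, tags):
    """Collect (label, start, end, text) spans from BIO tags; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An open entity continues only on an I- tag carrying the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i, " ".join(tokens[start:i])))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Stray I- without a preceding B-: treated here as opening an entity.
            start, label = i, tag[2:]
    if start is not None:
        spans.append((label, start, len(tags), " ".join(tokens[start:])))
    return spans


tokens = ["Marie", "Curie", "worked", "in", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Each token carries exactly one tag, so a token can belong to at most one entity. That is why nested entities, relations and events cannot be expressed in this notation.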
Stand-off annotations
Strengths
Can handle structured, nested and overlapping annotations
DSV (delimiter-separated values)
JSON
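A stand-off annotation can be sketched as follows: the text is left untouched and the annotations refer to it by character offsets, which is what allows nesting and relations. The field names (`id`, `type`, `start`, `end`, `args`), the labels and the sample sentence are assumptions made for this example, not a standard schema.

```python
import json

text = "The University of Manchester is located in England."

# Annotations are stored apart from the text and point into it by
# character offsets (end is exclusive). Field names are hypothetical.
annotations = {
    "entities": [
        {"id": "T1", "type": "ORG", "start": 4, "end": 28},
        {"id": "T2", "type": "LOC", "start": 18, "end": 28},  # nested inside T1
        {"id": "T3", "type": "LOC", "start": 43, "end": 50},
    ],
    "relations": [
        {"id": "R1", "type": "LOCATED_IN", "args": ["T1", "T3"]},
    ],
}


def covered_text(entity):
    """Return the stretch of text an entity annotation points at."""
    return text[entity["start"]:entity["end"]]


for entity in annotations["entities"]:
    print(entity["id"], entity["type"], covered_text(entity))

# The same entities serialised as JSON and as tab-separated (DSV) lines.
serialised = json.dumps(annotations, indent=2)
dsv_lines = [
    "\t".join([e["id"], e["type"], str(e["start"]), str(e["end"])])
    for e in annotations["entities"]
]
```

Here T2 sits inside T1 and R1 links two entities by their identifiers, neither of which a one-tag-per-token scheme could represent.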
Tokenisation
The Task
Break input, usually sentences, into tokens
We also need to decide how to encode other types of tokens, such as contractions and compounds or multi-word expressions
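A minimal regex-based sketch of the task, assuming English input and one possible convention: hyphenated compounds stay together as a single token, and contractions are split into a stem and a clitic. The pattern and the function name `tokenise` are illustrative, and other conventions are equally valid.

```python
import re

TOKEN_PATTERN = re.compile(
    r"""
    \w+(?:-\w+)+        # hyphenated compounds kept as one token
    | \w+(?=n't\b)      # stem before the negation clitic (do + n't)
    | n't\b             # the negation clitic itself
    | '\w+              # other clitics such as 's and 're
    | \w+               # plain words and numbers
    | [^\w\s]           # any single punctuation character
    """,
    re.VERBOSE,
)


def tokenise(sentence):
    """Split a sentence into tokens using the alternatives above, in order."""
    return TOKEN_PATTERN.findall(sentence)


tokens = tokenise("We don't like state-of-the-art OCR's output.")
print(tokens)
```

The alternatives are tried left to right, so the more specific cases (compounds, contractions) must come before the general word pattern.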
Challenges
ASCII-fication or romanisation of texts, including transliterations (which may not match the correct text)
Results from Optical Character Recognition (OCR) may be poor. Preprocessing is required to correct errors, but some errors are hard to detect because they look like correct text
Sentence Segmentation
The Task
It is not enough to detect the full stop: while it is an end-of-sentence (EOS) marker, it can also be an end-of-abbreviation marker.
Challenges
Variation in delimiters, or EOS characters. These typically include ".", "!" and "?", but what about ";", "," and "-"?
Domain dependence - in domains such as biology, protein names can start with lowercase characters, which may confuse sentence segmentation approaches.
Approaches
Hand-crafted rules (e.g. check whether the word following an EOS delimiter starts with an uppercase letter)
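The uppercase rule can be sketched as below, combined with a small abbreviation list so that a full stop ending an abbreviation is not taken as a sentence boundary. The abbreviation list, the function name `split_sentences` and the sample text are assumptions for illustration, and the list is far from complete.

```python
import re

# Illustrative abbreviation list (lowercased); a real one would be much longer.
ABBREVIATIONS = {"e.g.", "i.e.", "dr.", "prof.", "etc.", "fig."}

# A candidate boundary: an EOS delimiter followed by whitespace.
CANDIDATE = re.compile(r"[.!?]\s+")


def split_sentences(text):
    sentences, start = [], 0
    for match in CANDIDATE.finditer(text):
        boundary = match.start() + 1  # position just after the delimiter
        words_before = text[start:boundary].split()
        last_word = words_before[-1].lower() if words_before else ""
        next_char = text[match.end():match.end() + 1]
        if last_word in ABBREVIATIONS:
            continue  # the full stop ends an abbreviation, not a sentence
        if not next_char.isupper():
            continue  # the following word does not start with uppercase
        sentences.append(text[start:boundary].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = (
    "Dr. Smith studied p53. She published the results, e.g. in Nature. "
    "p53 regulates the cell cycle. It is well known!"
)
sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```

On this sample the rules handle "Dr." and "e.g." correctly but miss the boundary before the second "p53", because the protein name starts with a lowercase character. This is the domain dependence problem listed under Challenges.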
Segmenters
OpenNLP detects whether or not a punctuation character marks the end of a sentence, defining a sentence as the longest whitespace-trimmed character sequence between two punctuation marks.
spaCy provides a statistical sentence segmenter (senter) and a rule-based component (sentencizer), either of which can be put into a pipeline.
Distributional Semantics
Pros
The meaning of a word is represented by a vector (called a word vector), therefore
When words are represented as vectors, the higher the value of the cosine, the smaller the angle between the vectors
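The cosine between two word vectors can be computed as the dot product divided by the product of the vector lengths. The sketch below uses toy three-dimensional vectors whose values are invented for illustration; real word vectors have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy co-occurrence counts, made up for this example.
cat = [4, 3, 0]
dog = [3, 4, 0]
car = [0, 1, 5]

sim_cat_dog = cosine_similarity(cat, dog)
sim_cat_car = cosine_similarity(cat, car)
print(sim_cat_dog, sim_cat_car)
```

With these toy values the cosine for cat and dog is higher than for cat and car, meaning the angle between the first pair of vectors is smaller.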