Authorship Attribution
Stylometric Features - measurable properties of a text that can be used to characterize the text
Syntactic
Patterns used to form sentences
relative frequencies of different syntactic constructions
e.g. the frequencies of short sequences of parts-of-speech (POS), or combinations of POS and other classes of words
language-dependent because they require NLP tools; the features produce noisy data due to parser errors
approaches
frequencies of rewrite rules
combines both the syntactic class of each word and the info about how the words are combined into phrases
require an accurate fully-automated parser
detect sentence and phrase boundaries and use phrase type counts and phrase type lengths as features (e.g. noun phrase count, VP length etc.)
Analysis-level measures - specific not only to the language but also to the tool used: the tool analyses the text in several steps, from simple cases up to the combination of outcomes that produces complex results. The feature is the percentage of text that each step of the tool was able to analyse
ordered streams of syntactic labels extracted from a partial parser (e.g. NP consisting of DET, ADJ, Noun) and use bigram frequencies
POS tag frequencies and POS tag n-gram frequencies
analysis of syntactic errors - with a spell-checker
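A minimal sketch (not from the source) of extracting POS-tag bigram frequencies, one of the syntactic feature types listed above; it assumes NLTK with the 'punkt' and 'averaged_perceptron_tagger' models installed:

```python
from collections import Counter
import nltk

def pos_bigram_features(text, top_k=50):
    """Relative frequencies of POS-tag bigrams (illustrative sketch)."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    bigrams = Counter(zip(tags, tags[1:]))          # count adjacent tag pairs
    total = sum(bigrams.values()) or 1
    return {bg: count / total for bg, count in bigrams.most_common(top_k)}
```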
Structural
concern the structure of a document
e.g. how sentences are organized within paragraphs and paragraphs within documents, presence/absence of greetings and farewell remarks and their position within the e-mail body
Lexical
text is viewed as a sequence of tokens (word, number, punctuation mark)
e.g. sentence length, word length, spelling errors (letter omissions and insertions), formatting errors (all caps)
Advantage - can be applied to any corpus and any language, only need a tokenizer
Vocabulary richness functions
- quantify the diversity of the vocabulary of a text. BUT - are dependent on text length -> unreliable to use alone
E.g. TTR (type-token ratio), Yule's K measure (a measure of vocabulary repetitiveness; the smaller the measure, the richer the vocabulary), number of hapax legomena (words occurring once), number of hapax dislegomena (words occurring twice) - see the sketch below
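A minimal sketch of how these vocabulary richness measures can be computed; the token list in the example is illustrative only:

```python
from collections import Counter

def vocabulary_richness(tokens):
    """TTR, Yule's K, hapax legomena and dislegomena for a token list."""
    n = len(tokens)
    freqs = Counter(tokens)
    ttr = len(freqs) / n                                 # type-token ratio
    hapax = sum(1 for c in freqs.values() if c == 1)     # words occurring once
    dis = sum(1 for c in freqs.values() if c == 2)       # words occurring twice
    # Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2, lower K = richer vocabulary
    spectrum = Counter(freqs.values())                   # V_i: types with frequency i
    k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)
    return {"TTR": ttr, "YuleK": k, "hapax": hapax, "dislegomena": dis}

print(vocabulary_richness("the cat sat on the mat and the dog sat too".split()))
```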
two approaches
bag of words
representing text by vectors of word frequencies disregarding contextual information
the most common words are function words; they are reliable for AA because they are topic-independent and used unconsciously (minimising the risk of being deceived); the precise choice of the FW list is not that important
simple and successful - use the most frequent words in the corpus as features
the size of the set can vary: 100, 250, even any word that has been used at least twice
first dozens are function words, then some content words as well
another approach - represent words in an abstract form: combine frequency, length, and some characters from the token
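A hedged sketch of the most-frequent-words / function-word representation described above; the vocabulary size of 100 is just one of the values mentioned, and whitespace tokenisation is a simplification:

```python
from collections import Counter

def most_frequent_word_features(train_texts, test_text, vocab_size=100):
    """Bag-of-words over the most frequent corpus words (mostly function words),
    represented as relative frequencies."""
    corpus_counts = Counter(w for t in train_texts for w in t.lower().split())
    vocab = [w for w, _ in corpus_counts.most_common(vocab_size)]

    def vectorise(text):
        tokens = text.lower().split()
        counts = Counter(tokens)
        total = max(len(tokens), 1)
        return [counts[w] / total for w in vocab]

    return vocab, [vectorise(t) for t in train_texts], vectorise(test_text)
```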
n-grams
advantage - take the context into account
the results are not always better than with the bag-of-words features
Disadvantages: dimensionality increases considerably, features become very sparse (capture little info), and it is probable that they capture content info rather than style
Content-specific
Frequency of content specific keywords
Character features
counts of individual alphabetic characters, frequency of special characters, average number of characters per word or sentence, capital letter proportion, letter frequencies, punctuation mark counts
text is viewed as a sequence of characters
Character n-grams - various frequencies
might be useful for capturing lexical, grammatical and orthographic preferences
work well in different languages and different areas of attribution + useful for profiling
But: many n-grams are associated with particular content words and roots
dimensionality is considerably increased, capture redundant info, many n-grams are needed to represent a long word
important to define n: a large n captures context and content info + thematic information, but it increases the dimensionality; a small n represents syllable-like info, but not contextual info.
selection of n is language-dependent (larger ones preferable for languages with longer words, such as German), for English - up to 4-grams
tolerant to noise (e.g. spelling errors), good solution for oriental languages where tokenisation is not trivial
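A small illustrative sketch of character n-gram profiles (4-grams, the usual upper bound for English per the note above); the top_k cut-off is an assumption to limit dimensionality:

```python
from collections import Counter

def char_ngram_profile(text, n=4, top_k=1000):
    """Relative frequencies of the top_k most frequent character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(top_k)}
```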
compression-based approaches
can be said to use this type of representation since they describe the characteristics of text based on repetitions of character sequences
Complexity measures
e.g. average word length (or more generally, word length distribution) in terms of syllables or letters and average number of words in sentence
when these proved inadequate - TTR and other vocabulary richness features were proposed
Content words
patterns of lexical choice, e.g. large vs big
modeling the relative frequencies of content words
Sequences and collocations of content words also can be useful
Other
morphological analysis for languages with rich morphology (e.g. Greek, Hebrew)
Frequencies of punctuation habits or orthographic / syntactic errors and idiosyncrasies. Problem - availability of accurate spell-checkers is problematic
for HTML documents: font colour counts and font size counts
~ 1000 different measures proposed so far
Semantic
Use of semantic dependency graphs
represent binary semantic features (number and person of nouns, tense and aspect of verbs etc) and semantic modification relations (relations btw node and its daughters)
Based on WordNet
info about synonyms and hypernyms, causal verbs
Also: latent semantic analysis of lexical features to detect similarities btw words
Based on a set of functional features
certain words or phrases are associated with semantic information, e.g. conjunction, elaboration, enhancement
Feature selection and extraction
are applied to reduce the dimensionality and avoid overfitting on the training data
criteria for selecting features
most important - frequency: the more frequent a feature, the more stylistic variation it captures
the instability of features: instability reflects the availability of synonyms or alternative forms for a certain characteristic; unstable features are more likely to indicate stylistic choices of the author
Methods / Techniques
based on how the training texts are handled
profile-based approaches
(concatenating all texts per author)
probabilistic models
based on Bayes' theorem, with the assumption that the occurrences of the features are mutually independent
the conditional probabilities are estimated from the concatenation x_a of all available training texts of author a and from the concatenation of all the remaining texts, respectively
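A hedged sketch of a profile-based Naive Bayes attribution model under the independence assumption; Laplace smoothing and whitespace tokenisation are illustrative simplifications:

```python
import math
from collections import Counter

def naive_bayes_profile(train_texts_by_author, unknown, alpha=1.0):
    """Concatenate each author's training texts into one profile, estimate
    smoothed word probabilities, and score the unknown text."""
    profiles = {a: Counter(" ".join(texts).lower().split())
                for a, texts in train_texts_by_author.items()}
    vocab = set(w for p in profiles.values() for w in p)
    scores = {}
    for a, counts in profiles.items():
        total = sum(counts.values())
        scores[a] = sum(
            math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
            for w in unknown.lower().split()
        )
    return max(scores, key=scores.get)   # most probable author
```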
compression models
all texts per author are concatenated and compressed. Then the unseen text is appended to each author file and compressed again. The difference in compressed size indicates the similarity of the unseen text to the other texts by that author. This effectively calculates the cross-entropy
RAR compression algorithm is the most accurate
Normalized Compression Distance (NCD)
- assesses the similarity between a pair of documents by measuring the improvement achieved by compressing an information-rich document using the information found in the other document
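A minimal NCD sketch; zlib is used here only for simplicity, even though the notes report RAR/PPM-style compressors as more accurate:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def attribute_by_compression(unknown: str, author_profiles: dict) -> str:
    """Attribute the unknown text to the author whose concatenated training
    texts compress it best (lowest NCD)."""
    u = unknown.encode()
    return min(author_profiles, key=lambda a: ncd(author_profiles[a].encode(), u))
```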
simple training process, just the extraction of profiles for the candidate authors
Common n-grams (CNG) model
the profile is composed by the most common character n-grams of that text
the method computes the dissimilarity btw two profiles by calculating the relative difference btw their common n-grams
disadvantage - the CNG distance function fails when the training corpus is imbalanced (it favours shorter profile texts)
Solution - Simplified profile intersection - counts the number of common n-grams of the two profiles (only problematic when one author has more text than the others; it will favour that author)
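An illustrative sketch of a CNG-style dissimilarity (in the spirit of Kešelj et al.'s relative-difference formula; the exact variant is an assumption, since the notes only describe it in words) and of the simplified profile intersection; profiles are dicts of n-gram relative frequencies:

```python
def cng_dissimilarity(p1, p2):
    """Sum of squared relative frequency differences over both profiles
    (assumes profile frequencies are strictly positive)."""
    grams = set(p1) | set(p2)
    return sum(
        (2 * (p1.get(g, 0.0) - p2.get(g, 0.0)) / (p1.get(g, 0.0) + p2.get(g, 0.0))) ** 2
        for g in grams
    )

def simplified_profile_intersection(p1, p2):
    """Simplified Profile Intersection: count the shared n-grams."""
    return len(set(p1) & set(p2))
```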
may produce more reliable representations for short texts, does not require a segmentation of a long text
instance-based approaches
(texts are represented individually as a separate instance of authorial style)
machine learning classifiers and clustering algorithms
training texts are represented as labeled numerical vectors and learning methods are used to find boundaries between classes (authors) that minimize some classification loss function
e.g. k-means, decision trees, SVM, back-propagation neural networks
each text is a vector in a multivariate space
SVMs avoid overfitting even with a large feature set -> one of the best solutions
effectiveness is diminished when classes are imbalanced
re-balance the training set by segmenting texts
text re-sampling, i.e. using some text parts more than once
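A short sketch of the instance-based setup with scikit-learn (assumed available); the texts, labels and the character 3-gram choice are placeholders, not the source's exact configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: each training text is labelled with its author.
train_texts = ["...text by author A...", "...text by author B...", "...more text by A..."]
train_authors = ["A", "B", "A"]

# Character 3-grams as features; a linear SVM handles the resulting
# high-dimensional, sparse vectors well.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 3)),
    LinearSVC(),
)
clf.fit(train_texts, train_authors)
print(clf.predict(["...anonymous text..."]))
```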
inter-textual distance / similarity-based models
if the vocabulary used in two texts is similar, the texts are closer and it is possible that they were written by the same person
Delta measure, chi-square distance and Kullback-Leibler divergence
calculation of pairwise similarity measures btw unseen text and all the training texts, then using nearest-neighbour algorithm to find the most likely author
Burrows delta - calculates the z-distributions of a set of function words, then calculates deviations from the norm for each document. Then the difference btw a set of training texts by one author and the z-score of the test text is calculated. The smaller the delta, the greater the similarity (sketch below)
training text length is important, performance decreases with decreasing text size
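A hedged sketch of Burrows' Delta as described above (z-scores of frequent-word relative frequencies, mean absolute difference); the input format and the reference-corpus handling are assumptions:

```python
import statistics

def burrows_delta(author_freqs, test_freqs, corpus_freqs):
    """author_freqs / test_freqs: relative frequency dicts for the author
    profile and the unseen text; corpus_freqs: list of per-document frequency
    dicts used to estimate each word's mean and standard deviation.
    Assumes all corpus documents share the same (non-empty) feature set."""
    features = set(corpus_freqs[0])
    delta = 0.0
    for f in features:
        values = [doc.get(f, 0.0) for doc in corpus_freqs]
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values) or 1e-9      # avoid division by zero
        z_author = (author_freqs.get(f, 0.0) - mu) / sigma
        z_test = (test_freqs.get(f, 0.0) - mu) / sigma
        delta += abs(z_author - z_test)
    return delta / len(features)                       # smaller = more similar
```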
Meta-learning models - unmasking (for longer texts)
can handle high-dimensional sparse data with machine learning algorithms, combine different types of features + can handle structural features
Hybrid approaches
the training texts were represented separately, but then the vectors were averaged to produce an author profile
the distance btw the test text and author profile was calculated by a weighted feature-wise function with parameters for the difference between the feature values of the unseen text profile and the author profile, the feature importance for the unseen text, and the feature importance for the particular author
based on the number of features used
Unitary Invariant Approach
(traditional human expert-based methods)
a single numeric function of a text is sought to discriminate between authors; this approach has proven to be unreliable
the writing of each author could be characterized by a unique curve expressing the relationship between word length and relative frequency of occurrence
Mendenhall (1887) - the authorship of texts attributed to Bacon, Marlowe, and Shakespeare; Mascol (1888a, 1888b) - the authorship of the gospels of the New Testament
early 20th century - the search for invariant properties of textual statistics (Zipf, 1932)
Yule (1944) considered sentence length as a potential method for authorship discrimination
Multivariate Analysis Approach
statistical multivariate discriminant analysis
is applied to word frequencies and related numerical features
Mosteller and Wallace's work (1964) on the authorship of the Federalist Papers: new approach - Naive Bayes with frequencies of a set of function words as features
taking documents as points in some space, and assigning a questioned document to the author whose documents are “closest” to it, according to an appropriate distance measure
Burrows Delta is a reliable distance measure: documents are represented by a profile of the relative frequencies of the most frequent words, a profile is a feature vector, and the Burrows delta corresponds to the Manhattan distance btw the vectors; also reliable - Cosine delta, based on the cosine similarity
other methods - PCA, ANOVA, a probabilistic distance measure such as K-L divergence between Markov model probability distributions of the texts
Machine Learning Approach
modern machine learning methods are applied to sets of training documents to
construct classifiers
that can be applied to new anonymous documents
Training texts are represented as labeled numerical vectors, and learning methods are used to find boundaries between classes (i.e., authors) that minimize some classification loss function
neural networks, k-nearest neighbour, Naive Bayes, SVM, Bayesian regression
Evaluation
Until the late 1990s objective evaluation was hard to perform
testing ground - literary works with disputed authorship
texts too long and not stylistically homogeneous, covering different topics
small sets of candidate authors
inspection was mainly intuitive, based on visual inspection of scatterplots
lack of suitable benchmark data -> comparison of methods not possible
After 1990s
large volume of electronic texts became available
Information Retrieval research provided methods for representing and classifying large volume of texts
machine-learning algorithms that can handle multi-dimensional and sparse data
standard methodologies for comparing the performance of algorithms were devised
NLP tools for analysing and representing the style (e.g. syntax)
Also: factors influencing the performance of different methods are investigated (training text size, number of candidate authors, distribution of training texts over the candidate authors)
General info
definition
the process of examining the characteristics of a piece of work in order to draw conclusions on its authorship
First studies
19th cent - studies of Shakespeare's plays (1887)
most influential - 1964, study of the Federalist Papers: 85 papers published anonymously to convince New Yorkers to ratify the American Constitution; the authorship of 12 papers is heavily contested
1st half of 20th cent - statistical studies
Application areas
literary texts, program codes, online messages, blogs, forensic analysis, intelligence, civil law (copyright disputes)
Parameters: training and test corpus size, length of texts, number of candidate authors, distribution of the training corpus over the authors; author age, nationality, gender; texts should be written in the same period and cover comparable topics and genres.
Types of tasks
Authorship profiling or characterization
determines the author’s profile (gender, age, occupation, educational background, cultural background and language familiarity)
Gender
Content word features are a good predictor
Differ in the use of determiners and prepositions (male), pronouns (female) + in content: male writers use words related to technology, female writers - words related to personal life and relationships
Age
The best performance - combination of style and content features (over 77%)
Style: young writers omit apostrophes, older writers use more determiners and prepositions; Content: teenagers - topics related to school and mood, topics related to work and social life for writers in their 20s, and to family life for those in their 30s
Native Language
Content features give the best results (82%)
Measuring errors and idiosyncrasies is useful
The strongest features are those that are underrepresented, e.g. omissions of definite articles by native speakers of Slavic languages
greater frequency of particular words common in the native language, e.g. indeed for French, however for Bulgarian
Personality (Neuroticism)
Style features perform the best, content features can make the performance worse
Neurotics tend to refer to themselves and use pronouns for subjects rather than objects; non-neurotics tend to be less precise and less concrete, and refer to how things should be done
Authorship Verification or Similarity detection
- just one suspect, compares multiple pieces of work and determines whether or not they are produced by a single author without necessarily identifying the author (e.g. plagiarism detection)
long-text verification problem
- whether two long texts are by the same author
Koppel et al 2012 - unmasking method, a technique for measuring the depth of differences btw two documents
to remove, by stages, the most useful features and to gauge the speed with which cross-validation accuracy degrades after each iteration
for the true author the degradation will be sudden and dramatic, for the other authors it will be slow and smooth
doesn't work for short texts
this happens because the works of one author are different in a small number of features due to thematic, genre or purpose differences, chronological stylistic drift or deliberate attempts by the author to mask their identity
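A rough sketch of the unmasking loop (assumes scikit-learn/NumPy, at least a handful of chunks per document, and enough features to survive all iterations); the number of iterations and features removed per step are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def unmasking_curve(X_a, X_b, iterations=10, remove_per_iter=3):
    """Repeatedly separate chunks of two documents with a linear classifier,
    drop the most discriminating features, and record how quickly the
    cross-validation accuracy degrades (a fast, dramatic drop suggests a
    shared author). X_a, X_b: 2-D arrays of chunk feature vectors."""
    X = np.vstack([X_a, X_b])
    y = np.array([0] * len(X_a) + [1] * len(X_b))
    active = np.arange(X.shape[1])
    curve = []
    for _ in range(iterations):
        curve.append(cross_val_score(LinearSVC(), X[:, active], y, cv=5).mean())
        clf = LinearSVC().fit(X[:, active], y)
        # drop the currently most discriminating features
        top = np.argsort(np.abs(clf.coef_[0]))[::-1][:remove_per_iter]
        active = np.delete(active, top)
    return curve
```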
naive approaches
impostors method
assemble a collection of works by other authors, use a 2-class learner (e.g. SVM), A vs not-A. Then chunk the available anonymous text and run the chunks through the model
if most chunks are classified as A, then it is the author
BUT: it is reasonable to discard the authorship of A if most chunks are classified as not-A, but the reverse is not true. Any author whose style is closer to A than to not-A will be falsely classified as A.
based on cross-validation accuracy
doesn't work well
train a machine learning algorithm and then perform cross-validation. If the accuracy is high, authors are different, if low (i.e. not better than chance) - the same author.
fundamental problem of authorship attribution
- whether two short texts are by the same author
Koppel et al 2012: generate a set of impostors, then use the method for short-text AA with many candidates: use a similarity-based approach and measure similarity with a randomly selected subset of features for 100 trials; only if the author of the other text is returned as the most probable author in at least 11 trials out of 100 can we assume they are the real author (sketch below)
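A hedged sketch of the impostors-style verification with feature randomisation; the cosine measure, subset size and the 11/100 threshold follow the description above, everything else (vector format, sampling fraction) is illustrative:

```python
import random

def impostors_verify(unknown_vec, suspect_vec, impostor_vecs,
                     trials=100, subset_frac=0.5, threshold=11, seed=0):
    """Count how often the suspect beats every impostor on a random feature
    subset; vectors are dicts mapping feature -> value."""
    rng = random.Random(seed)
    features = list(unknown_vec)

    def cosine(a, b, feats):
        dot = sum(a.get(f, 0.0) * b.get(f, 0.0) for f in feats)
        na = sum(a.get(f, 0.0) ** 2 for f in feats) ** 0.5
        nb = sum(b.get(f, 0.0) ** 2 for f in feats) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    wins = 0
    for _ in range(trials):
        feats = rng.sample(features, max(1, int(len(features) * subset_frac)))
        suspect_sim = cosine(unknown_vec, suspect_vec, feats)
        if all(suspect_sim > cosine(unknown_vec, imp, feats) for imp in impostor_vecs):
            wins += 1
    return wins >= threshold     # per the notes: at least 11 wins out of 100
```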
significantly more difficult than basic attribution
Authorship attribution or identification
- the identification of the real author of a disputed anonymous document
a text categorization or text classification problem (multi-class single-label categorisation task)
data cleaning, feature extraction and normalization, documents as feature vectors, training (develop a classification model) and testing (validate the developed model) sets
simple AA problem
- small, closed set of candidate authors, a lot of training text by each author
similarity-based methods
a metric is used to computationally measure the similarity between two documents
the anonymous document is attributed to that author whose known writing (considered collectively as a single document) is most similar
focused on the choice of features for document representation
machine-learning methods
the known writings of each candidate author (considered as a set of distinct training documents) are used to construct a classifier that can then be used to categorize anonymous documents
represent each of a set of training documents as a numerical vector, then use a learning algorithm to find a formal rule, known as a classifier, that assigns each such training vector to its known author
Results: the best methods are SVM and Bayesian logistic regression; large sets of very simple features are more accurate than small sets of sophisticated features
Best stylistic features - POS + FW
many-candidates problem / needle-in-a-haystack problem
- which among thousands of candidate authors is the author of a given text
for short documents
similarity-based approach with the robustness of the similarity taken into account to filter false positives
Koppel et al 2012: represent each text as a vector, features: the 100 000 most frequent character 4-grams, use a cosine similarity, return the author with the closest writing - this will yield 46% accuracy
improvement: allow for "Don't know" answers
feature randomisation technique: check if a given author proves to be most similar to the test snippet for many different randomly selected feature sets of fixed size
if the proportion of times where one author is a top match is above a threshold - choose as the author, if not - uncertain result
OR Meta-learning: use an SVM to learn a meta-model that decides whether a given pair of text and author is reliable, based on a reliability score
if the reliability score is above a predetermined threshold, the author is determined, otherwise the answer is "Don't know"
if the author is not in the candidate set, the performance degrades
Koppel et al 2009: 4 variants of tf-idf representations based on style and content features and a cosine measure. Content features give up to 56% accuracy, style feature set only 6%
Adversarial Attacks
AA methods are used as evidence in court -> need for reliable and validated methods
it is assumed that authors did not try to disguise their identity or change their writing style in any way, but this might not always be the case
Types
obfuscation attacks
- hiding one's identity, changing the writing style. Participants reported using shorter sentences and less descriptive words
imitation attacks
- imitating someone else's writing style. Participants tried using descriptive and grim language.
Goal
- discover how robust AA methods are against attacks
Setup
15 participants with no background in linguistics
~5000 words of pre-existing sample writing from a formal source (essays, reports etc) -> consistent style, limit the errors
the sample was split into 500 word passages
obfuscation task - write 500 words to describe their neighbourhood trying to change their writing style
imitation task - given a 2500 word sample from The Road by Cormac McCarthy and asked to write about their day from a 3rd person perspective
Methods
Neural Network
9 features: number of different words, lexical density, Gunning-Fog readability index, character count without whitespace, average syllables per word, sentence count, average sentence length, and an alternative readability measure
metrics generated automatically using the Textalyser tool (it ignores the numbers, applies a stop list of common words, ignores words shorter than 3 characters)
average accuracy of 78.5% in 16 trials of repeated random sub-sampling validation
Synonym-based classifier
based on the vocabulary of the author, the way they choose synonyms
Model 1 calculates a vector for words based on the number of synonyms in WordNet and the shared frequency of the word in the test text and a training sample. The author of the text with the highest value is the most probable author.
Model 2 expands by adding a stop word list and takes into account the overall frequency of a word in all of the available text (the second addition was not useful)
The accuracy of Model 1 + the stop word list was 91.67%
Attribution was made based on the average of the highest match value of a single text and the highest match value of all samples of the author
Statistical using the Signature Stylometric System
features - word lengths, letter usage, and punctuation
compares a sample text with each author profile and sums the Chi-square result of the two texts
the minimum value shows the most probable author
testing is not automated -> less robust, but shows around 95% accuracy
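A rough sketch of a chi-square style profile comparison; the Signature tool's exact computation is not given in the source, so this only approximates the idea (observed sample counts vs counts expected from the author profile):

```python
def chi_square_distance(sample_counts, profile_counts):
    """Chi-square style score between a sample and an author profile;
    inputs are dicts of feature counts (word lengths, letters, punctuation)."""
    total_sample = sum(sample_counts.values())
    total_profile = sum(profile_counts.values())
    score = 0.0
    for f in set(sample_counts) | set(profile_counts):
        observed = sample_counts.get(f, 0)
        expected = (total_sample * profile_counts.get(f, 0) / total_profile
                    if total_profile else 0.0)
        if expected > 0:
            score += (observed - expected) ** 2 / expected
    return score   # the smallest value over author profiles = most probable author
```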
Evaluation
random sets of 2, 3, 4 and 5 authors were tested with all three methods, and then the methods were tested against each attack
the accuracy of all three methods without attacks was much higher than chance
with the obfuscation attack: in only three cases did the methods do better than chance - Signature (62.5%) and Synonym (56.3%) with 2 authors, and the Neural network with 5 authors (25%)
with the imitation attack: the methods perform very poorly, far below chance; average accuracy was below 5% and only once was 10% performance achieved.
all three methods largely attributed authorship to the intended victim; only once was the attribution rate less than 60%. The worst performance - the Synonym method (McCarthy was identified as the author in 87.5% to 93.75% of cases)