Authorship Attribution
Stylometric Features - measurable properties of a text that can be used to characterize the text
Syntactic
Patterns used to form sentences
relative frequencies of different syntactic constructions
e.g. the frequencies of short sequences of parts-of-speech (POS), or combinations of POS and other classes of words
language-dependent because they require NLP tools; the features produce noisy data due to parser errors
approaches
frequencies of rewrite rules
combines both the syntactic class of each word and the info about how the words are combined into phrases
require an accurate fully-automated parser
detect sentence and phrase boundaries and use phrase type counts and phrase type lengths as features (e.g. noun phrase count, VP length etc.)
Analysis-level measures - specific not only to the language but also to the tool used: the tool analyses the text in several steps, from simple cases up to the combination of outcomes that produces complex results. The feature is the percentage of text that each step of the tool was able to analyse
ordered streams of syntactic labels extracted from a partial parser (e.g. NP consisting of DET, ADJ, Noun) and use bigram frequencies
POS tag frequencies and POS tag n-gram frequencies
analysis of syntactic errors - with a spell-checker
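A minimal sketch (not from the source) of extracting POS-tag bigram frequencies, one of the syntactic feature types listed above; it assumes NLTK with the 'punkt' and 'averaged_perceptron_tagger' models installed:

```python
from collections import Counter
import nltk

def pos_bigram_features(text, top_k=50):
    """Relative frequencies of POS-tag bigrams (illustrative sketch)."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    bigrams = Counter(zip(tags, tags[1:]))          # count adjacent tag pairs
    total = sum(bigrams.values()) or 1
    return {bg: count / total for bg, count in bigrams.most_common(top_k)}
```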
Structural
concern the structure of a document
e.g. how sentences are organized within paragraphs and paragraphs within documents, presence/absence of greetings and farewell remarks and their position within the e-mail body
Lexical
text is viewed as a sequence of tokens (word, number, punctuation mark)
e.g. sentence length, word length, spelling errors (letter omissions and insertions), formatting errors (all caps)
Advantage - can be applied to any corpus and any language, only need a tokenizer
Vocabulary richness functions
- quantify the diversity of the vocabulary of a text. BUT - are dependent on text length -> unreliable to use alone
E.g. TTR (type-token ratio), Yule's K measure (a measure of vocabulary repetitiveness; the smaller the measure, the richer the vocabulary), number of hapax legomena (words occurring once), number of hapax dislegomena (words occurring twice) - see the sketch below
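A minimal sketch of how these vocabulary richness measures can be computed; the token list in the example is illustrative only:

```python
from collections import Counter

def vocabulary_richness(tokens):
    """TTR, Yule's K, hapax legomena and dislegomena for a token list."""
    n = len(tokens)
    freqs = Counter(tokens)
    ttr = len(freqs) / n                                 # type-token ratio
    hapax = sum(1 for c in freqs.values() if c == 1)     # words occurring once
    dis = sum(1 for c in freqs.values() if c == 2)       # words occurring twice
    # Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2, lower K = richer vocabulary
    spectrum = Counter(freqs.values())                   # V_i: types with frequency i
    k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)
    return {"TTR": ttr, "YuleK": k, "hapax": hapax, "dislegomena": dis}

print(vocabulary_richness("the cat sat on the mat and the dog sat too".split()))
```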
two approaches
bag of words
representing text by vectors of word frequencies disregarding contextual information
the most common words are function words; they are reliable for AA because they are topic-independent and used unconsciously (minimising the risk of being deceived); the precise choice of the FW list is not that important
simple and successful - use the most frequent words in the corpus as features
the size of the set can vary: 100, 250, even any word that has been used at least twice
first dozens are function words, then some content words as well
another approach - represent words in an abstract form: combine frequency, length, and some characters from the token
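A hedged sketch of the most-frequent-words / function-word representation described above; the vocabulary size of 100 is just one of the values mentioned, and whitespace tokenisation is a simplification:

```python
from collections import Counter

def most_frequent_word_features(train_texts, test_text, vocab_size=100):
    """Bag-of-words over the most frequent corpus words (mostly function words),
    represented as relative frequencies."""
    corpus_counts = Counter(w for t in train_texts for w in t.lower().split())
    vocab = [w for w, _ in corpus_counts.most_common(vocab_size)]

    def vectorise(text):
        tokens = text.lower().split()
        counts = Counter(tokens)
        total = max(len(tokens), 1)
        return [counts[w] / total for w in vocab]

    return vocab, [vectorise(t) for t in train_texts], vectorise(test_text)
```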
n-grams
advantage - take the context into account
the results are not always better than with the bag-of-words features
Disadvantages: dimensionality increases considerably, features become very sparse (capture little info), and it is probable that they capture content info rather than style
Content-specific
Frequency of content specific keywords
Character features
counts of individual alphabetic characters, frequency of special characters, average number of characters per word or sentence, capital letter proportion, letter frequencies, punctuation mark counts
text is viewed as a sequence of characters
Character n-grams - various frequencies
might be useful for capturing lexical, grammatical and orthographic preferences
work well in different languages and different areas of attribution + useful for profiling
But: many n-grams are associated with particular content words and roots
dimensionality is considerably increased, capture redundant info, many n-grams are needed to represent a long word
important to define n: a large n captures context and content info + thematic information, but it increases the dimensionality; a small n represents syllable-like info, but not contextual info.
selection of n is language-dependent (larger ones preferable for languages with longer words, such as German), for English - up to 4-grams
tolerant to noise (e.g. spelling errors), good solution for oriental languages where tokenisation is not trivial
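A small illustrative sketch of character n-gram profiles (4-grams, the usual upper bound for English per the note above); the top_k cut-off is an assumption to limit dimensionality:

```python
from collections import Counter

def char_ngram_profile(text, n=4, top_k=1000):
    """Relative frequencies of the top_k most frequent character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(top_k)}
```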
compression-based approaches
can be said to use this type of representation since they describe the characteristics of text based on repetitions of character sequences
Complexity measures
e.g. average word length (or more generally, word length distribution) in terms of syllables or letters and average number of words in sentence
when these proved inadequate - TTR and other vocabulary richness features were proposed
Content words
patterns of lexical choice, e.g. large vs big
modeling the relative frequencies of content words
Sequences and collocations of content words also can be useful
Other
morphological analysis for languages with rich morphology (e.g. Greek, Hebrew)
Frequencies of punctuation habits or orthographic / syntactic errors and idiosyncrasies. Problem - availability of accurate spell-checkers is problematic
for HTML documents: font colour counts and font size counts
~ 1000 different measures proposed so far
Semantic
Use of semantic dependency graphs
represent binary semantic features (number and person of nouns, tense and aspect of verbs etc) and semantic modification relations (relations btw node and its daughters)
Based on WordNet
info about synonyms and hypernyms, causal verbs
Also: latent semantic analysis of lexical features to detect similarities btw words
Based on a set of functional features
certain words or phrases are associated with semantic information, e.g. conjunction, elaboration, enhancement
Feature selection and extraction
are applied to reduce the dimensionality and avoid overfitting on the training data
criteria for selecting features
most important - frequency: the more frequent a feature, the more stylistic variation it captures
the instability of features: instability reflects the availability of synonyms or alternative forms for a certain characteristic; unstable features are more likely to indicate stylistic choices of the author
Methods / Techniques
based on how the training texts are handled
profile-based approaches
(concatenating all texts per author)
probabilistic models
based on Bayes' theorem, with the assumption that the occurrences of the features are mutually independent
the conditional probabilities are estimated from the concatenation x_a of all available training texts of author a and from the concatenation of all the remaining texts, respectively
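A hedged sketch of a profile-based Naive Bayes attribution model under the independence assumption; Laplace smoothing and whitespace tokenisation are illustrative simplifications:

```python
import math
from collections import Counter

def naive_bayes_profile(train_texts_by_author, unknown, alpha=1.0):
    """Concatenate each author's training texts into one profile, estimate
    smoothed word probabilities, and score the unknown text."""
    profiles = {a: Counter(" ".join(texts).lower().split())
                for a, texts in train_texts_by_author.items()}
    vocab = set(w for p in profiles.values() for w in p)
    scores = {}
    for a, counts in profiles.items():
        total = sum(counts.values())
        scores[a] = sum(
            math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
            for w in unknown.lower().split()
        )
    return max(scores, key=scores.get)   # most probable author
```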
compression models
all texts per author are concatenated and compressed. Then the unseen text is appended to each author file and compressed again. The difference in compressed size indicates the similarity of the unseen text to the other texts by that author. This effectively calculates the cross-entropy
RAR compression algorithm is the most accurate
Normalized Compression Distance (NCD)
- assesses the similarity between a pair of documents by measuring the improvement achieved by compressing an information-rich document using the information found in the other document
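A minimal NCD sketch; zlib is used here only for simplicity, even though the notes report RAR/PPM-style compressors as more accurate:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def attribute_by_compression(unknown: str, author_profiles: dict) -> str:
    """Attribute the unknown text to the author whose concatenated training
    texts compress it best (lowest NCD)."""
    u = unknown.encode()
    return min(author_profiles, key=lambda a: ncd(author_profiles[a].encode(), u))
```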
simple training process, just the extraction of profiles for the candidate authors
Common n-grams (CNG) model
the profile is composed by the most common character n-grams of that text
the method computes the dissimilarity btw two profiles by calculating the relative difference btw their common n-grams
disadvantage - the CNG distance function fails when the training corpus is imbalanced (it favours shorter profile texts)
Solution - Simplified profile intersection - counts the number of common n-grams of the two profiles (only problematic when one author has more text than the others; it will favour that author)
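An illustrative sketch of a CNG-style dissimilarity (in the spirit of Kešelj et al.'s relative-difference formula; the exact variant is an assumption, since the notes only describe it in words) and of the simplified profile intersection; profiles are dicts of n-gram relative frequencies:

```python
def cng_dissimilarity(p1, p2):
    """Sum of squared relative frequency differences over both profiles
    (assumes profile frequencies are strictly positive)."""
    grams = set(p1) | set(p2)
    return sum(
        (2 * (p1.get(g, 0.0) - p2.get(g, 0.0)) / (p1.get(g, 0.0) + p2.get(g, 0.0))) ** 2
        for g in grams
    )

def simplified_profile_intersection(p1, p2):
    """Simplified Profile Intersection: count the shared n-grams."""
    return len(set(p1) & set(p2))
```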
may produce more reliable representations for short texts, does not require a segmentation of a long text
instance-based approaches
(texts are represented individually as a separate instance of authorial style)
machine learning classifiers and clustering algorithms
training texts are represented as labeled numerical vectors and learning methods are used to find boundaries between classes (authors) that minimize some classification loss function
e.g. k-means, decision trees, SVM, back-propagation neural networks
each text is a vector in a multivariate space
SVMs avoid overfitting even with a large feature set -> one of the best solutions
effectiveness is diminished when classes are imbalanced
re-balance the training set by segmenting texts
text re-sampling, i.e. using some text parts more than once
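A short sketch of the instance-based setup with scikit-learn (assumed available); the texts, labels and the character 3-gram choice are placeholders, not the source's exact configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: each training text is labelled with its author.
train_texts = ["...text by author A...", "...text by author B...", "...more text by A..."]
train_authors = ["A", "B", "A"]

# Character 3-grams as features; a linear SVM handles the resulting
# high-dimensional, sparse vectors well.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 3)),
    LinearSVC(),
)
clf.fit(train_texts, train_authors)
print(clf.predict(["...anonymous text..."]))
```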
inter-textual distance / similarity-based models
if the vocabulary used in two texts is similar, the texts are closer and it is possible that they were written by the same person
Delta measure, chi-square distance and Kullback-Leibler divergence
calculation of pairwise similarity measures btw unseen text and all the training texts, then using nearest-neighbour algorithm to find the most likely author
Burrows delta - calculates the z-distributions of a set of function words, then calculates deviations from the norm for each document. Then the difference btw a set of training texts by one author and the z-score of the test text is calculated. The smaller the delta, the greater the similarity (sketch below)
training text length is important, performance decreases with decreasing text size
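A hedged sketch of Burrows' Delta as described above (z-scores of frequent-word relative frequencies, mean absolute difference); the input format and the reference-corpus handling are assumptions:

```python
import statistics

def burrows_delta(author_freqs, test_freqs, corpus_freqs):
    """author_freqs / test_freqs: relative frequency dicts for the author
    profile and the unseen text; corpus_freqs: list of per-document frequency
    dicts used to estimate each word's mean and standard deviation.
    Assumes all corpus documents share the same (non-empty) feature set."""
    features = set(corpus_freqs[0])
    delta = 0.0
    for f in features:
        values = [doc.get(f, 0.0) for doc in corpus_freqs]
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values) or 1e-9      # avoid division by zero
        z_author = (author_freqs.get(f, 0.0) - mu) / sigma
        z_test = (test_freqs.get(f, 0.0) - mu) / sigma
        delta += abs(z_author - z_test)
    return delta / len(features)                       # smaller = more similar
```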
Meta-learning models - unmasking (for longer texts)
can handle high-dimensional sparse data with machine learning algorithms, combine different types of features + can handle structural features
Hybrid approaches
the training texts were represented separately, but then the vectors were averaged to produce an author profile
the distance btw the test text and author profile was calculated by a weighted feature-wise function with parameters for the difference between the feature values of the unseen text profile and the author profile, the feature importance for the unseen text, and the feature importance for the particular author
based on the number of features used
Unitary Invariant Approach
(traditional human expert-based methods)
a single numeric function of a text is sought to discriminate between authors; this approach has proven to be unreliable
the writing of each author could be characterized by a unique curve expressing the relationship between word length and relative frequency of occurrence
Mendenhall (1887) - the authorship of texts attributed to Bacon, Marlowe, and Shakespeare; Mascol (1888a, 1888b) - the authorship of the gospels of the New Testament
early 20th century - the search for invariant properties of textual statistics (Zipf, 1932)
Yule (1944) considered sentence length as a potential method for authorship discrimination
Multivariate Analysis Approach
statistical multivariate discriminant analysis
is applied to word frequencies and related numerical features
Mosteller and Wallace's work (1964) on the authorship of the Federalist Papers: new approach - Naive Bayes with frequencies of a set of function words as features
taking documents as points in some space, and assigning a questioned document to the author whose documents are “closest” to it, according to an appropriate distance measure
Burrows Delta is a reliable distance measure: documents are represented by a profile of the relative frequencies of the most frequent words, a profile is a feature vector, and the Burrows delta corresponds to the Manhattan distance btw the vectors; also reliable - Cosine delta, based on the cosine similarity
other methods - PCA, ANOVA, a probabilistic distance measure such as K-L divergence between Markov model probability distributions of the texts
Machine Learning Approach
modern machine learning methods are applied to sets of training documents to
construct classifiers
that can be applied to new anonymous documents
Training texts are represented as labeled numerical vectors, and learning methods are used to find boundaries between classes (i.e., authors) that minimize some classification loss function
neural networks, k-nearest neighbour, Naive Bayes, SVM, Bayesian regression
Evaluation
Until the late 1990s objective evaluation was hard to perform
testing ground - literary works with disputed authorship
texts too long and not stylistically homogeneous, covering different topics
small sets of candidate authors
inspection was mainly intuitive, based on visual inspection of scatterplots
lack of suitable benchmark data -> comparison of methods not possible
After 1990s
large volume of electronic texts became available
Information Retrieval research provided methods for representing and classifying large volume of texts
machine-learning algorithms that can handle multi-dimensional and sparse data
standard methodologies for comparing the performance of algorithms were devised
NLP tools for analysing and representing the style (e.g. syntax)
Also: factors influencing the performance of different methods are investigated (training text size, number of candidate authors, distribution of training texts over the candidate authors)
General info
definition
the process of examining the characteristics of a piece of work in order to draw conclusions on its authorship
First studies
19th cent - studies of Shakespeare's plays (1887)
most influential - 1964, study of the Federalist Papers: 85 papers published anonymously to convince New Yorkers to ratify the American Constitution; the authorship of 12 papers is heavily contested
1st half of 20th cent - statistical studies
Application areas
literary texts, program codes, online messages, blogs, forensic analysis, intelligence, civil law (copyright disputes)
Parameters: training and test corpus size, length of texts, number of candidate authors, distribution of the training corpus over the authors; author age, nationality, gender; texts should be written in the same period and cover comparable topics and genres.
Types of tasks
Authorship profiling or characterization
determines the author’s profile (gender, age, occupation, educational background, cultural background and language familiarity)
Gender
Content word features are a good predictor
Differ in the use of determiners and prepositions (male), pronouns (female) + in content: male writers use words related to technology, female writers - words related to personal life and relationships
Age
The best performance - combination of style and content features (over 77%)
Style: young writers omit apostrophes, older writers use more determiners and prepositions; Content: teenagers - topics related to school and mood, topics related to work and social life for writers in their 20s, and to family life for those in their 30s
Native Language
Content features give the best results (82%)
Measuring errors and idiosyncrasies is useful
The strongest features are those that are underrepresented, e.g. omissions of definite articles by native speakers of Slavic languages
greater frequency of particular words common in the native language, e.g. indeed for French, however for Bulgarian
Personality (Neuroticism)
Style features perform the best, content features can make the performance worse
Neurotics tend to refer to themselves and use pronouns for subjects rather than objects; non-neurotics tend to be less precise and less concrete, and refer to how things should be done
Authorship Verification or Similarity detection
- just one suspect, compares multiple pieces of work and determines whether or not they are produced by a single author without necessarily identifying the author (e.g. plagiarism detection)
long-text verification problem
- whether two long texts are by the same author
Koppel et al 2012 - unmasking method, a technique for measuring the depth of differences btw two documents
to remove, by stages, the most useful features and to gauge the speed with which cross-validation accuracy degrades after each iteration
for the true author the degradation will be sudden and dramatic, for the other authors it will be slow and smooth
doesn't work for short texts
this happens because the works of one author are different in a small number of features due to thematic, genre or purpose differences, chronological stylistic drift or deliberate attempts by the author to mask their identity
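A rough sketch of the unmasking loop (assumes scikit-learn/NumPy, at least a handful of chunks per document, and enough features to survive all iterations); the number of iterations and features removed per step are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def unmasking_curve(X_a, X_b, iterations=10, remove_per_iter=3):
    """Repeatedly separate chunks of two documents with a linear classifier,
    drop the most discriminating features, and record how quickly the
    cross-validation accuracy degrades (a fast, dramatic drop suggests a
    shared author). X_a, X_b: 2-D arrays of chunk feature vectors."""
    X = np.vstack([X_a, X_b])
    y = np.array([0] * len(X_a) + [1] * len(X_b))
    active = np.arange(X.shape[1])
    curve = []
    for _ in range(iterations):
        curve.append(cross_val_score(LinearSVC(), X[:, active], y, cv=5).mean())
        clf = LinearSVC().fit(X[:, active], y)
        # drop the currently most discriminating features
        top = np.argsort(np.abs(clf.coef_[0]))[::-1][:remove_per_iter]
        active = np.delete(active, top)
    return curve
```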
naive approaches
impostors method
assemble a collection of works by other authors, use a 2-class learner (e.g. SVM), A vs not-A. Then chunk the available anonymous text and run the chunks through the model
if most chunks are classified as A, then it is the author
BUT: it is reasonable to discard the authorship of A if most chunks are classified as not-A, but the reverse is not true. Any author whose style is closer to A than to not-A will be falsely classified as A.
based on cross-validation accuracy
doesn't work well
train a machine learning algorithm and then perform cross-validation. If the accuracy is high, authors are different, if low (i.e. not better than chance) - the same author.
fundamental problem of authorship attribution
- whether two short texts are by the same author
Koppel et al 2012: generate a set of impostors, then use the method for short-text AA with many candidates: use a similarity-based approach and measure similarity with a randomly selected subset of features for 100 trials; only if the author of the other text is returned as the most probable author in at least 11 trials out of 100 can we assume they are the real author (sketch below)
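A hedged sketch of the impostors-style verification with feature randomisation; the cosine measure, subset size and the 11/100 threshold follow the description above, everything else (vector format, sampling fraction) is illustrative:

```python
import random

def impostors_verify(unknown_vec, suspect_vec, impostor_vecs,
                     trials=100, subset_frac=0.5, threshold=11, seed=0):
    """Count how often the suspect beats every impostor on a random feature
    subset; vectors are dicts mapping feature -> value."""
    rng = random.Random(seed)
    features = list(unknown_vec)

    def cosine(a, b, feats):
        dot = sum(a.get(f, 0.0) * b.get(f, 0.0) for f in feats)
        na = sum(a.get(f, 0.0) ** 2 for f in feats) ** 0.5
        nb = sum(b.get(f, 0.0) ** 2 for f in feats) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    wins = 0
    for _ in range(trials):
        feats = rng.sample(features, max(1, int(len(features) * subset_frac)))
        suspect_sim = cosine(unknown_vec, suspect_vec, feats)
        if all(suspect_sim > cosine(unknown_vec, imp, feats) for imp in impostor_vecs):
            wins += 1
    return wins >= threshold     # per the notes: at least 11 wins out of 100
```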
significantly more difficult than basic attribution
Authorship attribution or identification
- the identification of the real author of a disputed anonymous document
a text categorization or text classification problem (multi-class single-label categorisation task)
data cleaning, feature extraction and normalization, documents as feature vectors, training (develop a classification model) and testing (validate the developed model) sets
simple AA problem
- small, closed set of candidate authors, a lot of training text by each author
similarity-based methods
a metric is used to computationally measure the similarity between two documents
the anonymous document is attributed to that author whose known writing (considered collectively as a single document) is most similar
focused on the choice of features for document representation
machine-learning methods
the known writings of each candidate author (considered as a set of distinct training documents) are used to construct a classifier that can then be used to categorize anonymous documents
represent each of a set of training documents as a numerical vector, then use a learning algorithm to find a formal rule, known as a classifier, that assigns each such training vector to its known author
Results: the best methods are SVM and Bayesian logistic regression; large sets of very simple features are more accurate than small sets of sophisticated features
Best stylistic features - POS + FW
many-candidates problem / needle-in-a-haystack problem
- which among thousands of candidate authors is the author of a given text
for short documents
similarity-based approach with the robustness of the similarity taken into account to filter false positives
Koppel et al 2012: represent each text as a vector, features: the 100 000 most frequent character 4-grams, use a cosine similarity, return the author with the closest writing - this will yield 46% accuracy
improvement: allow for "Don't know" answers
feature randomisation technique: check if a given author proves to be most similar to the test snippet for many different randomly selected feature sets of fixed size
if the proportion of times where one author is a top match is above a threshold - choose as the author, if not - uncertain result
OR Meta-learning: use an SVM to learn a meta-model that decides whether a given pair of text and author is reliable, based on a reliability score
if the reliability score is above a predetermined threshold, the author is determined, otherwise the answer is "Don't know"
if the author is not in the candidate set, the performance degrades
Koppel et al 2009: 4 variants of tf-idf representations based on style and content features and a cosine measure. Content features give up to 56% accuracy, style feature set only 6%
Adversarial Attacks
AA methods are used as evidence in court -> need for reliable and validated methods
it is assumed that authors did not try to disguise their identity or change their writing style in any way, but this might not always be the case
Types
obfuscation attacks
- hiding one's identity, changing the writing style. Participants reported using shorter sentences and less descriptive words
imitation attacks
- imitating someone else's writing style. Participants tried using descriptive and grim language.
Goal
- discover how robust AA methods are against attacks
Setup
15 participants with no background in linguistics
~5000 words of pre-existing sample writing from a formal source (essays, reports etc) -> consistent style, limit the errors
the sample was split into 500 word passages
obfuscation task - write 500 words to describe their neighbourhood trying to change their writing style
imitation task - given a 2500 word sample from The Road by Cormac McCarthy and asked to write about their day from a 3rd person perspective
Methods
Neural Network
9 features: number of different words, lexical density, Gunning-Fog readability index, character count without whitespace, average syllables per word, sentence count, average sentence length, and an alternative readability measure
metrics generated automatically using the Textalyser tool (it ignores the numbers, applies a stop list of common words, ignores words shorter than 3 characters)
average accuracy of 78.5% in 16 trials of repeated random sub-sampling validation
Synonym-based classifier
based on the vocabulary of the author, the way they choose synonyms
Model 1 calculates a vector for words based on the number of synonyms in WordNet and the shared frequency of the word in the test text and a training sample. The author of the text with the highest value is the most probable author.
Model 2 expands by adding a stop word list and takes into account the overall frequency of a word in all of the available text (the second addition was not useful)
The accuracy of Model 1 + the stop word list was 91.67%
Attribution was made based on the average of the highest match value of a single text and the highest match value of all samples of the author
Statistical using the Signature Stylometric System
features - word lengths, letter usage, and punctuation
compares a sample text with each author profile and sums the Chi-square result of the two texts
the minimum value shows the most probable author
testing is not automated -> less robust, but shows around 95% accuracy
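A rough sketch of a chi-square style profile comparison; the Signature tool's exact computation is not given in the source, so this only approximates the idea (observed sample counts vs counts expected from the author profile):

```python
def chi_square_distance(sample_counts, profile_counts):
    """Chi-square style score between a sample and an author profile;
    inputs are dicts of feature counts (word lengths, letters, punctuation)."""
    total_sample = sum(sample_counts.values())
    total_profile = sum(profile_counts.values())
    score = 0.0
    for f in set(sample_counts) | set(profile_counts):
        observed = sample_counts.get(f, 0)
        expected = (total_sample * profile_counts.get(f, 0) / total_profile
                    if total_profile else 0.0)
        if expected > 0:
            score += (observed - expected) ** 2 / expected
    return score   # the smallest value over author profiles = most probable author
```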
Evaluation
random sets of 2, 3, 4 and 5 authors were tested with all three methods, and then the methods were tested against each attack
the accuracy of all three methods without attacks was much higher than chance
with the obfuscation attack: in only three cases did the methods do better than chance - Signature (62.5%) and Synonym (56.3%) with 2 authors, and the Neural network with 5 authors (25%)
with the imitation attack: the methods perform very poorly, far below chance; average accuracy was below 5% and only once was 10% performance achieved.
all three methods largely attributed authorship to the intended victim; only once was the attribution rate less than 60%. The worst performance - the Synonym method (McCarthy was identified as the author in 87.5% to 93.75% of cases)