01. Information Retrieval, Machine Learning, Pre-trained Models, Semi…
Machine Learning
Transformers
- Deep learning model used primarily for language modelling
- Improves on RNNs
- Pros: stronger performance, highly parallelisable
Self-attention
- Learns how to pay attention to relevant words during training
Transformers
- Made up of multiple identical encoder and/or decoder blocks
GPT-2
- Primarily known for text generation capabilities
- Trained on massive corpus of text gathered from pages around the internet.
- Decoder-only: the encoder blocks of the transformer were removed.
- Fine-tuning: training a pretrained model further on a custom dataset.
Data collection
- 20000 articles from the categories.
Link network generation - GRAN
- Deep learning model for graph generation
Week 3 Machine Learning
Linear Regression
- The line with the least distance from the observed points gives the best fit.
- The loss function tells how good the line is.
Optimisation
- Loss function L measures how good a particular line (w, b) is for our training dataset.
Gradient Descent
- Start by picking completely random values for w and b, e.g. by sampling N(0,1).
- Then compute the gradients dL/dw and dL/db
- Then update:
w := w - a*dL/dw
b := b - a*dL/db
- In the L vs w graph, the smaller the L value, the better the model.
- Keep doing these updates over and over until w and b stop changing (a small sketch of this loop follows below).
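- A minimal sketch of this loop in Python with numpy (the data and the learning rate are made-up illustrative values):

    import numpy as np

    # toy 1-D training data (made-up, purely illustrative)
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.0, 3.0, 5.0, 7.0])

    w, b = np.random.randn(), np.random.randn()  # random start, e.g. sampled from N(0,1)
    a = 0.01                                     # learning rate (hyper-parameter)

    for step in range(1000):
        y_hat = w * x + b                        # predictions of the current line
        dL_dw = np.mean(2 * (y_hat - y) * x)     # gradient of the squared-error loss w.r.t. w
        dL_db = np.mean(2 * (y_hat - y))         # gradient w.r.t. b
        w = w - a * dL_dw                        # w := w - a*dL/dw
        b = b - a * dL_db                        # b := b - a*dL/db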
The role of learning rate
- The learning rate a is a hyper-parameter you set.
- Too small a, need to take many steps.
- Too large a, may not converge.
- It is up to you to choose a good value of a.
Higher Dimensions
- A linear function from I inputs to O outputs (a shape sketch follows below)
y = x*W + b
- W is a matrix with shape [I,O]
- b is a vector with shape [O]
- x is a vector with shape [I]
- y is a vector with shape [O]
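- A small numpy sketch of these shapes (I = 3 and O = 2 are made-up sizes):

    import numpy as np

    I, O = 3, 2                   # made-up input/output sizes
    x = np.random.randn(I)        # input vector, shape [I]
    W = np.random.randn(I, O)     # weight matrix, shape [I, O]
    b = np.random.randn(O)        # bias vector, shape [O]
    y = x @ W + b                 # output vector, shape [O]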
Classification
- To deal with discrete (categorical) labels.
- The model's outputs can be anything, so we need to turn them into probabilities:
- need all the output values to be non-negative.
- need output values to sum up to 1.
Classification: loss
- Compute the loss between our predicted probabilities and the one-hot encoding of the class label y, i.e. the probability of the correct class should be 1 and all others should be 0. Each output is between 0 and 1.
- Step 1: apply the exponential function to all values. This makes them positive.
- Step 2: divide each value by the sum of all values. This makes them sum to 1.
- We can pretend that our values are probabilities.
Classification: cross-entropy
- In practice, using cross-entropy (on the softmax outputs) as the loss function for classification gives better results (see the sketch after this list).
- Cross entropy example.
- y is either 0 or 1.
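- A minimal numpy sketch of the two softmax steps and the cross-entropy loss (the raw outputs and class label are made-up):

    import numpy as np

    z = np.array([2.0, -1.0, 0.5])   # raw model outputs (made-up)
    exp_z = np.exp(z)                # step 1: exponential makes all values positive
    p = exp_z / exp_z.sum()          # step 2: divide by the sum so they sum to 1

    y = 1                            # index of the correct class (made-up)
    loss = -np.log(p[y])             # cross-entropy against the one-hot label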
Classification: prediction
- To make a prediction for a new data point, apply your model to compute probabilities for each class. Predict the class with the largest probability.
- When predicting a class in this way, the decision boundaries are straight lines.
Applying linear regression to text
- Our model needs numeric (continuous) input values.
What is Machine Learning
- Machine learning is about prediction
- Examples/features
- Labels/annotations
- Predictor
- Estimating the best predictor = training, i.e. finding a function that does the prediction job.
- The predictor function can make predictions very well, but that does not mean it understands the data.
Week 4
Representation in NLP
Vector Model in IR
- Finding the cosine similarity between a query and a document, e.g. score = cos(V_query, V_document) (a small sketch follows below).
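- A small numpy sketch of this score (the query and document vectors are made-up):

    import numpy as np

    v_query = np.array([1.0, 0.0, 2.0])   # made-up query vector
    v_doc = np.array([0.5, 1.0, 1.5])     # made-up document vector

    # score = cos(V_query, V_document)
    score = v_query @ v_doc / (np.linalg.norm(v_query) * np.linalg.norm(v_doc))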
Traditional NLP
- Regards words as discrete symbols
- e.g. hotel = [000000000010000]
- Vector dimension = number of words in vocabulary (e.g. 500,000)
- 'Hotel' and 'motel' have (almost) the same semantic meaning, but have different representations.
Deep neural network
Linear models to neural networks
- Fit some other type of function to our data if it does not follow a linear relationship.
- 𝜙 is some function that is non linear and (mostly) differentiable.
- 𝜙 is called the ‘activation function’.
- NN(x) is a hidden vector if it is in the middle.
- Input layer is the first layer.
- Output layer is the last layer.
- All other layers are 'hidden layers'.
- More hidden layers means more compositions of activation functions.
- More neurons can approximate more complicated functions, and therefore model the data better.
- If too many neurons are added, the model can fit noise in the training dataset, causing overfitting.
Key idea of neural networks
- Stack multiple linear regression models.
- Output is used for the input of next model.
Rectified Linear Unit (ReLU)
- ReLU(x) = max(x,0)
- Fast to compute
- Works well in practice
- Each ReLU neuron represents one ReLU graph (one piecewise-linear piece of the overall function); a small sketch of stacking them follows below.
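- A minimal numpy sketch of stacking two linear models with a ReLU in between (all sizes and weights are made-up):

    import numpy as np

    def relu(x):
        return np.maximum(x, 0)            # ReLU(x) = max(x, 0)

    x = np.random.randn(4)                 # input vector (made-up size)
    W1, b1 = np.random.randn(4, 8), np.zeros(8)
    W2, b2 = np.random.randn(8, 3), np.zeros(3)

    h1 = relu(x @ W1 + b1)                 # hidden layer: its output feeds the next model
    y = h1 @ W2 + b2                       # output layer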
Deep neural network
- Feed-forward neural network
Feature Learning
- One way of looking at a deep neural network is as feature learning.
- i.e. the job of the first layer (h1) is to transform its input into a linearly separable representation, so the final layer can do its job properly.
- So a straight line (as decision boundary) can separate the classes in the final layer. If the model is good, the straight line should separate the classes.
Stochastic Gradient Descent
- In standard GD, the gradients are computed on the loss of the entire training dataset, which is a problem if the dataset is very large.
- For SGD, at each update we randomly sample a batch of B data points and compute gradients only on the loss of these points.
- SGD is much faster to compute.
- SGD also acts as a regulariser: models trained with SGD usually generalise better, without overfitting the training dataset (a sketch of the sampling step follows below).
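- A sketch of the SGD sampling step in numpy, assuming a dataset X, Y and a batch size B (all made-up here):

    import numpy as np

    X = np.random.randn(10000, 5)    # made-up dataset
    Y = np.random.randn(10000)
    B = 32                           # batch size

    idx = np.random.choice(len(X), size=B, replace=False)  # randomly sample B data points
    x_batch, y_batch = X[idx], Y[idx]
    # gradients are then computed only on the loss of this batch,
    # and w, b are updated exactly as in standard gradient descent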
Backpropagation on Computation Graph
- Forward pass:
for i = 1, ..., N: compute v_i as a function of Pa(v_i).
- Backward pass:
for i = N-1, ..., 1: compute dL/dv_i from the derivatives of the nodes that use v_i (chain rule).
- In short: the forward pass computes the loss, the backward pass computes the derivatives.
Gradient Descent
- Simple Regression Model
- Chain Rule:
dL/dw = (dL/dŷ) * (dŷ/dw)
dL/db = (dL/dŷ) * (dŷ/db)
Attention
Translation
- Both the input and the output are sequences.
Problem with encoder-decoders
- All the information is compressed into one vector.
- It is hard for the decoder (a linear model) to extract the information it needs from that single vector.
Attention
- Instead of forcing the decoder to learn how to extract the information it needs from h, we just give it a vector computed specifically for the current step (a small sketch follows below).
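- A minimal numpy sketch of dot-product attention: the decoder's current state scores the encoder states, and the scores become weights for a context vector built specifically for this step (all values are made-up):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    H = np.random.randn(6, 8)     # encoder states, one per input word (made-up)
    q = np.random.randn(8)        # decoder state at the current step (made-up)

    scores = H @ q                # relevance of each input word right now
    weights = softmax(scores)     # attention weights, sum to 1
    context = weights @ H         # vector given to the decoder for this step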
Semi-supervised learning
- It is expensive to do data labelling manually.
- It would be good to have a model that trains on some labelled data and some unlabelled data.
Self-supervised learning
- The labels are generated by the computer itself, not by humans.
- Semi-supervised means you use self-supervised and supervised.
- So pre-training a language model and then fine-tuning it on a labelled dataset is semi-supervised.
- The labelled dataset is only used during fine-tuning.
Transformer Models
- GPT, BERT, GPT-2, GPT-3.
- GPT-3 is the largest trained model so far.
- The test accuracy has been increasing as the model gets larger.
- Use raw text to train the model to predict what comes next.
Pre-training
- BERT is different to GPT: rather than just looking at the left context to predict the right, it looks at all the words in the text.
- 80% of the target words are replaced by [MASK]
- 10% are replaced with another random word
- 10% are left the same (a rough sketch of this masking rule follows below)
- Google Docs uses a transformer to do word correction.
- BERT does sentence prediction as well. It does well on question answering and comprehension (it adds each word's vectors together).
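- A rough Python sketch of the masking rule above (the sentence, vocabulary, and the 15% target rate are illustrative; real BERT works on subword token IDs):

    import random

    tokens = ["the", "hotel", "is", "near", "the", "station"]    # made-up sentence
    vocab = ["the", "hotel", "motel", "is", "near", "station"]   # made-up vocabulary

    masked = []
    for tok in tokens:
        if random.random() < 0.15:                   # pick ~15% of tokens as targets
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: replace with a random word
            else:
                masked.append(tok)                   # 10%: leave the same
        else:
            masked.append(tok)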
BERT model fine-tuning
- Using BERT on multiple tasks
XLNet
- Randomly picking the ordering of the text
Enterprise Analytics
- Data scientists sit upstream of the analytics engineers in the data pipeline.
- SMRs (Suspicious Matter Reports) are the main transaction report type collected by AUSTRAC.
- The data includes semi-structured and unstructured data.
- There are 50,000 SMRs each year.
Crimes check requirement
- The SMR reports are checked against scams, fraud, cash deposits, money laundering, tax evasion, and terrorism financing.
- This task can be seen as binary classification (one classifier per label).
- Input: an SMR, which then goes through machine learning models (label classifiers).
- Output: a predicted label (or score) for each crime type.
- The pre-defined labels need to be assigned manually.
Machine learning process
- We have the labels. Then the task is to use the labels and to do predictions.
- The iteration goes on for a number of times, until it gives good validation accuracy.
NLP Work
- The SMR is turned into bag of words representation.
- It is turned into a TF-IDF representation to reduce the influence of high-frequency words (e.g. 'money').
Process:
- Remove space, remove non-alphanumeric, etc.
- No stemming (not real words sometimes) or lemmatisation (real words)
- Filter out terms with <0.1 or >0.9 document frequency (a sketch of this step follows below).
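- A sketch of this step with scikit-learn's TfidfVectorizer (the example documents are made-up; min_df/max_df mirror the filtering above):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["suspicious cash deposit reported",
            "possible scam money transfer"]          # made-up stand-ins for SMR text

    vectorizer = TfidfVectorizer(
        lowercase=True,
        token_pattern=r"[a-z0-9]+",   # keep alphanumeric tokens only
        min_df=0.1,                   # drop terms in fewer than 10% of documents
        max_df=0.9,                   # drop terms in more than 90% of documents
    )
    X = vectorizer.fit_transform(docs)  # TF-IDF matrix fed to the classifiers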
Classifier
- A binary classifier for each label, using boosted decision trees (boosting passes the learning from one tree to the next tree).
Output
- Classifier threshold = 0.29 (depending on the type of crime)
- Scores below 0.29 are considered OK; the report is flagged as SCAM if above 0.29.
Advanced models
- No significant accuracy improvement from advanced techniques like Word2Vec or other learning-based models.
- Deep learning takes longer to train.
- Not as flexible and maintainable as simple models (TF-IDF).
Patterns
- There are different ways of representing the patterns, e.g. account numbers.
- This allows searching with regular expressions (a small sketch follows below).
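- A small Python sketch of such a regular-expression search (the account-number format and text are made-up):

    import re

    text = "Funds moved from account 123-456-789 to an offshore account."  # made-up
    pattern = r"\b\d{3}-\d{3}-\d{3}\b"   # hypothetical account-number pattern

    matches = re.findall(pattern, text)  # ['123-456-789']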
Hard vs Soft Clustering
- Hard clustering: each doc is in exactly one cluster.
- Soft clustering: a doc can be in more than one cluster.
Flat algorithms
- Flat algorithms compute a partition of N documents into a set of K clusters.
- Effective heuristic method: K-means algorithm
Centroids in K-means
- Each cluster in K-means is defined by a centroid.
- w is a cluster.
- Centroids act as centers of gravity of points in cluster w.
K-means
- Objective: minimise the average squared difference from the centroid.
- Iterating the K-means algorithm (a small sketch follows after this list):
- reassignment: assign each point to its closest centroid
- recomputation: recompute each centroid as the mean of its cluster
- K-means is guaranteed to converge.
- RSS (residual sum of squares): the sum of all squared distances between each document vector and its closest centroid. RSS decreases during each reassignment step.
- However, K-means is only guaranteed to converge to a local minimum, not the global one.
- Random seed selection is one way of K-means initialisation, but not robust.
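- A minimal numpy sketch of the two iterated K-means steps (made-up 2-D points, K = 2):

    import numpy as np

    X = np.random.randn(100, 2)    # made-up document vectors
    K = 2
    centroids = X[np.random.choice(len(X), K, replace=False)]  # random seed selection

    for _ in range(10):
        # reassignment: each point goes to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recomputation: each centroid becomes the mean of its cluster
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

    rss = ((X - centroids[labels]) ** 2).sum()  # residual sum of squares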
Recomputation decreases average distance
Evaluation
- RSS in K-means
- Ω is the set of clusters
- C is the set of classes
How to choose the value of K
- External constraint on K.
- Define an optimisation criterion.
- Find the K for which the optimum is reached.
- K = Nclusters