AI
DL
NLP
word2vec app
representation
distributional similarity (a word is represented by its neighbours / its context)
algorithms
direct prediction
skip-gram
probability in softmax form over two sets of vectors (context and center; using two vectors per word makes the math easier): given the center word, predict the probability of each context word, normalised over the whole vocab
softmax
the exponential makes bigger scores dominate, pushing the result toward the max, while still not one-hot; hence the name "soft" max
d is the vector dimension, V is the vocab size; parameter size: 2dV (a center and a context vector for every word)
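The softmax probability above, written out (v_c = center-word vector, u_o = outside/context-word vector; this is the standard skip-gram form):

P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}

For intuition on the "soft" part: softmax([3, 1, 0]) ≈ [0.84, 0.11, 0.04], so the largest score dominates but the output is still not one-hot.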
problems
gradient updates are very sparse: each step only touches the words that appear in the window
can use sparse techniques: only record / apply the non-zero updates
computing the denominator of the probability (for SGD this is really computing the expected outside-word vector) requires a sum over every word in the vocab
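A minimal numpy sketch of why that denominator is the bottleneck: scoring a single (center, outside) pair already needs a dot product with all V context vectors (sizes and names here are illustrative):

import numpy as np

V, d = 10_000, 100                    # vocab size, embedding dimension
v = np.random.randn(V, d) * 0.01      # center-word vectors
u = np.random.randn(V, d) * 0.01      # outside/context-word vectors

def softmax_loss(center, outside):
    scores = u @ v[center]            # O(V*d): the expensive denominator work
    scores -= scores.max()            # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[outside])

print(softmax_loss(3, 17))            # loss for one (center, outside) pair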
Negative Sampling:
use the objective function: log σ(u_o · v_c) plus, summed over ~10 randomly sampled vocab words k, log σ(−u_k · v_c) (written out below)
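The same objective written out as a loss to minimise (K ≈ 10 negatives; sampling them from the unigram distribution raised to the 3/4 power is the standard word2vec choice, assumed here):

J_{neg}(o, v_c, U) = -\log \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \log \sigma(-u_k^\top v_c), \quad k \sim P(w) \propto \mathrm{count}(w)^{3/4}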
count-based
co-occurrence matrix
hacks
cap the counts of overly frequent words at some maximum, say 100 (see the sketch below)
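A minimal sketch of building a window-based co-occurrence matrix with this cap (the toy corpus, window size 1, and the cap value are illustrative):

import numpy as np

corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]
window = 1
vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}

M = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                M[idx[w], idx[words[j]]] += 1

M = np.minimum(M, 100)  # the hack: cap counts of overly frequent words
print(vocab)
print(M)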
use cases
but for some other tasks, such as sentiment analysis, this does not work as well
dependency parsing
dependency structure
a preposition can attach to a noun several words back, but the dependencies will most probably form a nested structure
source of information
intervening material (verbs, punctuation)
dependency parsing
if any dependency arcs cross (which depends on the linear order of the words), the parse is non-projective
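A small sketch of that crossing-arcs test (heads is a hypothetical list where heads[i-1] is the head of word i, with 0 standing for the root):

def is_projective(heads):
    # two arcs cross iff exactly one endpoint of one arc lies
    # strictly inside the span of the other
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for i, (l1, r1) in enumerate(arcs):
        for l2, r2 in arcs[i + 1:]:
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True

print(is_projective([2, 0, 2]))     # nested arcs -> True (projective)
print(is_projective([3, 4, 0, 3]))  # crossing arcs -> False (non-projective)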
TensorFlow
flow graph
nodes (operations, with any number of inputs and outputs)
code
lazy evaluation: building the graph does not run anything; computation happens only when session.run is called
prediction -> loss function; the ground-truth labels are fed in as another placeholder
optimizer.minimize(loss function) adds the backpropagation ops; session.run is called on the resulting train op
every iteration of the training loop performs one update to W and b
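A minimal TF1-style sketch of this whole pattern (a linear model; the names x, y, W, b and the learning rate are illustrative):

import numpy as np
import tensorflow.compat.v1 as tf  # TF1 graph/session API
tf.disable_eager_execution()

# placeholders: inputs and ground-truth labels are both fed from outside
x = tf.placeholder(tf.float32, [None, 3])
y = tf.placeholder(tf.float32, [None, 1])

# variables: the parameters the optimizer will update
W = tf.Variable(tf.zeros([3, 1]))
b = tf.Variable(tf.zeros([1]))

pred = tf.matmul(x, W) + b                   # prediction node
loss = tf.reduce_mean(tf.square(pred - y))   # loss node
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)  # adds backprop ops

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        # one run of train_op = one gradient update to W and b
        sess.run(train_op, feed_dict={x: np.random.rand(8, 3),
                                      y: np.random.rand(8, 1)})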
RNN and language model
RNN
h_t = \sigma(W_{hh} h_{t-1} + W_{hx} x_t), \qquad \hat{y}_t = \mathrm{softmax}(W_S h_t)
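A direct numpy transcription of that step, assuming tanh as the non-linearity (dimensions are illustrative):

import numpy as np

d_h, d_x, V = 50, 100, 10_000
W_hh = np.random.randn(d_h, d_h) * 0.01  # hidden -> hidden
W_hx = np.random.randn(d_h, d_x) * 0.01  # input  -> hidden
W_S  = np.random.randn(V, d_h) * 0.01    # hidden -> vocab scores

def rnn_step(h_prev, x_t):
    h_t = np.tanh(W_hh @ h_prev + W_hx @ x_t)   # hidden-state update
    scores = W_S @ h_t
    e = np.exp(scores - scores.max())
    y_hat = e / e.sum()                         # softmax over next-word classes
    return h_t, y_hat

h, y_hat = rnn_step(np.zeros(d_h), np.random.randn(d_x))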
the cost function is still cross-entropy, but the classes are the possible next words; since the labels come from the raw text itself, this is effectively unsupervised ML
for the overall evaluation we use 2^(cross-entropy), which is the perplexity
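Written out (T time steps, a V-way classification at each step; the base-2 form assumes the cross-entropy is measured in bits):

J = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{V} y_{t,j} \log_2 \hat{y}_{t,j}, \qquad \mathrm{Perplexity} = 2^{J}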
bidirectional RNN
there are two hidden states at each step: one updated from h_{t-1} (forward) and one updated from h_{t+1} (backward); ŷ_t is computed from the concatenation of the two
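In equations (arrow superscripts mark the forward and backward states; f is the non-linearity, g the output layer, [;] concatenation, all assumed notation):

\overrightarrow{h}_t = f(\overrightarrow{W} x_t + \overrightarrow{V} \overrightarrow{h}_{t-1} + \overrightarrow{b})
\overleftarrow{h}_t = f(\overleftarrow{W} x_t + \overleftarrow{V} \overleftarrow{h}_{t+1} + \overleftarrow{b})
\hat{y}_t = g(U [\overrightarrow{h}_t ; \overleftarrow{h}_t] + c)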