CPSC 422
Last 25% [Deterministic + Stochastic Hybrid]
CFG/PCFG
Probabilistic Context Free Grammars (PCFG)
Part 1
Used to [formulas]
Estimate prob. of sentences
Estimate prob. of parse trees
Goal: Assign a probability to parse trees and to sentences
Each grammar rule is augmented with a conditional probability
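A minimal sketch, using a made-up toy grammar and rule probabilities (not from the course), of how a PCFG assigns a probability to a parse tree: multiply the conditional probabilities of all the rules used in the tree.

```python
# Toy PCFG: each rule A -> rhs has a conditional probability P(rule | A).
# Grammar and probabilities below are invented for illustration.
rule_prob = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("people",)):  0.5,
    ("NP", ("fish",)):    0.5,
    ("VP", ("V", "NP")):  1.0,
    ("V",  ("eat",)):     1.0,
}

def tree_prob(tree):
    """Probability of a parse tree = product of the probabilities of every rule it uses.
    A tree is (label, [children]) for non-terminals or (label, word) for pre-terminals."""
    label, children = tree
    if isinstance(children, str):                   # pre-terminal -> word
        return rule_prob[(label, (children,))]
    rhs = tuple(child[0] for child in children)     # labels of the children
    p = rule_prob[(label, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

# P(tree) for "people eat fish": 1.0 * 0.5 * 1.0 * 1.0 * 0.5 = 0.25
t = ("S", [("NP", "people"), ("VP", [("V", "eat"), ("NP", "fish")])])
print(tree_prob(t))
```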
Part 2
CKY parsing (key steps)
Problem
Solution
Structural dependencies
Split non-terminal
Lexical dependencies
Add lexical dependencies to the scheme by making use of the notion of the head of a phrase
The head of an NP is its noun
The head of a VP is its verb
The head of a PP is its preposition
Vanilla PCFGs assume independence of non-terminal expansions. Not a valid assumption. There are structural and lexical dependencies
Is a rule system
Context free grammar [CFG]
Symbols/Examples (NP, V, PP, etc)
Define a formal language
Generative Formalism
Generates strings in the language
Rejects strings not in the language
Creates structures (trees) from sentences
Parsing
Key problem: ambiguity. Structural ambiguity
Find all trees that cover all and only the words in the input
Search strategies
Top-down or goal-directed
Bottom-up or data-directed
Top-down vs Bottom-up/Combining them
4-tuple (non-term., term., productions, start)
productions is a set of rules
Derivation: keep applying productions until only terminal symbols remain
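A minimal sketch, with an invented toy grammar, of the 4-tuple view (non-terminals, terminals, productions, start symbol) and of a derivation that keeps applying productions until only terminals remain.

```python
import random

# Toy CFG as a 4-tuple; the grammar itself is made up for illustration.
non_terminals = {"S", "NP", "VP", "V"}
terminals = {"people", "fish", "eat"}
productions = {
    "S":  [["NP", "VP"]],
    "NP": [["people"], ["fish"]],
    "VP": [["V", "NP"]],
    "V":  [["eat"]],
}
start = "S"

def derive(symbol):
    """Leftmost derivation: substitute non-terminals until only terminals remain."""
    if symbol in terminals:
        return [symbol]
    rhs = random.choice(productions[symbol])   # pick any production for this non-terminal
    out = []
    for s in rhs:
        out.extend(derive(s))
    return out

print(" ".join(derive(start)))   # e.g. "people eat fish"
```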
Treebanks
DEF. corpora in which each sentence has been paired with a parse tree
Heads in trees/Head finding
Acquiring Grammars/Probabilities
Grammar: read it off the parse trees
Markov Logic
Related to Markov Networks
Give each formula a weight
The more formulas in the KB a possible world satisfies, the more likely it should be
If a possible world satisfies a formula, its log probability should go up proportionally to the formula's weight.
Logical KB: A set of soft constraints on the set of possible worlds. When a world violates a formula, the world becomes less probable, not impossible
Markov Logic Network (MLN)
Part 1
Example
Together with a set C of constants
A set of pairs (F, w) where F is a formula in first order logic, w is a real number (weight)
It defines a Markov network with :
one feature/factor for each grounding (substitution of variables with constants) of each formula F in the MLN, with the corresponding weight w
one binary node for each grounding of each predicate in the MLN
Formula
Part 2
How MLNs generalize FOL
MLN/MAP inference
Computing Probabilities (Important)
Combines FOL with probability
FOL is certain about which worlds are possible; Markov logic assigns a probability to each possible world
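A minimal sketch of the MLN semantics with one invented weighted formula, Smokes(x) ⇒ Cancer(x): ground it over a set of constants, and make P(world) ∝ exp(w · n(world)), where n counts satisfied groundings, so worlds that violate groundings become less probable rather than impossible.

```python
import math
from itertools import product

# One weighted formula: Smokes(x) => Cancer(x), weight 1.5 (names and weight are made up).
constants = ["anna", "bob"]
w = 1.5

def n_satisfied(world):
    """Count groundings of Smokes(x) => Cancer(x) that the world satisfies.
    world maps ground atoms like ('Smokes', 'anna') to True/False."""
    return sum(1 for c in constants
               if (not world[("Smokes", c)]) or world[("Cancer", c)])

# One binary node per ground atom; enumerate all possible worlds.
atoms = [(p, c) for p in ("Smokes", "Cancer") for c in constants]
worlds = [dict(zip(atoms, vals)) for vals in product([False, True], repeat=len(atoms))]

# P(world) = exp(w * n(world)) / Z  -- a soft constraint, not a hard one.
scores = [math.exp(w * n_satisfied(wld)) for wld in worlds]
Z = sum(scores)
probs = [s / Z for s in scores]
print(max(probs), min(probs))
```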
Probabilistic Relational Model
Planning
Stochastic
Reinforcement Learning (RL)
Types
Q-Learning
Exploration strategies
ε-greedy
choose random action with probability ε and choose best action with probability 1 - ε
soft-max
\( \frac{e^{Q[s,a]/\tau}}{\sum_{a'} e^{Q[s,a']/\tau}} \)
if τ is high, each action has approx. the same prob. of being chosen (exploration)
as τ → 0, the best action is always chosen (exploitation)
Representation
<s0, a0, r1, s1, a1, r2, s2, a2, r3, ...>
<s, a, r, s’>
\( Q[s,a] \leftarrow Q[s,a] + \frac{1}{k}\big((r + \gamma \max_{a'} Q[s',a']) - Q[s,a]\big) \)
k = number of times that slot has been updated
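A minimal sketch of the tabular Q-learning update with ε-greedy exploration; the chain environment, rewards, and parameters are made up. A second table of visit counts supplies the 1/k learning rate.

```python
import random
from collections import defaultdict

GAMMA, EPSILON = 0.9, 0.1
Q = defaultdict(float)        # Q[s, a] table
visits = defaultdict(int)     # how many times each (s, a) slot has been updated
ACTIONS = ["left", "right"]

def step(s, a):
    """Made-up toy chain: states 0..4, reward 10 only when reaching state 4."""
    s2 = max(0, s - 1) if a == "left" else min(4, s + 1)
    return (10 if s2 == 4 else 0), s2

def epsilon_greedy(s):
    """Random action with probability epsilon, otherwise a best (greedy) action."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    best = max(Q[s, a] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[s, a] == best])  # break ties randomly

for episode in range(200):
    s = 0
    while s != 4:
        a = epsilon_greedy(s)
        r, s2 = step(s, a)
        visits[s, a] += 1
        k = visits[s, a]
        best_next = max(Q[s2, a2] for a2 in ACTIONS)      # max_{a'} Q[s', a']
        # Off-policy TD update: move Q[s,a] toward r + gamma * max_{a'} Q[s', a'].
        Q[s, a] += (1 / k) * (r + GAMMA * best_next - Q[s, a])
        s = s2

print(Q[0, "right"], Q[0, "left"])   # "right" should look better from state 0
```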
SARSA
e.g. act greedily 80% of the time, 20% randomly
Representation:
<s,a,r,s’,a’ > (Called an "experience")
\( Q[s,a] \leftarrow Q[s,a] + \frac{1}{k}\big((r + \gamma Q[s',a']) - Q[s,a]\big) \)
k = number of iterations
learns value of the policy being followed
Difference between Q-learning and SARSA:
Q-Learning learns its policy independently from its actions (off-policy learning)
SARSA learns its policy based on its actions (on-policy learner)
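A companion sketch of SARSA on the same kind of made-up chain environment: the only change from Q-learning is that the backup uses the action a' actually chosen by the behaviour policy (on-policy), not the max over a'.

```python
import random
from collections import defaultdict

GAMMA, EPSILON = 0.9, 0.2
Q = defaultdict(float)
visits = defaultdict(int)
ACTIONS = ["left", "right"]

def step(s, a):
    """Same made-up chain as the Q-learning sketch: reward 10 on reaching state 4."""
    s2 = max(0, s - 1) if a == "left" else min(4, s + 1)
    return (10 if s2 == 4 else 0), s2

def policy(s):
    """epsilon-greedy behaviour policy; SARSA learns the value of *this* policy."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    best = max(Q[s, a] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[s, a] == best])

for episode in range(200):
    s, a = 0, policy(0)
    while s != 4:
        r, s2 = step(s, a)
        a2 = policy(s2)                   # a' chosen by the same policy (on-policy)
        visits[s, a] += 1
        k = visits[s, a]
        # On-policy TD update over the experience <s, a, r, s', a'>.
        Q[s, a] += (1 / k) * (r + GAMMA * Q[s2, a2] - Q[s, a])
        s, a = s2, a2

print(Q[0, "right"], Q[0, "left"])
```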
Transition model is not known. RL agent LEARNS it.
Reward model is not known. RL agent LEARNS it.
RL learns an optimal policy
Difference between the Q/SARSA:
Q-learning learns the optimal policy, but because it does so without taking exploration into account, it doesn't do well while the agent is exploring
SARSA has better on-line performance (reward per episode), because it learns to stay away from the cliff while exploring
Q-value update
Average through time
Temporal difference learning/estimation
IMPORTANT SEE THIS SLIDE
By learning what to do in each state, rather than the complete policy as in search based methods, learning becomes linear rather than exponential.
Q-learning picks argMax(Q) because it doesn't know reward/transition model. Can't do weighted average
Need a Q[s,a] table for Q-values, and another Q[s,a]-shaped table to record how many times you've updated the corresponding cell in the Q-value table.
Markov Decision Process: MDP
Transition probabilities to next states: P(s' | s, a)
Reward function R(s) or R(s, a) or R(s, a, s')
MDPs compute an optimal policy
Value Iteration
Noisy ACTIONS: the agent has a probability distribution over what state it ends up in (the current state itself is fully observable)
<INCLUDE DATA STRUCTURES NEEDED (Tables)> Slide 9 Lec 3
Number of possible policies: \( |A|^{|S|} \)
|A| = Number of actions in each state
|S| = number of nonterminal states
\( \pi^*(s) = \mathrm{argmax}_a \sum_{s'} P(s'|s,a)\, V(s') \)
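A minimal sketch of value iteration on a made-up two-state MDP (states, actions, rewards, and transition probabilities are all invented): repeatedly apply the Bellman update, then read off π*(s) with the argmax expression above.

```python
# Made-up tiny MDP: states 0 and 1, actions "stay"/"go", known P(s'|s,a) and R(s).
STATES, ACTIONS, GAMMA = [0, 1], ["stay", "go"], 0.9
R = {0: 0.0, 1: 1.0}                                   # reward function R(s)
P = {                                                  # transition model P(s'|s,a)
    (0, "stay"): {0: 1.0, 1: 0.0},
    (0, "go"):   {0: 0.2, 1: 0.8},
    (1, "stay"): {0: 0.1, 1: 0.9},
    (1, "go"):   {0: 0.9, 1: 0.1},
}

V = {s: 0.0 for s in STATES}
for _ in range(100):                                   # iterate the Bellman update until (approx.) converged
    V = {s: R[s] + GAMMA * max(sum(P[s, a][s2] * V[s2] for s2 in STATES)
                               for a in ACTIONS)
         for s in STATES}

# Optimal policy: pi*(s) = argmax_a sum_{s'} P(s'|s,a) V(s')
policy = {s: max(ACTIONS, key=lambda a: sum(P[s, a][s2] * V[s2] for s2 in STATES))
          for s in STATES}
print(V, policy)
```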
Partially Observable MDP [POMDP]
E.g.: A sensor perceives the number of adjacent walls in a location with a 0.1 probability of error.
Fully specified by:
Dynamics (transition model): P(s' | a, s)
Sensor (observation) model: P( e | s )
Reward function R(s)
State belief update: \( b'(s') = \alpha\, P(e' \mid s') \sum_s P(s' \mid a, s)\, b(s) \)
Agent cannot tell for sure where it is in the state space; all it can have are beliefs based on probability.
Noisy observations/evidence
Look Ahead Search
Belief State:
Probability distribution over the set of possible STATES
HAS NOISY ACTIONS, NOISY SENSOR MODEL, REWARDS
(e.g. robot senses if there is a wall beside it, with some probability of error)
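A minimal sketch of the belief-state update above, \( b'(s') = \alpha P(e'|s') \sum_s P(s'|a,s) b(s) \), on a made-up two-state POMDP (transition and sensor probabilities are invented).

```python
# Made-up two-state POMDP pieces: transition model P(s'|a,s) and sensor model P(e|s).
STATES = ["left", "right"]
T = {("move", "left"):  {"left": 0.2, "right": 0.8},   # P(s' | a="move", s)
     ("move", "right"): {"left": 0.8, "right": 0.2}}
O = {"left":  {"wall": 0.9, "no_wall": 0.1},           # P(e | s), 0.1 chance of error
     "right": {"wall": 0.1, "no_wall": 0.9}}

def belief_update(b, a, e):
    """b'(s') = alpha * P(e|s') * sum_s P(s'|a,s) b(s)."""
    new_b = {}
    for s2 in STATES:
        predicted = sum(T[(a, s)][s2] * b[s] for s in STATES)   # prediction step
        new_b[s2] = O[s2][e] * predicted                        # weight by the observation
    alpha = 1.0 / sum(new_b.values())                           # normalize
    return {s2: alpha * p for s2, p in new_b.items()}

b = {"left": 0.5, "right": 0.5}
print(belief_update(b, "move", "wall"))
```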
MISC
Value of information: The utility of the network with an arc from X to D minus the utility of the network without the arc.
Value of control: The utility of the network when you make X a decision variable minus the utility of the network when X is a random variable.
Query
Stochastic
Belief Networks
Approximate inference techniques
Likelihood weighting
Monte Carlo Markov Chain (MCMC) - Gibbs Sampling
Markov Blanket
Hoeffding's inequality: LEARN FORMULA
Rejection Sampling
Forward Sampling
These sampling techniques give approximate inference in belief networks; variable elimination = exact inference. (Sketch below.)
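A minimal sketch of forward + rejection sampling on a made-up two-node network A → B (numbers and query are invented). Hoeffding's inequality describes the guarantee such estimates carry: with n samples, the estimate is within ε of the truth with probability at least \( 1 - 2e^{-2n\epsilon^2} \).

```python
import random

# Made-up two-node belief network: P(A), P(B|A). Query: P(A=true | B=true).
P_A = 0.3
P_B_given_A = {True: 0.9, False: 0.2}

def forward_sample():
    """Sample the network in topological order (forward sampling)."""
    a = random.random() < P_A
    b = random.random() < P_B_given_A[a]
    return a, b

def rejection_sample_query(n):
    """Estimate P(A=true | B=true): keep only samples consistent with the evidence B=true."""
    kept = [a for a, b in (forward_sample() for _ in range(n)) if b]
    return sum(kept) / len(kept) if kept else None

# Hoeffding-style guarantee: with n accepted samples, the estimate is within epsilon
# of the true value with probability at least 1 - 2*exp(-2*n*epsilon**2).
print(rejection_sample_query(100_000))   # exact answer is 0.27 / 0.41 ~= 0.659
```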
Probabilistic Graphical Models
CRF
Special case of Markov Networks where all \( X_i \) are observed
Formula looks like: \( \exp(w_0 + \sum^n_{i=1} w_i x_i) \)
Logistic Regression: naive markov net (CRF)
Linear chain CRF
Skip Chain CRF
Assume that you always observe a set of variables X = {X1…Xn} and you want to predict one or more variables Y = {Y1…Yk}.
A CRF is an undirected graphical model whose nodes correspond to X ∪ Y.
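A minimal sketch of logistic regression viewed as the simplest ("naive Markov") CRF: one target Y, observed features x, and P(Y=1 | x) obtained by normalizing \( \exp(w_0 + \sum_i w_i x_i) \). The weights are made up.

```python
import math

# Made-up weights for a tiny logistic regression / "naive Markov" model.
w0, w = 0.5, [1.2, -0.7, 0.3]

def p_y_given_x(x):
    """P(Y=1 | x) = exp(w0 + sum_i w_i x_i) / (1 + exp(...)), i.e. the sigmoid of the score."""
    score = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return math.exp(score) / (1.0 + math.exp(score))

print(p_y_given_x([1, 0, 1]))   # conditioned on the observed features only
```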
Markov Networks
Var. elimination process same as BNET
Gibbs Sampling
One factor for each maximal clique (Φ)
Factor representation: Φ
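A minimal sketch (invented factors) of the Markov-network joint: multiply one factor Φ per maximal clique and normalize by Z.

```python
from itertools import product

# Made-up pairwise Markov network over binary variables A - B - C (maximal cliques {A,B}, {B,C}).
phi_AB = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
phi_BC = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

def unnormalized(a, b, c):
    """Product of one factor per maximal clique."""
    return phi_AB[(a, b)] * phi_BC[(b, c)]

Z = sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))
print(unnormalized(1, 1, 1) / Z)    # P(A=1, B=1, C=1)
```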
HMM
Probabilistic temporal inferences
Filtering: compute the posterior distribution over the current state given all evidence to date: \( P(X_t | e_{0:t}) \)
Particle Filtering
Prediction: posterior distribution over a future state given all evidence to date \( P(X_{t+k+1} | e_{0:t}) \)
Most Likely Sequence of States (Viterbi)
Smoothing (forward-backward): compute the posterior distribution over a past state given all evidence to date: \( \alpha\, P(X_k | e_{0:k})\, P(e_{k+1:t} | X_k) \)
Starts with a Markov chain, and adds a noisy observation/evidence about the state at each step
Fully specified by:
Dynamics: \( P(X_{t+1} | X_t) \)
Sensor model: \( P(E_t|X_t) \)
Initial conditions: \( P(X_0) \)
NOISY OBSERVATIONS. THERE ARE NO ACTIONS IN AN HMM, ONLY STATES (Xt) AND EVIDENCE (e).
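A minimal sketch of filtering by the forward recursion, with a made-up two-state HMM (dynamics, sensor model, and initial conditions as in the specification above).

```python
# Made-up two-state HMM: "rain" / "sun", evidence "umbrella" / "no_umbrella".
STATES = ["rain", "sun"]
P0 = {"rain": 0.5, "sun": 0.5}                             # P(X_0)
T = {"rain": {"rain": 0.7, "sun": 0.3},                    # P(X_{t+1} | X_t)
     "sun":  {"rain": 0.3, "sun": 0.7}}
E = {"rain": {"umbrella": 0.9, "no_umbrella": 0.1},        # P(E_t | X_t)
     "sun":  {"umbrella": 0.2, "no_umbrella": 0.8}}

def filtering(evidence):
    """Forward recursion: predict with the dynamics, weight by the sensor model, normalize."""
    b = dict(P0)
    for e in evidence:
        b = {s2: E[s2][e] * sum(T[s][s2] * b[s] for s in STATES) for s2 in STATES}
        z = sum(b.values())
        b = {s: p / z for s, p in b.items()}
    return b

print(filtering(["umbrella", "umbrella", "no_umbrella"]))
```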
Deterministic
First Order Logic (FOL)
Whereas propositional logic assumes the world contains facts (e.g. up_switch2, up_switch3), FOL assumes the world contains
Relations (e.g. up(switch2))
Functions (e.g. connected_to(switch2, wire1))
Objects (e.g. switch2)
Logical operators
Equality (two terms refer to same object)
Quantifiers
Syntax
Resolution Procedure
LOGICAL CONSEQUENCE
Related to CNF
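A minimal sketch of the resolution procedure, propositional rather than full FOL for brevity: clauses are sets of literals, and entailment is shown by deriving the empty clause from the KB plus the negated query. The example KB is invented.

```python
from itertools import combinations

def resolve(c1, c2):
    """All resolvents of two clauses (sets of literals; 'p' and '-p' are complementary)."""
    out = []
    for lit in c1:
        comp = lit[1:] if lit.startswith("-") else "-" + lit
        if comp in c2:
            out.append(frozenset((c1 - {lit}) | (c2 - {comp})))
    return out

def entails(clauses, query):
    """Refutation: add the negated query and saturate; the empty clause means entailment."""
    neg = "-" + query if not query.startswith("-") else query[1:]
    clauses = set(clauses) | {frozenset([neg])}
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolve(c1, c2):
                if not r:
                    return True          # derived the empty clause
                new.add(r)
        if new <= clauses:
            return False                 # nothing new can be derived: no proof found
        clauses |= new

# KB in CNF: p, and p => q (clause {-p, q}).  Query: q.
kb = [frozenset(["p"]), frozenset(["-p", "q"])]
print(entails(kb, "q"))   # True: q is a logical consequence of the KB
```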
Ontologies
Lexical relations
Synonymy
Antonymy
Homonymy (same form/sound, different meaning)
Hyponymy (subclass)
In AI, an Ontology is a specification of what individuals and relationships are assumed to exist and what terminology is used for them
What types of individuals
What properties of the individuals
Semantic similarity/distance
Calculation should include
Gloss
Glosses can be extended to include those of related concepts, e.g. synonyms and hyponyms.
Number of matched words between two definitions
The distance between the two concepts in the underlying hierarchies / graphs
Methods
Distance: Path-length
\( sim_{path}(c_1, c_2) = \frac{1}{pathlen(c_1,c_2)} \)
Based on is-a/hypernyms hierarchies
Probability of a concept/sense and its info content
\( P(c) = \frac{count(c)}{N} \)
Prob. of word
\( IC(c) = -\log P(c) \)
Information Content
Concept Distance: info content
\( P(c) = \frac{\sum_{c_i \in subsenses(c)} count(c_i)}{N} \)
\( IC(c) = -\log P(c) \)
LCS(c1,c2)
Lowest Common Subsumer
\( sim_{resnik}(c_1,c_2) = -\log P(LCS(c_1, c_2)) \)
Include Jiang-Conrath distance
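A minimal sketch of the two measures above on a made-up is-a hierarchy with invented counts: path-length similarity and Resnik similarity (information content of the lowest common subsumer).

```python
import math

# Made-up is-a hierarchy (child -> parent) and made-up corpus counts per concept.
parent = {"nickel": "coin", "dime": "coin", "coin": "money", "money": "entity"}
counts = {"nickel": 5, "dime": 5, "coin": 20, "money": 50, "entity": 100}
N = sum(counts.values())

def ancestors(c):
    """c and everything above it in the is-a hierarchy, bottom-up."""
    chain = [c]
    while c in parent:
        c = parent[c]
        chain.append(c)
    return chain

def lcs_and_pathlen(c1, c2):
    """Lowest common subsumer and the number of is-a edges between c1 and c2 through it."""
    a1, a2 = ancestors(c1), ancestors(c2)
    lcs = next(c for c in a1 if c in a2)
    return lcs, a1.index(lcs) + a2.index(lcs)

def sim_path(c1, c2):
    _, d = lcs_and_pathlen(c1, c2)
    return 1.0 / d                      # sim_path = 1 / pathlen(c1, c2); assumes c1 != c2

def p_concept(c):
    """P(c): count of c plus all its subsenses, divided by N (as in the formula above)."""
    return sum(counts[x] for x in counts if c in ancestors(x)) / N

def sim_resnik(c1, c2):
    lcs, _ = lcs_and_pathlen(c1, c2)
    return -math.log(p_concept(lcs))    # IC of the lowest common subsumer

print(sim_path("nickel", "dime"), sim_resnik("nickel", "dime"))
```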
Databases containing all lexical relations among all words
Wordnet
Framenet
Semantic role labelling
Temporal Rep
Full Resolution
SAT