CPSC 422
Last 25% [Deterministic + Stochastic Hybrid]
CFG/PCFG
Probabilistic Context Free Grammars (PCFG)
Part 1
Used to [formulas]
Estimate prob. of sentences
Estimate prob. of parse trees
Goal: Assign a probability to parse trees and to sentences
Each grammar rule is augmented with a conditional probability
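A minimal sketch, using a made-up toy grammar and rule probabilities (not from the course), of how a PCFG assigns a probability to a parse tree: multiply the conditional probabilities of all the rules used in the tree.

```python
# Toy PCFG: each rule A -> rhs has a conditional probability P(rule | A).
# Grammar and probabilities below are invented for illustration.
rule_prob = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("people",)):  0.5,
    ("NP", ("fish",)):    0.5,
    ("VP", ("V", "NP")):  1.0,
    ("V",  ("eat",)):     1.0,
}

def tree_prob(tree):
    """Probability of a parse tree = product of the probabilities of every rule it uses.
    A tree is (label, [children]) for non-terminals or (label, word) for pre-terminals."""
    label, children = tree
    if isinstance(children, str):                   # pre-terminal -> word
        return rule_prob[(label, (children,))]
    rhs = tuple(child[0] for child in children)     # labels of the children
    p = rule_prob[(label, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

# P(tree) for "people eat fish": 1.0 * 0.5 * 1.0 * 1.0 * 0.5 = 0.25
t = ("S", [("NP", "people"), ("VP", [("V", "eat"), ("NP", "fish")])])
print(tree_prob(t))
```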
Part 2
CKY parsing (key steps)
Problem
Solution
Structural dependencies
Split non-terminal
Lexical dependencies
Add lexical dependencies to the scheme by making use of the notion of the head of a phrase
The head of an NP is its noun
The head of a VP is its verb
The head of a PP is its preposition
Vanilla PCFGs assume independence of non-terminal expansions. Not a valid assumption. There are structural and lexical dependencies
Is a rule system
Context free grammar [CFG]
Symbols/Examples (NP, V, PP, etc)
Define a formal language
Generative Formalism
Generates strings in the language
Rejects strings not in the language
Creates structures (trees) from sentences
Parsing
Key problem: ambiguity. Structural ambiguity
Find all trees that cover all and only the words in the input
Search strategies
Top-down or goal-directed
Bottom-up or data-directed
Top-down vs Bottom-up/Combining them
4-tuple (non-term., term., productions, start)
productions is a set of rules
Derivation: keep applying productions until only terminal symbols remain
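A minimal sketch, with an invented toy grammar, of the 4-tuple view (non-terminals, terminals, productions, start symbol) and of a derivation that keeps applying productions until only terminals remain.

```python
import random

# Toy CFG as a 4-tuple; the grammar itself is made up for illustration.
non_terminals = {"S", "NP", "VP", "V"}
terminals = {"people", "fish", "eat"}
productions = {
    "S":  [["NP", "VP"]],
    "NP": [["people"], ["fish"]],
    "VP": [["V", "NP"]],
    "V":  [["eat"]],
}
start = "S"

def derive(symbol):
    """Leftmost derivation: substitute non-terminals until only terminals remain."""
    if symbol in terminals:
        return [symbol]
    rhs = random.choice(productions[symbol])   # pick any production for this non-terminal
    out = []
    for s in rhs:
        out.extend(derive(s))
    return out

print(" ".join(derive(start)))   # e.g. "people eat fish"
```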
Treebanks
DEF. corpora in which each sentence has been paired with a parse tree
Heads in trees/Head finding
Acquiring Grammars/Probabilities
Grammar: read it off the parse trees
Markov Logic
Related to Markov Networks
Give each formula a weight
The more formulas in the KB a possible world satisfies, the more likely it should be
If a possible world satisfies a formula, its log probability should go up proportionally to the formula's weight.
Logical KB: A set of soft constraints on the set of possible worlds. When a world violates a formula, the world becomes less probable, not impossible
Markov Logic Network (MLN)
Part 1
Example
Together with a set C of constants
A set of pairs (F, w) where F is a formula in first order logic, w is a real number (weight)
It defines a Markov network with :
one feature/factor for each grounding (substitution of variables with constants) of each formula F in the MLN, with the corresponding weight w
one binary node for each grounding of each predicate in the MLN
Formula
Part 2
How MLNs generalize FOL
MLN/MAP inference
Computing Probabilities (Important)
Combines FOL with probability
FOL is certain about which worlds are possible; Markov logic assigns a probability to each possible world
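A minimal sketch of the MLN semantics with one invented weighted formula, Smokes(x) ⇒ Cancer(x): ground it over a set of constants, and make P(world) ∝ exp(w · n(world)), where n counts satisfied groundings, so worlds that violate groundings become less probable rather than impossible.

```python
import math
from itertools import product

# One weighted formula: Smokes(x) => Cancer(x), weight 1.5 (names and weight are made up).
constants = ["anna", "bob"]
w = 1.5

def n_satisfied(world):
    """Count groundings of Smokes(x) => Cancer(x) that the world satisfies.
    world maps ground atoms like ('Smokes', 'anna') to True/False."""
    return sum(1 for c in constants
               if (not world[("Smokes", c)]) or world[("Cancer", c)])

# One binary node per ground atom; enumerate all possible worlds.
atoms = [(p, c) for p in ("Smokes", "Cancer") for c in constants]
worlds = [dict(zip(atoms, vals)) for vals in product([False, True], repeat=len(atoms))]

# P(world) = exp(w * n(world)) / Z  -- a soft constraint, not a hard one.
scores = [math.exp(w * n_satisfied(wld)) for wld in worlds]
Z = sum(scores)
probs = [s / Z for s in scores]
print(max(probs), min(probs))
```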
Probabilistic Relational Model
Planning
Stochastic
Reinforcement Learning (RL)
Types
Q-Learning
Exploration strategies
ε-greedy
choose random action with probability ε and choose best action with probability 1 - ε
soft-max
\( \frac{e^{Q[s,a]/\tau}}{\sum_{a'} e^{Q[s,a']/\tau}} \)
if τ is high, each action has approx. the same prob. of being chosen (exploration)
as τ → 0, the best action is always chosen (exploitation)
Representation
<s0, a0, r1, s1, a1, r2, s2, a2, r3, ...>
<s, a, r, s’>
\( Q[s,a] \leftarrow Q[s,a] + \frac{1}{k}\big((r + \gamma \max_{a'} Q[s',a']) - Q[s,a]\big) \)
k = number of times that slot has been updated
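A minimal sketch of the tabular Q-learning update with ε-greedy exploration; the chain environment, rewards, and parameters are made up. A second table of visit counts supplies the 1/k learning rate.

```python
import random
from collections import defaultdict

GAMMA, EPSILON = 0.9, 0.1
Q = defaultdict(float)        # Q[s, a] table
visits = defaultdict(int)     # how many times each (s, a) slot has been updated
ACTIONS = ["left", "right"]

def step(s, a):
    """Made-up toy chain: states 0..4, reward 10 only when reaching state 4."""
    s2 = max(0, s - 1) if a == "left" else min(4, s + 1)
    return (10 if s2 == 4 else 0), s2

def epsilon_greedy(s):
    """Random action with probability epsilon, otherwise a best (greedy) action."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    best = max(Q[s, a] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[s, a] == best])  # break ties randomly

for episode in range(200):
    s = 0
    while s != 4:
        a = epsilon_greedy(s)
        r, s2 = step(s, a)
        visits[s, a] += 1
        k = visits[s, a]
        best_next = max(Q[s2, a2] for a2 in ACTIONS)      # max_{a'} Q[s', a']
        # Off-policy TD update: move Q[s,a] toward r + gamma * max_{a'} Q[s', a'].
        Q[s, a] += (1 / k) * (r + GAMMA * best_next - Q[s, a])
        s = s2

print(Q[0, "right"], Q[0, "left"])   # "right" should look better from state 0
```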
SARSA
e.g. act greedily 80% of the time, 20% randomly
Representation:
<s,a,r,s’,a’ > (Called an "experience")
\( Q[s,a] \leftarrow Q[s,a] + \frac{1}{k}\big((r + \gamma Q[s',a']) - Q[s,a]\big) \)
k = number of iterations
learns value of the policy being followed
Difference between Q-learning and SARSA:
Q-Learning learns its policy independently from its actions (off-policy learning)
SARSA learns its policy based on its actions (on-policy learner)
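A companion sketch of SARSA on the same kind of made-up chain environment: the only change from Q-learning is that the backup uses the action a' actually chosen by the behaviour policy (on-policy), not the max over a'.

```python
import random
from collections import defaultdict

GAMMA, EPSILON = 0.9, 0.2
Q = defaultdict(float)
visits = defaultdict(int)
ACTIONS = ["left", "right"]

def step(s, a):
    """Same made-up chain as the Q-learning sketch: reward 10 on reaching state 4."""
    s2 = max(0, s - 1) if a == "left" else min(4, s + 1)
    return (10 if s2 == 4 else 0), s2

def policy(s):
    """epsilon-greedy behaviour policy; SARSA learns the value of *this* policy."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    best = max(Q[s, a] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[s, a] == best])

for episode in range(200):
    s, a = 0, policy(0)
    while s != 4:
        r, s2 = step(s, a)
        a2 = policy(s2)                   # a' chosen by the same policy (on-policy)
        visits[s, a] += 1
        k = visits[s, a]
        # On-policy TD update over the experience <s, a, r, s', a'>.
        Q[s, a] += (1 / k) * (r + GAMMA * Q[s2, a2] - Q[s, a])
        s, a = s2, a2

print(Q[0, "right"], Q[0, "left"])
```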
Transition model is not known. RL agent LEARNS it.
Reward model is not known. RL agent LEARNS it.
RL learns an optimal policy
Difference between the Q/SARSA:
Q-learning learns the optimal policy, but because it does so without taking exploration into account, it doesn't do well while the agent is exploring
SARSA has better on-line performance (reward per episode), because it learns to stay away from the cliff while exploring
Q-value update
Average through time
Temporal difference learning/estimation
IMPORTANT SEE THIS SLIDE
By learning what to do in each state, rather than the complete policy as in search based methods, learning becomes linear rather than exponential.
Q-learning picks argMax(Q) because it doesn't know reward/transition model. Can't do weighted average
Need a Q[s,a] table for Q-values, and another Q[s,a]-shaped table to record how many times you've updated the corresponding cell in the Q-value table.
Markov Decision Process: MDP
Transition probabilities to next states: P(s' | s, a)
Reward function R(s) or R(s, a) or R(s, a, s')
MDPs compute an optimal policy
Value Iteration
Noisy ACTIONS: the agent has a probability distribution over what state it ends up in (the current state itself is fully observable)
<INCLUDE DATA STRUCTURES NEEDED (Tables)> Slide 9 Lec 3
Number of possible policies: \( |A|^{|S|} \)
|A| = Number of actions in each state
|S| = number of nonterminal states
\( \pi^*(s) = \mathrm{argmax}_a \sum_{s'} P(s'|s,a)\, V(s') \)
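A minimal sketch of value iteration on a made-up two-state MDP (states, actions, rewards, and transition probabilities are all invented): repeatedly apply the Bellman update, then read off π*(s) with the argmax expression above.

```python
# Made-up tiny MDP: states 0 and 1, actions "stay"/"go", known P(s'|s,a) and R(s).
STATES, ACTIONS, GAMMA = [0, 1], ["stay", "go"], 0.9
R = {0: 0.0, 1: 1.0}                                   # reward function R(s)
P = {                                                  # transition model P(s'|s,a)
    (0, "stay"): {0: 1.0, 1: 0.0},
    (0, "go"):   {0: 0.2, 1: 0.8},
    (1, "stay"): {0: 0.1, 1: 0.9},
    (1, "go"):   {0: 0.9, 1: 0.1},
}

V = {s: 0.0 for s in STATES}
for _ in range(100):                                   # iterate the Bellman update until (approx.) converged
    V = {s: R[s] + GAMMA * max(sum(P[s, a][s2] * V[s2] for s2 in STATES)
                               for a in ACTIONS)
         for s in STATES}

# Optimal policy: pi*(s) = argmax_a sum_{s'} P(s'|s,a) V(s')
policy = {s: max(ACTIONS, key=lambda a: sum(P[s, a][s2] * V[s2] for s2 in STATES))
          for s in STATES}
print(V, policy)
```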
Partially Observable MDP [POMDP]
E.g.: A sensor perceives the number of adjacent walls in a location with a 0.1 probability of error.
Fully specified by:
Dynamics (transition model): P(s' | a, s)
Sensor (observation) model: P( e | s )
Reward function R(s)
State belief update: \( b'(s') = \alpha\, P(e' \mid s') \sum_s P(s' \mid a, s)\, b(s) \)
Agent cannot tell for sure where it is in the state space; all it can have are beliefs based on probability.
Noisy observations/evidence
Look Ahead Search
Belief State:
Probability distribution over the set of possible STATES
HAS NOISY ACTIONS, NOISY SENSOR MODEL, REWARDS
(e.g. robot senses if there is a wall beside it, with some probability of error)
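A minimal sketch of the belief-state update above, \( b'(s') = \alpha P(e'|s') \sum_s P(s'|a,s) b(s) \), on a made-up two-state POMDP (transition and sensor probabilities are invented).

```python
# Made-up two-state POMDP pieces: transition model P(s'|a,s) and sensor model P(e|s).
STATES = ["left", "right"]
T = {("move", "left"):  {"left": 0.2, "right": 0.8},   # P(s' | a="move", s)
     ("move", "right"): {"left": 0.8, "right": 0.2}}
O = {"left":  {"wall": 0.9, "no_wall": 0.1},           # P(e | s), 0.1 chance of error
     "right": {"wall": 0.1, "no_wall": 0.9}}

def belief_update(b, a, e):
    """b'(s') = alpha * P(e|s') * sum_s P(s'|a,s) b(s)."""
    new_b = {}
    for s2 in STATES:
        predicted = sum(T[(a, s)][s2] * b[s] for s in STATES)   # prediction step
        new_b[s2] = O[s2][e] * predicted                        # weight by the observation
    alpha = 1.0 / sum(new_b.values())                           # normalize
    return {s2: alpha * p for s2, p in new_b.items()}

b = {"left": 0.5, "right": 0.5}
print(belief_update(b, "move", "wall"))
```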
MISC
Value of information: The utility of the network with an arc from X to D minus the utility of the network without the arc.
Value of control: The utility of the network when you make X a decision variable minus the utility of the network when X is a random variable.
Query
Stochastic
Belief Networks
Approximate inference techniques
Likelihood weighting
Monte Carlo Markov Chain (MCMC) - Gibbs Sampling
Markov Blanket
Hoeffding's inequality: LEARN FORMULA
Rejection Sampling
Forward Sampling
These sampling techniques give approximate inference in belief networks; variable elimination = exact inference. (Sketch below.)
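A minimal sketch of forward + rejection sampling on a made-up two-node network A → B (numbers and query are invented). Hoeffding's inequality describes the guarantee such estimates carry: with n samples, the estimate is within ε of the truth with probability at least \( 1 - 2e^{-2n\epsilon^2} \).

```python
import random

# Made-up two-node belief network: P(A), P(B|A). Query: P(A=true | B=true).
P_A = 0.3
P_B_given_A = {True: 0.9, False: 0.2}

def forward_sample():
    """Sample the network in topological order (forward sampling)."""
    a = random.random() < P_A
    b = random.random() < P_B_given_A[a]
    return a, b

def rejection_sample_query(n):
    """Estimate P(A=true | B=true): keep only samples consistent with the evidence B=true."""
    kept = [a for a, b in (forward_sample() for _ in range(n)) if b]
    return sum(kept) / len(kept) if kept else None

# Hoeffding-style guarantee: with n accepted samples, the estimate is within epsilon
# of the true value with probability at least 1 - 2*exp(-2*n*epsilon**2).
print(rejection_sample_query(100_000))   # exact answer is 0.27 / 0.41 ~= 0.659
```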
Probabilistic Graphical Models
CRF
Special case of Markov Networks where all \( X_i \) are observed
Formula looks like: \( \exp(w_0 + \sum^n_{i=1} w_i x_i) \)
Logistic Regression: naive markov net (CRF)
Linear chain CRF
Skip Chain CRF
Assume that you always observe a set of variables X = {X1…Xn} and you want to predict one or more variables Y = {Y1…Yk}.
A CRF is an undirected graphical model whose nodes correspond to X ∪ Y.
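A minimal sketch of logistic regression viewed as the simplest ("naive Markov") CRF: one target Y, observed features x, and P(Y=1 | x) obtained by normalizing \( \exp(w_0 + \sum_i w_i x_i) \). The weights are made up.

```python
import math

# Made-up weights for a tiny logistic regression / "naive Markov" model.
w0, w = 0.5, [1.2, -0.7, 0.3]

def p_y_given_x(x):
    """P(Y=1 | x) = exp(w0 + sum_i w_i x_i) / (1 + exp(...)), i.e. the sigmoid of the score."""
    score = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return math.exp(score) / (1.0 + math.exp(score))

print(p_y_given_x([1, 0, 1]))   # conditioned on the observed features only
```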
Markov Networks
Var. elimination process same as BNET
Gibbs Sampling
One factor for each maximal clique (Φ)
Factor representation: Φ
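A minimal sketch (invented factors) of the Markov-network joint: multiply one factor Φ per maximal clique and normalize by Z.

```python
from itertools import product

# Made-up pairwise Markov network over binary variables A - B - C (maximal cliques {A,B}, {B,C}).
phi_AB = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
phi_BC = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

def unnormalized(a, b, c):
    """Product of one factor per maximal clique."""
    return phi_AB[(a, b)] * phi_BC[(b, c)]

Z = sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))
print(unnormalized(1, 1, 1) / Z)    # P(A=1, B=1, C=1)
```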
HMM
Probabilistic temporal inferences
Filtering: compute the posterior distribution over the current state given all evidence to date: \( P(X_t | e_{0:t}) \)
Particle Filtering
Prediction: posterior distribution over a future state given all evidence to date \( P(X_{t+k+1} | e_{0:t}) \)
Most Likely Sequence of States (Viterbi)
Smoothing (forward-backward): compute the posterior distribution over a past state given all evidence to date: \( \alpha\, P(X_k | e_{0:k})\, P(e_{k+1:t} | X_k) \)
Starts with a Markov chain, and adds a noisy observation/evidence about the state at each step
Fully specified by:
Dynamics: \( P(X_{t+1} | X_t) \)
Sensor model: \( P(E_t|X_t) \)
Initial conditions: \( P(X_0) \)
NOISY OBSERVATIONS. THERE ARE NO ACTIONS IN AN HMM, ONLY STATES (Xt) AND EVIDENCE (e).
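A minimal sketch of filtering by the forward recursion, with a made-up two-state HMM (dynamics, sensor model, and initial conditions as in the specification above).

```python
# Made-up two-state HMM: "rain" / "sun", evidence "umbrella" / "no_umbrella".
STATES = ["rain", "sun"]
P0 = {"rain": 0.5, "sun": 0.5}                             # P(X_0)
T = {"rain": {"rain": 0.7, "sun": 0.3},                    # P(X_{t+1} | X_t)
     "sun":  {"rain": 0.3, "sun": 0.7}}
E = {"rain": {"umbrella": 0.9, "no_umbrella": 0.1},        # P(E_t | X_t)
     "sun":  {"umbrella": 0.2, "no_umbrella": 0.8}}

def filtering(evidence):
    """Forward recursion: predict with the dynamics, weight by the sensor model, normalize."""
    b = dict(P0)
    for e in evidence:
        b = {s2: E[s2][e] * sum(T[s][s2] * b[s] for s in STATES) for s2 in STATES}
        z = sum(b.values())
        b = {s: p / z for s, p in b.items()}
    return b

print(filtering(["umbrella", "umbrella", "no_umbrella"]))
```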
Deterministic
First Order Logic (FOL)
Whereas propositional logic assumes the world contains facts (e.g. up_switch2, up_switch3), FOL assumes the world contains
Relations (e.g. up(switch2))
Functions (e.g. connected_to(switch2, wire1))
Objects (e.g. switch2)
Logical operators
Equality (two terms refer to same object)
Quantifiers
Syntax
Resolution Procedure
LOGICAL CONSEQUENCE
Related to CNF
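A minimal sketch of the resolution procedure, propositional rather than full FOL for brevity: clauses are sets of literals, and entailment is shown by deriving the empty clause from the KB plus the negated query. The example KB is invented.

```python
from itertools import combinations

def resolve(c1, c2):
    """All resolvents of two clauses (sets of literals; 'p' and '-p' are complementary)."""
    out = []
    for lit in c1:
        comp = lit[1:] if lit.startswith("-") else "-" + lit
        if comp in c2:
            out.append(frozenset((c1 - {lit}) | (c2 - {comp})))
    return out

def entails(clauses, query):
    """Refutation: add the negated query and saturate; the empty clause means entailment."""
    neg = "-" + query if not query.startswith("-") else query[1:]
    clauses = set(clauses) | {frozenset([neg])}
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolve(c1, c2):
                if not r:
                    return True          # derived the empty clause
                new.add(r)
        if new <= clauses:
            return False                 # nothing new can be derived: no proof found
        clauses |= new

# KB in CNF: p, and p => q (clause {-p, q}).  Query: q.
kb = [frozenset(["p"]), frozenset(["-p", "q"])]
print(entails(kb, "q"))   # True: q is a logical consequence of the KB
```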
Ontologies
Lexical relations
Synonymy
Antonymy
Homonymy (same form/sound, different meaning)
Hyponymy (subclass)
In AI, an Ontology is a specification of what individuals and relationships are assumed to exist and what terminology is used for them
What types of individuals
What properties of the individuals
Semantic similarity/distance
Calculation should include
Gloss
Glosses can be extended to include those of related concepts, e.g. synonyms and hyponyms.
Number of matched words between two definitions
The distance between the two concepts in the underlying hierarchies / graphs
Methods
Distance: Path-length
\( sim_{path}(c_1, c_2) = \frac{1}{pathlen(c_1,c_2)} \)
Based on is-a/hypernyms hierarchies
Probability of a concept/sense and its info content
\( P(c) = \frac{count(c)}{N} \)
Prob. of word
\( IC(c) = -\log P(c) \)
Information Content
Concept Distance: info content
\( P(c) = \frac{\sum_{c_i \in subsenses(c)} count(c_i)}{N} \)
\( IC(c) = -\log P(c) \)
LCS(c1,c2)
Lowest Common Subsumer
\( sim_{resnik}(c_1,c_2) = -\log P(LCS(c_1, c_2)) \)
Include Jiang-Conrath distance
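A minimal sketch of the two measures above on a made-up is-a hierarchy with invented counts: path-length similarity and Resnik similarity (information content of the lowest common subsumer).

```python
import math

# Made-up is-a hierarchy (child -> parent) and made-up corpus counts per concept.
parent = {"nickel": "coin", "dime": "coin", "coin": "money", "money": "entity"}
counts = {"nickel": 5, "dime": 5, "coin": 20, "money": 50, "entity": 100}
N = sum(counts.values())

def ancestors(c):
    """c and everything above it in the is-a hierarchy, bottom-up."""
    chain = [c]
    while c in parent:
        c = parent[c]
        chain.append(c)
    return chain

def lcs_and_pathlen(c1, c2):
    """Lowest common subsumer and the number of is-a edges between c1 and c2 through it."""
    a1, a2 = ancestors(c1), ancestors(c2)
    lcs = next(c for c in a1 if c in a2)
    return lcs, a1.index(lcs) + a2.index(lcs)

def sim_path(c1, c2):
    _, d = lcs_and_pathlen(c1, c2)
    return 1.0 / d                      # sim_path = 1 / pathlen(c1, c2); assumes c1 != c2

def p_concept(c):
    """P(c): count of c plus all its subsenses, divided by N (as in the formula above)."""
    return sum(counts[x] for x in counts if c in ancestors(x)) / N

def sim_resnik(c1, c2):
    lcs, _ = lcs_and_pathlen(c1, c2)
    return -math.log(p_concept(lcs))    # IC of the lowest common subsumer

print(sim_path("nickel", "dime"), sim_resnik("nickel", "dime"))
```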
Databases containing all lexical relations among all words
Wordnet
Framenet
Semantic role labelling
Temporal Rep
Full Resolution
SAT