Artificial Intelligence and Games
Week 1: Introduction
Characteristics of Games
You must make decisions in real time
There are other agents who you do not control
The benefits (or costs) of decisions you make depend on what the other agents do
Non-cooperative Game Theory
It is every individual for themself
This implements Greedy AI
Strategy + Interactions with other Strategies = Games
Formal definition of Games
A number of agents interact with each other. These are known as players
At the end of their interactions there is an outcome that can be described numerically - each player receives a pay-off, which could be negative.
There are clearly defined rules as to which decisions each agent can make at which point in time. When all decisions have been made the outcome is uniquely determined.
Two basic assumptions
The numeric value associated with a particular outcome adequately reflects the worth of that outcome to all the players.
Each player is concerned with maximising their own outcome, without regard of the outcome for the other players.
(For this course unit, we assume that players are not allowed to pool their payoff and cooperate in affecting the outcome.)
A* Heuristic Search
Dijkstra's Algorithm
The goal of this algorithm is to find the shortest path from a source to all the other nodes in the graph (a short sketch follows this block)
Priority queue: implements a best first search, searching the nodes closest to the source first
Path representation: each node in the queue maintains properties of the current shortest path, its distance to the source and its parent
Relaxation: when a shorter path is found for a node in the priority queue, the distance and its parent is updated.
Properties
Completeness: If there is a path it will be found
Optimal: The algorithm finds the shortest path
Uninformed: Does not use any information one might have about the location of the target
Finding the shortest path: when a node is popped from the queue, the shortest path is found.
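A minimal Python sketch of the algorithm described above, assuming the graph is a dict mapping each node to a dict of {neighbour: edge weight}; the function and variable names are illustrative only.

```python
import heapq

def dijkstra(graph, source):
    """Minimal Dijkstra sketch. `graph` is assumed to be a dict mapping
    each node to a dict of {neighbour: edge_weight}."""
    dist = {source: 0}       # current shortest known distance to each node
    parent = {source: None}  # predecessor on the current shortest path
    queue = [(0, source)]    # priority queue ordered by distance (best-first)
    visited = set()
    while queue:
        d, node = heapq.heappop(queue)
        if node in visited:          # skip stale queue entries
            continue
        visited.add(node)            # popping a node finalises its shortest path
        for neighbour, weight in graph.get(node, {}).items():
            new_dist = d + weight
            if new_dist < dist.get(neighbour, float("inf")):
                dist[neighbour] = new_dist    # relaxation step
                parent[neighbour] = node
                heapq.heappush(queue, (new_dist, neighbour))
    return dist, parent
```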
The A* algorithm is an informed and efficient way of finding shortest paths on graphs. Heuristics (rules of thumb that are applied because they have been found to work by trial and error) are used to search game trees and build strong players
To take into account information about the location of a goal, we require a heuristic function: h(x) = estimated distance from node x to goal
Greedy Heuristic Search
Uses a heuristic approximation for the distance to the goal
Greedily searches, moving to the unvisited available state nearest to the goal, according to the heuristic
Properties
Not complete - may not find a path, even if it exists (can be made complete with backtracking)
Doesn't necessarily find the optimal path
Fast
Informed search - uses domain knowledge to produce and evaluate a good heuristic
A* combines the two algorithms
g(x): the distance from the source s to node x - Dijkstra's algorithm
h(x): the heuristic, an estimate of the distance from node x to the goal t - the Greedy heuristic
A*: nodes are searched in order of the sum f(x) = g(x) + h(x) - the total estimated distance from source to goal through node x (sketch below)
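A minimal A* sketch in the same style as the Dijkstra sketch above, assuming the same dict-of-dicts graph format and a user-supplied heuristic h; names are illustrative.

```python
import heapq

def a_star(graph, source, goal, h):
    """Minimal A* sketch: nodes are expanded in order of f(x) = g(x) + h(x).
    `h(x)` is a heuristic estimate of the distance from x to `goal`."""
    g = {source: 0}                 # g(x): best known distance from the source
    parent = {source: None}
    queue = [(h(source), source)]   # ordered by f(x) = g(x) + h(x)
    closed = set()
    while queue:
        _, node = heapq.heappop(queue)
        if node == goal:
            return g[node], parent  # optimal if h is admissible and monotonic
        if node in closed:
            continue
        closed.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            new_g = g[node] + weight
            if new_g < g.get(neighbour, float("inf")):
                g[neighbour] = new_g
                parent[neighbour] = node
                heapq.heappush(queue, (new_g + h(neighbour), neighbour))
    return None, parent             # no path exists
```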
Three properties make a good heuristic
Admissible: It must never overestimate the distance to the goal
Monotonic: Satisfies a triangle inequality
Informative: The closer h(x) is to the true distance to the goal from x, the more informative it is
(Possibly do more research on these)
Week 2: Representation of Games and Strategies
Definition 1: A game is given by
a finite set of players
a tree
for each node of the tree, a player who owns that node
for each edge of the tree, an action that labels it
for each player the corresponding information sets - sets of decision nodes for that player such that, for every two nodes in some information set for Player i, the actions taken by Player i to reach the nodes are identical and the actions labelling the edges out of the nodes are identical
for each leaf node and each player there is a pay-off
This representation of games as game trees is called extensive form representation, which is more important from the view of implementing games.
Handling chance
Elements of chance include dice, dealing cards
We have a new player called Chance or Nature
Each action they take has an associated probability
The player gets no payoff
Handling Imperfect Information
In many games, players do not know what other players know, hence they do not know where they are in the game tree.
Information sets
The set of nodes a player could be at given the information it has
Action taken must be the same for all nodes in the information set
Definition 2
A game is a 2-player game if there are two players (not counting Nature)
A game is of perfect information if all the information sets in the game tree have size 1 - all players know at all times what node they are in. Otherwise the game is of imperfect information.
A game is zero-sum if at every leaf of the game tree the sum of the pay-offs for all the players is 0.
A game is without chance if no node in the game tree is controlled by Nature, otherwise it is a game with chance
The consequence of this definition is that positions which would otherwise be considered the same are distinguished. Otherwise we would have a game graph, which loses information and assumes the player would always use the same strategy whenever the same position recurs.
Strategies
A strategy for Player i:
Definition 3: A fully specified strategy for Player i in a game is given by choosing for each of i's nodes an action allowed by the game tree such that for all nodes in the same information set the same action is chosen.
We get the number of fully specified strategies by: for each decision point of the player, count the number of choices and then multiply all the numbers gained. (check this out again)
You can overspecify a player's actions as many branches would never happen.
Definition 4: A pure strategy for Player i is given by a subtree of the game tree with the following properties:
the root of the game tree belongs to the subtree
whenever it is Player i's turn at a node that belongs to the subtree, exactly one of the available moves belongs to the subtree and for all nodes in the same information set for Player i, the same action is chosen
whenever it is not Player i's turn at a node that belongs to the subtree, all of the available moves belong to the subtree.
Representation of Games in Normal Form
Definition 5: A game in normal form is given by:
A finite list of players 1, ..., i
for each Player i, a list of valid strategies, numbered 1 to ni
for each player i, a payoff function pi
Instead of using a tree, a table of strategies is used. It is particularly suitable for simultaneous-play games, as they are imperfect information games.
Normal form is mathematically convenient, but computationally inefficient
It requires the entire game in the table: a graph can be explored without storing the entire graph in memory
Normal form requires more parameters to describe the game
Strategy spaces can be enormous
You can play a game by allowing the players to choose a strategy, play according to them and reading the results off a table.
To determine the outcome:
If there is no chance involved, then one follows the unique play that the strategies under consideration determine until a final position has been reached.
If there is chance involved, then there may be more than one possible play. We calculate the expected pay-off by, for each reachable leaf, multiplying the probabilities that appear along the path to it (giving the probability of reaching that leaf), multiplying this by the pay-off at the leaf, and adding up all of these probabilistic pay-offs (see the sketch below).
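A tiny sketch of the expected pay-off calculation just described; the paths, probabilities and pay-offs are made up purely for illustration.

```python
from math import prod

# Each entry: (chance probabilities along the path to a leaf, pay-off at that leaf).
# The numbers are invented for illustration only.
plays = [
    ([0.5],       3),
    ([0.5, 0.3], -1),
    ([0.5, 0.7],  2),
]

expected_payoff = sum(prod(probs) * payoff for probs, payoff in plays)
print(expected_payoff)  # 0.5*3 + 0.15*(-1) + 0.35*2 = 2.05
```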
Extensive Form vs Normal Form
Extensive form
game represented as a game tree: these are huge
You can create the game tree as you need it. This form is more compact and closer to actually playing a game.
The main disadvantage is that reasoning about strategies is more difficult, although it is possible to get proofs by simple induction on the height of the game tree.
Normal Form
game represented as a table of strategies: strategy spaces are huge
Advantages
decisions have been stratified, complexity has been subsumed, pay-offs are easy to compute, and the problem of choosing a strategy is simplified
Disadvantages
size, as computing is not always feasible, modelling this way does not make sense when learning about games, generating all strategies is expensive
It is worth trying to keep the description of a game as small as possible, so we can leave out equivalent possibilities
Measures of game size
Number of board positions which can occur in a game
Number of decision nodes in a game tree. >= item 1
Number of possible games = number of terminal nodes
Number of strategies. This is the sum of items 2 and 3 and roughly branching factor raised to the depth of the tree
With realistic sized games, we work with trees. Normal form is useful for theory and very small games.
Definition 6: In a 2-person, zero-sum game a strategy for a player
Is winning if, when following that strategy, the player always gets a pay-off greater than 0
Ensures a draw if, when following that strategy, the player always receives a pay-off greater than or equal to 0
Theorem 1.10: Consider a 2-person, zero-sum game of perfect information and without chance such that every play ends after a finite number of moves. Then one of the following is the case:
Player 1 has a winning strategy
Player 2 has a winning strategy
Players 1 and 2 both have strategies which ensure a draw
When such a strategy is found, the game is solved
Levels of Game Solution
Ultra-weak: Proving which player can force a win, or draw for either without providing a strategy
Weak: Provides the strategy whereby one player can win or either can draw, starting at the beginning of the game
Strong: Providing the strategy which produces perfect play from any point in the game, even if mistakes have been made early.
Infinity
Game trees can be infinite in 2 ways
There is at least one play that can be continued at any stage
There is at least one decision point where a player has an infinite number of choices
Week 3: Solving Games
Best Response
The best response is the strategy s*1 for Player 1 with the property that:
for all strategies s2, ..., sl for Players 2, ..., l respectively, and
for all strategies s1 for Player 1, it is the case that p1(s*1, s2, ..., sl) >= p1(s1, s2, ..., sl)
Proposition 2.1: If a player has a winning strategy in a 2-person zero-sum game with pay-offs in {-1, 0, 1}, then that is a best response to all other players' strategies.
The best response is a strategy that gives the highest pay-off
Existence of best response strategies
In a finite strategy space, there definitely is. Check all strategies and play (one of) the best one(s).
In an infinite strategy space, perhaps not (there is a counter-example in the notes).
Nash Equilibrium
Definition 7: For a game that has l players we say that (s*1, s*2, ..., s*l) is a Nash equilibrium point for the game if
Each strategy s*i is a strategy for Player i and
Each strategy is the best response to all the other strategies
If the players are playing a Nash equilibrium then no player has any incentive to change their strategy unilaterally, as their pay-off can only get worse (or stay the same).
Proposition 2.2: For a 2-player, zero-sum game with perfect information and no chance:
Solving the game means finding the Nash equilibrium
Each is playing best response to the other
Definition 8: Let a subgame of a given game be obtained by choosing any position in the game tree and considering it the root of a game tree, namely the one given by the part of the original tree which is below it.
Considering equilibria as solutions to non-zero sum games is problematic as they do not always constitute the optimal outcome for both players. This issue can be addressed by:
Looking at the systems arising from a number of agents playing such games and asking if there are stable states
Measuring success differently eg minimising regret
Allowing the agents playing the game to negotiate with each other
Properties of equilibria in 2-person zero-sum games
We can use formal notation such as a pay-off matrix or table indexed by (i, j), where the first index ranges over the rows (Player 1's strategies) and the second over the columns (Player 2's strategies)
Proposition 2.7: For a 2-person, zero-sum game in normal form the strategy pair (i,j) is an equilibrium point iff the corresponding payoff is maximal in its column and minimal in its row.
Proposition 2.8: Let (i*, j*) be an equilibrium point for a 2-person, zero-sum game in normal form with m strategies for Player 1 and n for Player 2, and a pay-off matrix. Given the condition in the notes, the game has an equilibrium point.
Corollary 2.9: All equilibrium points in a 2-person, zero-sum game lead to the same pay-off.
Definition 9: For a 2-person, zero-sum game the (unique) payoff at an equilibrium point is the value of the game
Proposition 2.10: Let (i,j) and (i',j') be equilibria for a 2-person, zero-sum game. Then (i,j') and (i',j) are also equilibria for that game.
Proposition 2.11: A 2-person, zero-sum game has an equilibrium point iff there exists a value v such that v is the highest pay-off Player 1 can guarantee and -v is the highest pay-off Player 2 can guarantee. v is the value of the game.
Proposition 2.12: Every 2-person, zero-sum game of perfect information in which every play is finite has at least one equilibrium point.
The Minimax Approach
Player 1: for each of their strategies, identifies the opponent strategy which gives the lowest pay-off to Player 1. Plays the strategy which maximises this.
Player 2: for each of their strategies, identifies the opponent strategy which gives the highest pay-off to Player 1. Plays the strategy which minimises this.
If we find the same pair of strategies, this strategy pair is a Nash Equilibrium
There is a more formal definition in the slides.
This is a worst-case analysis tool. The goal is to minimise the harm your opponent can do to you.
Dominance
Definition 12: We say that a strategy s for Player i is dominated by strategy s' for the same player if, for all choices of strategies by the other players, the pay-off for Player i when playing strategy s is less than or equal to that when playing strategy s'.
Strategies which are dominated can be removed.
In 2-person, zero-sum games, finding one equilibrium is sufficient to tell us how to play the game. Otherwise, equilibria can vary vastly, so finding one is unlikely to lead to good play.
Proposition 2.17: Let G' be a game that results from the game G by removing a pure strategy j for Player i which is dominated by some other strategy. Any equilibrium point for the game G' can be turned into an equilibrium point for the game G by setting the probability that Player i plays strategy j to 0.
In all other cases reducing the size is a preparatory step before applying the methods outlined. This may remove some equilibria.
Mixed Strategies
A mixed strategy is a strategy for a player in which a player
plays probabilistic combinations of pure strategies
receives a probabilistic combination of pay-offs
(Look in the notes for how to calculate the expected pay-off)
Normal form
Assign a probability qi to the pure strategy i
where qi is between 0 and 1, and the sum of the probabilities of all strategies is 1
Choose strategy i with the probability qi
Get the appropriate payoff with the probability qi
Extensive form
At each node where a player has a decision, assign a probability function to each possible choice.
Definition 11: For a game in extensive form a mixed strategy for Player i is given by a probability distribution over the available choices for each decision point belonging to i in such a way that all the decision points in the same information set have matching probabilities for all possible actions.
Mixed strategies are needed when there is hidden information.
Mixed strategies and Nash Equilibria
Probabilistic combinations of pure strategies
Nash equilibrium: all games with a finite number of players and finite numbers of moves have at least one Nash equilibrium (mixed or pure)
Finding Nash equilibria - general case
General sum games: Best algorithm seems to be the Lemke-Howson algorithm which can be exponential in time
Zero-sum games in normal form: Generally can be solved using linear programming
Common features of mixed strategy Nash equilibra
If one player plays its component of the Nash equilibrium, its expected pay-off is the same whatever the other player does.
The other player must also play its component of the Nash equilibrium; otherwise the first player could deviate and take advantage.
Mixed strategies solve the problem of non-existence of equilibrium points (There is a theorem and proposition for this.)
The drawback: this works well when both players know enough to play their optimal strategies; otherwise one player can exploit another's weakness.
While we can use the original definition of equilibrium, there is now a problem: as soon as one player has at least two strategies, there are infinitely many mixed strategies for at least one player, so how do we check for an equilibrium? Checking against the pure strategies suffices; there is a proposition and lemma to verify this.
Definition 10: For a game in normal form a mixed strategy for a player consists of a probability distribution for all the strategies available to that player
Any distribution of probabilities over all the outcomes that can be achieved with the one form of mixed strategy can be achieved with the other.
Key Ideas
Introduce probabilities over pure strategies
Set all the expected pay-offs against each of the opponent's pure strategies to be equal, to find the values of the probabilities (see the sketch after this list)
With 2-action, 2-player games, Nash equilibria can be found graphically.
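A small sketch of the indifference idea for a 2x2 zero-sum game: make each player indifferent between the opponent's pure strategies and solve. The matrix layout and the assumption that there is no pure-strategy saddle point are illustrative choices.

```python
def mixed_equilibrium_2x2(a, b, c, d):
    """Player 1's pay-off matrix is [[a, b], [c, d]] (rows = P1, cols = P2).
    Returns (p, q, value): P1 plays row 1 with prob p, P2 plays column 1 with
    prob q. Assumes no pure saddle point and a - b - c + d != 0."""
    # Make Player 2 indifferent between the columns:
    #   p*a + (1-p)*c = p*b + (1-p)*d
    p = (d - c) / (a - b - c + d)
    # Make Player 1 indifferent between the rows:
    #   q*a + (1-q)*b = q*c + (1-q)*d
    q = (d - b) / (a - b - c + d)
    value = p * (q * a + (1 - q) * b) + (1 - p) * (q * c + (1 - q) * d)
    return p, q, value

# Matching pennies: pay-offs [[1, -1], [-1, 1]] -> p = q = 0.5, value = 0
print(mixed_equilibrium_2x2(1, -1, -1, 1))
```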
Calculating Equilibrium
This is difficult and there is sometimes no algorithm. The following concerns normal form:
2-person, zero-sum
poly-time algorithm that finds all equilibria
can be reformulated as linear programming, and there are three different algorithms
2-person, general sum
Lemke-Howson finds one equilibrium point of a given game
worst case is exponential; the problem lies between P and NP
n-person games
more difficult and not much known, no known algorithm that works for all
alternatives to equilibria that are easier to calculate are
approximate equilibrium, which guarantees the pay-off for all players is at most a given small number below the pay-off at an actual equilibrium point
correlated equilibrium where players are allowed to correlate their choices with each other
Week 4: Solving Games. Minimax search. Solving Big Games
The minimax rule
There are two players: Player 1 is the MAX player and Player 2 is the MIN player
The value V(J) of a node J is
The pay-off at J if the node is terminal
The maximum value of its children if it is a MAX node
The minimum value of its children if it is a MIN node
Minimax search
This is a depth first search on the game tree
The graph is built in real time
The goal is to evaluate the children of the current node to find the best move (a minimal sketch follows this block)
When it is your turn to play:
Find the values of all child nodes of your current position
Move to the child of your current node with the best value
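A minimal recursive minimax sketch; the node interface (is_terminal(), payoff(), children()) is an assumption made for illustration, with payoff() giving the pay-off to the MAX player.

```python
def minimax(node, is_max_player):
    """Minimax value of a node (sketch). Assumes each node offers
    is_terminal(), payoff() and children() -> list of child nodes."""
    if node.is_terminal():
        return node.payoff()
    values = [minimax(child, not is_max_player) for child in node.children()]
    return max(values) if is_max_player else min(values)

def best_move(node):
    """Pick the child of a MAX node with the highest minimax value."""
    return max(node.children(), key=lambda child: minimax(child, False))
```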
Proposition 2.12 describes an algorithm that finds the value of a 2-person, zero-sum game by recursively determining the value of each subgame. We note the following:
There is no reason to restrict the algorithm to two players
If it is non-zero sum we have to run it separately for each player.
For each subtree this algorithm calculates the largest expected pay-off Player i can guarantee for themselves.
This algorithm does not apply to games with imperfect information. It can be adjusted to cope, but the result is less useful: if the uncertainty is the result of chance, take the expected pay-off; if it is the result of another player's move, take the worst case.
Removing the assumption of two-players
Treat all opponents as a single MIN player
This does not produce an equilibrium anymore as it assumes all opponents are working against P1, but it still produces a worst-case analysis for Player 1.
Removing the assumption of no chance
With a chance node v and children v1, ..., vn, the value of the chance node is the expectation over all its children: v = q1v1 + ... + qnvn (sketch below)
This is sometimes called the expectimax
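A one-line illustration of the chance-node value above; the parallel lists of child values and probabilities are assumed inputs.

```python
def chance_node_value(children_values, probabilities):
    """Expectimax value of a chance node: v = q1*v1 + ... + qn*vn."""
    return sum(q * v for q, v in zip(probabilities, children_values))

print(chance_node_value([3, -1], [0.5, 0.5]))  # 1.0
```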
Removing the assumption of perfect information
Uses a method called "Counterfactual Regret Minimisation" - beyond the scope of this course
Win-Loss Trees
(there is an image of this in the notes)
Value of a MAX node
WIN if any of its children are WIN nodes
LOSS if all of its children are LOSS nodes
The value of a MIN node is the opposite
Therefore if J is a MAX node, evaluate its children until either:
a child evaluates to WIN - then label J as WIN and leave the remaining children unevaluated, or
all children have been evaluated and they are all LOSS nodes - then label J as LOSS
And the inverse for MIN nodes
As this does not have to search the entire tree, it can be much faster than minimax
alpha-beta pruning
Each node contains a range [alpha, beta] where
alpha is the maximum lower bound of the value of the node
beta is the minimum upper bound of the value of the node
when beta <= alpha prune the subtree containing that node
Start with the range [-∞, ∞]
MAX nodes change alpha, MIN nodes change beta
The range is passed down the tree unchanged and a child can change the range of its parent
Note that alpha-beta pruning is particularly effective if the first child investigated is the best move for the Player who makes it, as it allows for pruning of the rest.
It is advantageous to try promising moves first, but it is unlikely that the game tree is small enough to search fully, so we use a heuristic that tries to guess the true value of a node (a pruning sketch follows this block)
In a 2-person, zero-sum game, we may use alpha-beta pruning to find equilibrium strategies, as well as the payoff at equilibrium points. If there is a winning strategy for one of the players, then this algorithm will find it.
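A sketch of minimax with alpha-beta pruning, using the same assumed node interface as the minimax sketch above.

```python
import math

def alpha_beta(node, is_max_player, alpha=-math.inf, beta=math.inf):
    """Minimax with alpha-beta pruning (sketch). alpha is the best value MAX
    can already guarantee, beta the best value MIN can already guarantee."""
    if node.is_terminal():
        return node.payoff()
    if is_max_player:
        value = -math.inf
        for child in node.children():
            value = max(value, alpha_beta(child, False, alpha, beta))
            alpha = max(alpha, value)   # MAX nodes raise alpha
            if beta <= alpha:
                break                   # prune: MIN will never allow this line
        return value
    else:
        value = math.inf
        for child in node.children():
            value = min(value, alpha_beta(child, True, alpha, beta))
            beta = min(beta, value)     # MIN nodes lower beta
            if beta <= alpha:
                break                   # prune
        return value
```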
Iterative deepening
searching to a predefined depth means not all moves can be explored, so the order of the search is vital
many programs carry out shallow searches, deepening the level one by one. This is combined with others to make use of the information gained.
if nothing else, we can use this information to decide the order in which moves are searched.
Modified alpha-beta search
We only consider moves that lead to a value of at least alpha on a MAX node and at most beta on a MIN node, to allow for more pruning.
If alpha is above beta, stop the search and report the value back.
This technique is known as aspiration search, and in the best case can allow us to search twice as deeply in the same amount of time.
(Re-read this section of the notes.)
Move ordering
We must first find good moves to fully exploit alpha-beta, but that is the point of alpha-beta, so we can use clues to speed up the process.
If we have previously searched positions, we have some idea of which moves lead to good positions
We can also have some idea of moves that are typically good, e.g. 'killer moves' which end the game.
We can also use Principal Variation Search (PVS), where everything is compared against the principal variation (the first move searched), as this leads to more pruning.
Selective Extension
Game playing programs search to a greater depth whenever
there is a reason to believe that the current value for a position is inaccurate
when the current line of play is particularly important
What to do in large games
Often the game tree is so large it is not possible to perform minimax search, because you cannot reach terminal nodes except towards the end.
We use an evaluation function (or heuristic) to approximate the value of the deepest nodes we can reach if they are not terminal nodes.
Evaluation functions
also known as board evaluation functions or heuristics
Introduce an evaluation function, which is an approximation to V(J). Use it just like the value.
Search the tree to a given depth or until terminal nodes are found.
Use the true pay-off U at terminal nodes, or the evaluation function otherwise, to evaluate the node.
Propagate the value up the minimax tree
Unlike A*, notions of admissible or monotonic do not apply
Approximate payoff reachable from this node
A more positive evaluation means Player 1 is more likely to win; a more negative one favours Player 2
How to find good heuristics
This requires knowledge of the game
You need to make sure the heuristic is correctly normalised against value returned at actual terminal nodes
Often trial and error is used: given two heuristics,
play them against each other multiple times
play them against other heuristics
the one that wins the most is likely better
It is sometimes necessary to combine heuristics and weight them
Method I: hand tweaking - often just by guesswork or by manual tweaking of weights
Method II: stochastic hillclimber - start with a set of weights and then repeat (sketch after this list)
perturb the weights slightly
compare the heuristic using the original weights to that using the perturbed weights by playing the two against each other in multiple games
choose the best of the two
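A sketch of the stochastic hillclimber under stated assumptions: play_match is a hypothetical helper that plays several games between two weight vectors and reports whether the first wins the majority, and the Gaussian step size and iteration count are arbitrary choices.

```python
import random

def stochastic_hillclimb(weights, play_match, iterations=100, step=0.05):
    """Method II sketch: repeatedly perturb the heuristic weights and keep
    whichever weight vector wins a head-to-head match."""
    for _ in range(iterations):
        # Perturb the current weights slightly.
        candidate = [w + random.gauss(0.0, step) for w in weights]
        # Keep the perturbed weights only if they beat the originals.
        if play_match(candidate, weights):
            weights = candidate
    return weights
```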
You can also learn through genetic algorithms and neural networks
If we limit the amount of time an evaluation function has, we force it to make a decision while ignoring possibly important information. Some programs allow a limited further search.
Game-playing Programs
We must implement a way of representing the current position and generating legal moves; this is combined with minimax search using alpha-beta pruning.
Typically you use alpha-beta pruning to search the game tree to a specified depth, use a heuristic to determine an approximate value for each node at that depth, and use that value as if it were the real one
Heuristics assign a score to a given position based on various criteria. For such a program to work well, the evaluation function must be as accurate as possible and the program must search the tree as deeply as possible
The limiting factor is speed, hence the processes of generating the game tree and running the evaluation function must be optimised. The internal representation and doing and undoing moves must also be fast
Hash tables are used to keep track of positions that have occurred in play, as well as bitboards. The advantages of bitboards are:
bit-wise operations are fast
bitboards required more than once only have to be computed once
several moves can be computed at the same time
The disadvantage of bitboards is that it is more complicated to turn a bitboard of possible moves into a list of moves
Further Issues
Only looking at a limited number of moves ahead leads to surprising moves/problems
Winning positions that don't lead to wins - the algorithm needs encouragement to force progress
The horizon effect - the program is happy with bad moves as long as their consequences lie beyond the current search horizon
This can be overcome by adding knowledge to the program, increasing the overall depth of the search or selectively searching deeper
Additional programming components - commercially available game-playing programs use libraries so that some moves can be read off rather than calculated, including opening libraries, well-known sequences and endgames
Week 5: Learning in Games I
When is learning preferred over minimax search?
when good heuristics cannot be found
when the branching factor is very high
when equilibria are hard to compute
games of incomplete information
general-sum games
games with many players
games with chance
Learning through self play
Two independent learning agents play games against each other; starting from naive play, they incrementally improve, forming an arms race
Each player in turn
Plays an action
Observes the new game position and any rewards
Strengthens moves leading to wins; suppresses those leading to losses
By playing many games, learns to become a strong player
This is hard because we learn while playing. Feedback from the outcome of the game reveals that you did something right or wrong, but not what, so you don't know which move was crucial to the outcome
These are characteristics of reinforcement learning
Reinforcement Learning
Learning from rewards
Observe the situation/state
Perform an action
Receive a reward, positive or negative
By doing this repeatedly, we learn to maximise the positive reward and minimise the negative
It is mainly used when the best response is not known, so they must be discovered through trial and error
The reinforcement learning problem
Learning where only the quality of the response or actions can be known but not what the correct actions are.
Alternatively, numerical rewards can be observed
The reinforcement information may be available only after a sequence of actions has been taken.
The environment may include opponents, who might also be learning
RL Concepts
Rewards: Denoted rt at time t. Can be positive, negative or 0 (win, lose, draw/no result)
State: The current position, board state, agent location etc., denoted st
Policy: What to do in any situation, i.e. what action to take in a state: pi(at, st), sometimes with a transition function Env(st+1 | st, at)
Reinforcement learning: Learn an effective policy simply by taking actions and observing rewards.
The terminology differs from game theory: game theory has strategy, pay-off, node/board position and heuristic, while RL has policy, reward, state and function approximation
The exploration-exploitation trade-off
Exploration: find new states or new actions which may lead to higher rewards
Exploitation: perform those actions which have worked in the past
We need learning algorithms that do both.
The epsilon-greedy policy
This balances exploration and exploitation: Most of the time you take the best decision (based on current knowledge) and occasionally take a random one
Epsilon is a small number which dictates the probability of choosing a random action
As the agent learns and improves, it can be beneficial to decrease epsilon over time
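A minimal epsilon-greedy selection sketch; q_values as a dict of current action-value estimates is an assumption made for illustration.

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """With probability epsilon take a random action (explore),
    otherwise take the action with the highest current estimate (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)                         # explore
    return max(actions, key=lambda a: q_values.get(a, 0.0))   # exploit
```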
Application to immediate reward RL
The agents can observe the current state st of the world at time t
It chooses an action at to perform
It receives a reward rt at time t, which is a function of the state and action
This is a "one-shot" game: the agent does not have to play further actions to receive the reward
(There is an example of the Tabular RL approach in the notes)
Time Dependent or Constant Learning Rate?
Constant: Use in a changing environment. A decreasing learning rate loses ability to respond to changes.
Decreasing learning rate: Use in an unchanging environment. Produces a more precise estimate of the average reward.
Tabular learning
The agent is represented with a big table of:
state-action pairs, Q(st, at) that are the expected reward for taking action at from state st at time t (Q-learning)
state values V(st), the expected reward of a state st assuming a policy for taking the action from that state (state-value learning)
(There is a learning algorithm in the notes)
The need for Q-function approximation
There are lots of states and it will take too long to learn. Solutions are:
A hand-coded representation of poker hands which uses a single representation for equivalent hands
Use a supervised learning function approximation, with the input being a representation of a hand and the action and the desired output being the expected reward, using a neural network, SVM, linear or polynomial regression, etc.
In supervised learning, there are examples which are inputs labelled with desired outputs. The goal is to learn to give appropriate outputs for new inputs.
Immediate reward RL
At time t the agent is in state st
It chooses and takes an action at from a set of actions
It then receives a reward rt
The goals are
To learn the expected rewards
To use that to find the action that maximises the expected reward from each state, by exploring and exploiting
Method I - Tabular approach: Maintain a table Q(s, a)
The current estimate of the expected reward for taking action a from state s
This requires that every state action pair be visited multiple times
The learning equation uses a constant learning rate, alpha between 0 and 1: Q(st, at) <- Q(st, at) + alpha[rt - Q(st, at)]
(There is an algorithm for this in the notes)
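A sketch of the tabular update equation above, assuming the Q-table is stored as a defaultdict keyed by (state, action); the names are illustrative.

```python
from collections import defaultdict

def update_immediate_reward(Q, state, action, reward, alpha=0.1):
    """Immediate-reward tabular update:
    Q(s, a) <- Q(s, a) + alpha * (r - Q(s, a))."""
    Q[(state, action)] += alpha * (reward - Q[(state, action)])

Q = defaultdict(float)   # table of state-action value estimates
update_immediate_reward(Q, "some_state", "some_action", reward=1.0)
```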
After training - in use
First we train the agent, and then we use it: observing the state, determining the possible actions and looking in the table for the action which produces the highest reward. This is a "greedy" policy.
Method II - Function approximation: A supervised regression model R(s, a|W)
Uses a set of learning parameters, i.e. weights W = (w1, ..., wa)
This can be generalised to unseen state-action pairs
Function approximator
Enter the current state, and each action (to find the best one)
Take the chosen action and observe the reward
Train the model to produce the reward as output for the given input
Gradient descent learning of the weights
One method to do an optimisation like this is gradient descent
It is an iterative, local improvement algorithm, like hill-climbing, but using information about the most improving direction.
Immediate reward RL is not applicable to game learning
In extensive form games, a sequence of actions is required to receive a reward, but values are required for all decision nodes.
We have to learn with delayed rewards.
Reinforcement learning for games
Instead of playing to the end of the game and backing up the results to the moves made, we can learn to predict expected future rewards using one step look ahead
The value of a state: current plus predicted future reward from that state using a given policy from now on
The value of a move from a state: current plus predicted future reward of making that move from that state, using a given policy from now on
The discounted future reward
Future reward: at time t, the future reward is rt + rt+1 + rt+2 + ...
Discounted future reward: rt + gamma*rt+1 + gamma^2*rt+2 + ... - the longer you wait for a reward, the less it is valued, by a factor of gamma per step (see the written notes)
Learning the future reward
Value of a non-terminal node will be the expected future reward, discounting how long it takes to get it. However, we do not know the future rewards
We estimate the value of the best next state (Q-learning) using our current estimate V(st). Early in learning this will be rubbish.
The future rewards depend on the policy we are using
We will assume a "greedy" policy. No exploration, pure exploitation. During learning we use an exploring strategy like epsilon greedy.
After learning is over and in use, we use a purely exploiting strategy; this is called an off-policy approach
Tabular Q-learning
The Q-table predicts the immediate reward plus the expected (discounted) future reward, assuming the best actions are taken from this point onwards.
The function should be Q(st, at) = immediate reward + estimate of future reward
Q(s, a) converges to
The expected future reward one gets by taking an action a from state s
discounted by the time to get it
assuming the best action is chosen from this point forward
Tabular Q-learning is proven to converge to the correct answer if every state is visited and every action is tried an infinite number of times, i.e. in the limit of learning (update sketch below)
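A sketch of one tabular Q-learning step with the discounted future-reward term, using the same assumed defaultdict Q-table as in the immediate-reward sketch above.

```python
def q_learning_update(Q, s, a, r, s_next, actions_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    Pass an empty `actions_next` for a terminal next state."""
    best_next = max((Q[(s_next, a2)] for a2 in actions_next), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```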
Application to games
Q-learning in a game setting works the same way except
When a player makes a move, the updated state is returned only after the opponent makes a move
When either player gets a reward, the opponent gets minus the reward, which is associated with the last move each made
We will have two learning agents playing games against each other many times and only after the learning phase is over can the agents play against real players
With function approximation
There will be too many nodes in a typical game tree, so we define a parameterised approximation function R(s, a|W), where W is a vector of parameters, then use gradient descent to minimise the prediction error
Problems
Q-learning can still be too slow. A heuristic speed up is often used.
This is called TD-lambda
TD-lambda Learning
TD(0) learning with a look-up table is provably correct, but very slow
Initially only one state which led immediately to a reward has its value updated. Only through many sequences does learning work its way backwards towards the initial states
Idea: when a reward is received, assign credit or blame to the actions taken
The most recent action gets a weight of 1, the one before a weight of lambda x gamma, the one before that a weight of (lambda x gamma)^2, and so on
Eligibility Traces
An efficient way of accounting for states in the learning space
Let e(n) denote the eligibility of node n. This is related to how recently in the sequence node n was visited. (Function in the written notes.)
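A sketch of a TD(lambda)-style state-value update with accumulating eligibility traces, assuming V and e are defaultdict tables over states; the exact form in the written notes may differ.

```python
from collections import defaultdict

def td_lambda_update(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One TD(lambda) step for state values with eligibility traces (sketch)."""
    delta = r + gamma * V[s_next] - V[s]   # TD error for the current transition
    e[s] += 1.0                            # mark the current state as eligible
    for state in list(e):
        V[state] += alpha * delta * e[state]  # credit/blame recently visited states
        e[state] *= gamma * lam               # traces decay by gamma * lambda

V, e = defaultdict(float), defaultdict(float)
td_lambda_update(V, e, s="s0", r=0.0, s_next="s1")
```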
Week 6: Learning in Games II
Value-based reinforcement learning for games
Learns the values of states or state action pairs
Learn to predict future rewards
This couples current decisions to future results
Play multiple games in self-play to learn the true rewards and true future values
The value of a state-action pair
Immediate reward case: the mean of the reward obtained at each state employing each action
Delayed reward case: expected future reward, or expected discounted future reward
Trajectory-based approaches
This doesn't search the entire tree
Instead the learner plays many games against a similar learner
Each game played is a trajectory through the game tree, and with each trajectory, the agent learns about the values of the nodes visited
With function approximation, the agent can also learn about values of nodes not visited if they are similar to nodes which were visited
Function approximation uses supervised learning
Q-learning
The Q-table or Q-function predicts the immediate reward plus the expected (discounted) future reward, assuming the best actions are taken from this point onwards
This uses the greedy policy; it is off-policy because a different policy is used while learning
TD or TD(0) learning
Q-learning is one of a wider class of algorithms called Temporal Difference (TD) learning
The goal is to predict future reward, given a particular policy for choosing subsequent actions
The policy is the choice of action from a given state, at = pi(st)
Other TD algorithms learn to predict the expected future reward using other policies, e.g. epsilon-greedy, giving an on-policy method
TD(0) for state estimates
It is common to learn the value of a state V(s), rather than a state-action pair
Predictor Vpi(st): expected future reward of being in state st assuming policy pi is used in all subsequent moves.
This approach is highly compatible with game learning
V(s) is a board evaluation function, which estimates future wins or losses
Given the position, search over all legal actions to find the one which results in the state s* with the highest value V(s*)
Deep Q-learning
This uses a deep neural network to approximate the Q-function. It is often a convolutional neural network which is very effective at image processing
Sometimes just any neural network is called Deep
CNNs learn local feature detectors, and are weight-constrained to learn the same feature at different parts of the image
Experience Replay
It is an assumption of ML, statistics etc that data is independent and stationary, but in games, subsequent actions are not independent and in video games, subsequent data frames are highly correlated, so learning is less effective.
To mitigate this
Keep a database of state, action, next state and reward
Use the Q-function to choose actions in real time
Do the learning non-sequentially by drawing batches from the database
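A minimal replay-buffer sketch of the scheme above; the class name, capacity and batch size are illustrative choices.

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (state, action, reward, next_state) transitions and sample
    decorrelated mini-batches for learning (experience replay sketch)."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest experience is discarded

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive frames.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```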
Monte Carlo Methods
In Monte Carlo learning, you learn while you play and in between moves
Monte Carlo board evaluation
This is a simple way to generate a heuristic function, often called playouts. To evaluate a board position v, do the following N times:
Play the game randomly from v to the end; the estimate is the average pay-off over the N playouts
This estimates the strength of the board position by the average payoff of the nodes accessible to it
The downside is the time cost of performing the N playouts per board position (sketch below)
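A sketch of Monte Carlo board evaluation by random playouts, reusing the assumed node interface from the minimax sketch earlier; payoff() is taken to be the pay-off to the player of interest.

```python
import random

def monte_carlo_evaluate(position, n_playouts=100):
    """Estimate the value of `position` by random playouts (sketch)."""
    total = 0.0
    for _ in range(n_playouts):
        node = position
        while not node.is_terminal():
            node = random.choice(node.children())  # random self-play to the end
        total += node.payoff()
    return total / n_playouts                      # average pay-off = heuristic value
```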
Monte Carlo Tree Search (MCTS)
A specific realisation used for Go and other games where no heuristic was available
Elements of MCTS
A game tree is built incrementally but asymmetrically
For each iteration, a tree policy is used to find the most urgent node to expand, balancing exploration and exploitation
A simulation is run from the selected node using a default policy
Update the search tree according to the result. This updates the value of the selected node and all its ancestors
Return the best child of the current root node
General approach to find the next move
The current board position is v0
While time remaining
Use Tree Policy(v0) to find the next node to expand vl
Use Default Policy(vl) to simulate from vl to a terminal node T
Backup (vl, T)
Return Move = BestChild(v0)
Four phases of MCTS
Selection: Starting at the root, child selection policy is recursively applied to traverse the game tree, looking for the most important expandable node. A node is expandable if it is non-terminal and has unvisited children.
Expansion: Increase the size of the current partial tree (typically by one node) by expanding nodes
Playout (simulation): Run a simulation from the added node by random self-play
Back-up (back propagation): The statistics of the added node(s) and its ancestors are updated
UCT - preferred selection method
An exploring strategy that always chooses unselected children (or less selected children)
In the long run, children with lower tree values will be chosen less and the best child more often
Each node v stores two quantities: Q(v) - the sum of all payoffs received and N(v) - the count of the number of times node visited
Q(v)/N(v) is an estimate of the value of the node
At parent node v, we would choose child v' which maximises UCT (formula in notes) which is the explore/exploit balance
Upper Confidence Bound (UCB)
The exploration-exploitation strategy is actually called UCB (formula in notes)
We choose the action with the largest UCB value
Ideally you want a strategy which explores a lot when little is known and stops exploring when the optimum is found
The best you can do is explore the best action/child exponentially more than the others.
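A sketch of UCB1-style child selection as used in UCT; each node is assumed to store Q (sum of pay-offs) and N (visit count), and the exploration constant c may differ from the formula in the notes.

```python
import math

def uct_select(parent, children, c=math.sqrt(2)):
    """Choose the child maximising the UCB1-style score
    Q/N + c * sqrt(ln(parent.N) / N), balancing exploitation and exploration."""
    def ucb(child):
        if child.N == 0:
            return math.inf                  # always try unvisited children first
        exploit = child.Q / child.N          # average pay-off so far
        explore = c * math.sqrt(math.log(parent.N) / child.N)
        return exploit + explore
    return max(children, key=ucb)
```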
Ending the search and making a move
When time runs out, the algorithm returns the move to be made. The possibilities are:
Max child (default): choose the child with the highest average reward
Robust child: Choose the child most often visited
Max-Robust Child: Choose the child maximal in both. If none exists, run longer.
Secure child: Choose a child which maximises a lower confidence interval
Comparison
TD learning needs to learn in self play before use and learns board value function or Q-table
MCTS learns while playing the game, uses explore-exploit (UCT) to select nodes to evaluate and uses random playout to evaluate them
Variation
Different methods can be used for the different phases
You could use heuristics for selection or rollouts, but they must be quick to calculate
Policy learning vs Value learning
We can use the value of state-action pair (Q-learning), value of state (V-learning) or policy learning, which learns the policy directly.
The advantages of policy learning are that we can observe the policies of expert players, that it is essential with an infinite state and/or action space, and that you can learn a probabilistic strategy in principle.