Artificial Intelligence and Games
Week 1: Introduction
Characteristics of Games
You must make decisions in real time
There are other agents who you do not control
The benefits (or costs) of decisions you make depend on what the other agents do
Non-cooperative Game Theory
It is every individual for themself
This implements Greedy AI
Strategy + Interactions with other Strategies = Games
Formal definition of Games
A number of agents interact with each other. These are known as players
At the end of their interactions there is an outcome that can be described numerically - each player receives a pay-off, which could be negative.
There are clearly defined rules as to which decisions each agent can make at which point in time. When all decisions have been made the outcome is uniquely determined.
Two basic assumptions
The numeric value associated with a particular outcome adequately reflects the worth of that outcome to all the players.
Each player is concerned with maximising their own outcome, without regard of the outcome for the other players.
(For this course unit, we assume that players are not allowed to pool their payoff and cooperate in affecting the outcome.)
A* Heuristic Search
Dijkstra's Algorithm
The goal of this algorithm is to find the shortest path from a source to all the other nodes in the graph (a short sketch follows this block)
Priority queue: implements a best first search, searching the nodes closest to the source first
Path representation: each node in the queue maintains properties of the current shortest path, its distance to the source and its parent
Relaxation: when a shorter path is found for a node in the priority queue, the distance and its parent is updated.
Properties
Completeness: If there is a path it will be found
Optimal: The algorithm finds the shortest path
Uninformed: Does not use any information one might have about the location of the target
Finding the shortest path: when a node is popped from the queue, the shortest path is found.
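A minimal Python sketch of the algorithm described above, assuming the graph is a dict mapping each node to a dict of {neighbour: edge weight}; the function and variable names are illustrative only.

```python
import heapq

def dijkstra(graph, source):
    """Minimal Dijkstra sketch. `graph` is assumed to be a dict mapping
    each node to a dict of {neighbour: edge_weight}."""
    dist = {source: 0}       # current shortest known distance to each node
    parent = {source: None}  # predecessor on the current shortest path
    queue = [(0, source)]    # priority queue ordered by distance (best-first)
    visited = set()
    while queue:
        d, node = heapq.heappop(queue)
        if node in visited:          # skip stale queue entries
            continue
        visited.add(node)            # popping a node finalises its shortest path
        for neighbour, weight in graph.get(node, {}).items():
            new_dist = d + weight
            if new_dist < dist.get(neighbour, float("inf")):
                dist[neighbour] = new_dist    # relaxation step
                parent[neighbour] = node
                heapq.heappush(queue, (new_dist, neighbour))
    return dist, parent
```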
The A* algorithm is an informed and efficient way of finding shortest paths on graphs. Heuristics (rules of thumb that are applied because they have been found to work by trial and error) are used to search game trees and build strong players
To take into account information about the location of a goal, we require a heuristic function: h(x) = estimated distance from node x to goal
Greedy Heuristic Search
Uses a heuristic approximation for the distance to the goal
Greedily searches, moving to the unvisited available state nearest to the goal, according to the heuristic
Properties
Not complete - may not find a path, even if it exists (can be made complete with backtracking)
Doesn't necessarily find the optimal path
Fast
Informed search - uses domain knowledge to produce and evaluate a good heuristic
A* combines the two algorithms
g(x): the distance from the source s to node x - Dijkstra's algorithm
h(x): the heuristic, an estimate of the distance from node x to the goal t - the Greedy heuristic
A*: nodes are searched in order of the sum f(x) = g(x) + h(x) - the total estimated distance from source to goal through node x (sketch below)
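A minimal A* sketch in the same style as the Dijkstra sketch above, assuming the same dict-of-dicts graph format and a user-supplied heuristic h; names are illustrative.

```python
import heapq

def a_star(graph, source, goal, h):
    """Minimal A* sketch: nodes are expanded in order of f(x) = g(x) + h(x).
    `h(x)` is a heuristic estimate of the distance from x to `goal`."""
    g = {source: 0}                 # g(x): best known distance from the source
    parent = {source: None}
    queue = [(h(source), source)]   # ordered by f(x) = g(x) + h(x)
    closed = set()
    while queue:
        _, node = heapq.heappop(queue)
        if node == goal:
            return g[node], parent  # optimal if h is admissible and monotonic
        if node in closed:
            continue
        closed.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            new_g = g[node] + weight
            if new_g < g.get(neighbour, float("inf")):
                g[neighbour] = new_g
                parent[neighbour] = node
                heapq.heappush(queue, (new_g + h(neighbour), neighbour))
    return None, parent             # no path exists
```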
Three properties make a good heuristic
Admissible: It must never overestimate the distance to the goal
Monotonic: Satisfies a triangle inequality
Informative: The closer h(x) is to the true distance to the goal from x, the more informative it is
(Possibly do more research on these)
Week 2: Representation of Games and Strategies
Definition 1: A game is given by
a finite set of players
a tree
for each node of the tree, a player who owns that node
for each edge of the tree, an action that labels it
for each player the corresponding information sets - sets of decision nodes for that player such that, for every two nodes in some information set for Player i, the actions taken by Player i to reach the nodes are identical and the actions labelling the edges out of the nodes are identical
for each leaf node and each player there is a pay-off
This representation of games as game trees is called extensive form representation, which is more important from the view of implementing games.
Handling chance
Elements of chance include dice, dealing cards
We have a new player called Chance or Nature
Each action they take has an associated probability
The player gets no payoff
Handling Imperfect Information
In many games, players do not know what other players know, hence they do not know where they are in the game tree.
Information sets
The set of nodes a player could be at given the information it has
Action taken must be the same for all nodes in the information set
Definition 2
A game is a 2-player game if there are two players (not counting Nature)
A game is of perfect information if all the information sets in the game tree have size 1 - all players know at all times what node they are in. Otherwise the game is of imperfect information.
A game is zero-sum if at every leaf of the game tree the sum of the pay-offs for all the players is 0.
A game is without chance if no node in the game tree is controlled by Nature, otherwise it is a game with chance
The consequence of this definition is that positions which would otherwise be considered the same are distinguished. Otherwise we would have a game graph, which loses information and assumes the player would always use the same strategy whenever the same position recurs.
Strategies
A strategy for Player i:
Definition 3: A fully specified strategy for Player i in a game is given by choosing for each of i's nodes an action allowed by the game tree such that for all nodes in the same information set the same action is chosen.
We get the number of fully specified strategies by: for each decision point of the player, count the number of choices and then multiply all the numbers gained. (check this out again)
You can overspecify a player's actions as many branches would never happen.
Definition 4: A pure strategy for Player i is given by a subtree of the game tree with the following properties:
the root of the game tree belongs to the subtree
whenever it is Player i's turn at a node that belongs to the subtree, exactly one of the available moves belongs to the subtree and for all nodes in the same information set for Player i, the same action is chosen
whenever it is not Player i's turn at a node that belongs to the subtree, all of the available moves belong to the subtree.
Representation of Games in Normal Form
Definition 5: A game in normal form is given by:
A finite list of players 1, ..., i
for each Player i, a list of valid strategies, numbered 1 to ni
for each player i, a payoff function pi
Instead of using a tree, a table of strategies is used. It is particularly suitable for simultaneous-play games, as they are imperfect information games.
Normal form is mathematically convenient, but computationally inefficient
It requires the entire game in the table: a graph can be explored without storing the entire graph in memory
Normal form requires more parameters to describe the game
Strategy spaces can be enormous
You can play a game by allowing the players to choose a strategy, play according to them and reading the results off a table.
To determine the outcome:
If there is no chance involved, then one follows the unique play that the strategies under consideration determine until a final position has been reached.
If there is chance involved, then there may be more than one possible play. We calculate the expected pay-off by, for each reachable leaf, multiplying the probabilities that appear along the path to it (giving the probability of reaching that leaf), multiplying this by the pay-off at the leaf, and adding up all of these probabilistic pay-offs (see the sketch below).
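A tiny sketch of the expected pay-off calculation just described; the paths, probabilities and pay-offs are made up purely for illustration.

```python
from math import prod

# Each entry: (chance probabilities along the path to a leaf, pay-off at that leaf).
# The numbers are invented for illustration only.
plays = [
    ([0.5],       3),
    ([0.5, 0.3], -1),
    ([0.5, 0.7],  2),
]

expected_payoff = sum(prod(probs) * payoff for probs, payoff in plays)
print(expected_payoff)  # 0.5*3 + 0.15*(-1) + 0.35*2 = 2.05
```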
Extensive Form vs Normal Form
Extensive form
game represented as a game tree: these are huge
You can create the game tree as you need it. This form is more compact and closer to actually playing a game.
The main disadvantage is that reasoning about strategies is more difficult, although it is possible to get proofs by simple induction on the height of the game tree.
Normal Form
game represented as a table of strategies: strategy spaces are huge
Advantages
decisions have been stratified, complexity has been subsumed, pay-offs are easy to compute, and the problem of choosing a strategy is simplified
Disadvantages
size, as computing is not always feasible, modelling this way does not make sense when learning about games, generating all strategies is expensive
It is worth trying to keep the description of a game as small as possible, so we can leave out equivalent possibilities
Measures of game size
Number of board positions which can occur in a game
Number of decision nodes in a game tree. >= item 1
Number of possible games = number of terminal nodes
Number of strategies. This is the sum of items 2 and 3 and roughly branching factor raised to the depth of the tree
With realistic sized games, we work with trees. Normal form is useful for theory and very small games.
Definition 6: In a 2-person, zero-sum game a strategy for a player
Is winning if, when following that strategy, the player always gets a pay-off greater than 0
Ensures a draw if, when following that strategy, the player always receives a pay-off greater than or equal to 0
Theorem 1.10: Consider a 2-person, zero-sum game of perfect information and without chance such that every play ends after a finite number of moves. Then one of the following is the case:
Player 1 has a winning strategy
Player 2 has a winning strategy
Players 1 and 2 both have strategies which ensure a draw
When such a strategy is found, the game is solved
Levels of Game Solution
Ultra-weak: Proving which player can force a win, or draw for either without providing a strategy
Weak: Provides the strategy whereby one player can win or either can draw, starting at the beginning of the game
Strong: Providing the strategy which produces perfect play from any point in the game, even if mistakes have been made early.
Infinity
Game trees can be infinite in 2 ways
There is at least one play that can be continued at any stage
There is at least one decision point where a player has an infinite number of choices
Week 3: Solving Games
Best Response
The best response is the strategy s*1 for Player 1 with the property that:
for all strategies s2, ..., sl for Players 2, ..., l respectively, and
for all strategies s1 for Player 1, it is the case that p1(s*1, s2, ..., sl) >= p1(s1, s2, ..., sl)
Proposition 2.1: If a player has a winning strategy in a 2-person zero-sum game with pay-offs in {-1, 0, 1}, then that is a best response to all other players' strategies.
The best response is a strategy that gives the highest pay-off
Existence of best response strategies
In a finite strategy space, there definitely is. Check all strategies and play (one of) the best one(s).
In an infinite strategy space, perhaps not (there is a counter-example in the notes).
Nash Equilibrium
Definition 7: For a game that has l players we say that (s*1, s*2, ..., s*l) is a Nash equilibrium point for the game if
Each strategy s*i is a strategy for Player i and
Each strategy is the best response to all the other strategies
If the players are playing a Nash equilibrium then no player has any incentive to change their strategy unilaterally, as their pay-off can only get worse (or stay the same).
Proposition 2.2: For a 2-player, zero-sum game with perfect information and no chance:
Solving the game means finding the Nash equilibrium
Each is playing best response to the other
Definition 8: Let a subgame of a given game be obtained by choosing any position in the game tree and considering it the root of a game tree, namely the one given by the part of the original tree which is below it.
Considering equilibria as solutions to non-zero sum games is problematic as they do not always constitute the optimal outcome for both players. This issue can be addressed by:
Looking at the systems arising from a number of agents playing such games and asking if there are stable states
Measuring success differently eg minimising regret
Allowing the agents playing the game to negotiate with each other
Properties of equilibria in 2-person zero-sum games
We can use formal notation such as a pay-off matrix or table indexed by (i, j), where the first index ranges over the rows (Player 1's strategies) and the second over the columns (Player 2's strategies)
Proposition 2.7: For a 2-person, zero-sum game in normal form the strategy pair (i,j) is an equilibrium point iff the corresponding payoff is maximal in its column and minimal in its row.
Proposition 2.8: Let (i*, j*) be an equilibrium point for a 2-person, zero-sum game in normal form with m strategies for Player 1 and n for Player 2, and a pay-off matrix. Given the condition in the notes, the game has an equilibrium point.
Corollary 2.9: All equilibrium points in a 2-person, zero-sum game lead to the same pay-off.
Definition 9: For a 2-person, zero-sum game the (unique) payoff at an equilibrium point is the value of the game
Proposition 2.10: Let (i,j) and (i',j') be equilibria for a 2-person, zero-sum game. Then (i,j') and (i',j) are also equilibria for that game.
Proposition 2.11: A 2-person, zero-sum game has an equilibrium point iff there exists a value v such that v is the highest pay-off Player 1 can guarantee and -v is the highest pay-off Player 2 can guarantee. v is the value of the game.
Proposition 2.12: Every 2-person, zero-sum game of perfect information in which every play is finite has at least one equilibrium point.
The Minimax Approach
Player 1: for each of their strategies, identifies the opponent strategy which gives the lowest pay-off to Player 1. Plays the strategy which maximises this.
Player 2: for each of their strategies, identifies the opponent strategy which gives the highest pay-off to Player 1. Plays the strategy which minimises this.
If we find the same pair of strategies, this strategy pair is a Nash Equilibrium
There is a more formal definition in the slides.
This is a worst-case analysis tool. The goal is to minimise the harm your opponent can do to you.
Dominance
Definition 12: We say that a strategy s for Player i is dominated by strategy s' for the same player if, for all choices of strategies by the other players, the pay-off for Player i when playing strategy s is less than or equal to that when playing strategy s'.
Strategies which are dominated can be removed.
In 2-person, zero-sum games, finding one equilibrium is sufficient to tell us how to play the game. Otherwise, equilibria can vary vastly, so finding one is unlikely to lead to good play.
Proposition 2.17: Let G' be a game that results from the game G by removing a pure strategy j for Player i which is dominated by some other strategy. Any equilibrium point for the game G' can be turned into an equilibrium point for the game G by setting the probability that Player i plays strategy j to 0.
In all other cases reducing the size is a preparatory step before applying the methods outlined. This may remove some equilibria.
Mixed Strategies
A mixed strategy is a strategy for a player in which a player
plays probabilistic combinations of pure strategies
receives a probabilistic combination of pay-offs
(Look in the notes for how to calculate the expected pay-off)
Normal form
Assign a probability qi to the pure strategy i
where qi is between 0 and 1, and the sum of the probabilities of all strategies is 1
Choose strategy i with the probability qi
Get the appropriate payoff with the probability qi
Extensive form
At each node where a player has a decision, assign a probability function to each possible choice.
Definition 11: For a game in extensive form a mixed strategy for Player i is given by a probability distribution over the available choices for each decision point belonging to i in such a way that all the decision points in the same information set have matching probabilities for all possible actions.
Mixed strategies are needed when there is hidden information.
Mixed strategies and Nash Equilibria
Probabilistic combinations of pure strategies
Nash equilibrium: all games with a finite number of players and finite numbers of moves have at least one Nash equilibrium (mixed or pure)
Finding Nash equilibria - general case
General sum games: Best algorithm seems to be the Lemke-Howson algorithm which can be exponential in time
Zero-sum games in normal form: Generally can be solved using linear programming
Common features of mixed strategy Nash equilibra
If one player plays its component of the Nash equilibrium, its expected pay-off is the same whatever the other player does.
The other player must also play its component of the Nash equilibrium; otherwise the first player could deviate and take advantage.
Mixed strategies solve the problem of non-existence of equilibrium points (There is a theorem and proposition for this.)
The drawback: this works well when both players know enough to play their optimal strategies; otherwise one player can exploit another's weakness.
While we can use the original definition of equilibrium, there is now a problem: as soon as one player has at least two strategies, there are infinitely many mixed strategies for at least one player, so how do we check for an equilibrium? Checking against the pure strategies suffices; there is a proposition and lemma to verify this.
Definition 10: For a game in normal form a mixed strategy for a player consists of a probability distribution for all the strategies available to that player
Any distribution of probabilities over all the outcomes that can be achieved with the one form of mixed strategy can be achieved with the other.
Key Ideas
Introduce probabilities over pure strategies
Set all the expected pay-offs against each of the opponent's pure strategies to be equal, to find the values of the probabilities (see the sketch after this list)
With 2-action, 2-player games, Nash equilibria can be found graphically.
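A small sketch of the indifference idea for a 2x2 zero-sum game: make each player indifferent between the opponent's pure strategies and solve. The matrix layout and the assumption that there is no pure-strategy saddle point are illustrative choices.

```python
def mixed_equilibrium_2x2(a, b, c, d):
    """Player 1's pay-off matrix is [[a, b], [c, d]] (rows = P1, cols = P2).
    Returns (p, q, value): P1 plays row 1 with prob p, P2 plays column 1 with
    prob q. Assumes no pure saddle point and a - b - c + d != 0."""
    # Make Player 2 indifferent between the columns:
    #   p*a + (1-p)*c = p*b + (1-p)*d
    p = (d - c) / (a - b - c + d)
    # Make Player 1 indifferent between the rows:
    #   q*a + (1-q)*b = q*c + (1-q)*d
    q = (d - b) / (a - b - c + d)
    value = p * (q * a + (1 - q) * b) + (1 - p) * (q * c + (1 - q) * d)
    return p, q, value

# Matching pennies: pay-offs [[1, -1], [-1, 1]] -> p = q = 0.5, value = 0
print(mixed_equilibrium_2x2(1, -1, -1, 1))
```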
Calculating Equilibrium
This is difficult and there is sometimes no algorithm. The following concerns normal form:
2-person, zero-sum
poly-time algorithm that finds all equilibria
can be reformulated as linear programming, and there are three different algorithms
2-person, general sum
Lemke-Howson finds one equilibrium point of a given game
worst case is exponential; the problem lies between P and NP
n-person games
more difficult and not much known, no known algorithm that works for all
alternatives to equilibria that are easier to calculate are
approximate equilibrium, which guarantees the pay-off for all players is at most a given small number below the pay-off at an actual equilibrium point
correlated equilibrium where players are allowed to correlate their choices with each other
Week 4: Solving Games. Minimax search. Solving Big Games
The minimax rule
There are two players: Player 1 is the MAX player and Player 2 is the MIN player
The value V(J) of a node J is
The pay-off at J if the node is terminal
The maximum value of its children if it is a MAX node
The minimum value of its children if it is a MIN node
Minimax search
This is a depth first search on the game tree
The graph is built in real time
The goal is to evaluate the children of the current node to find the best move (a minimal sketch follows this block)
When it is your turn to play:
Find the values of all child nodes of your current position
Move to the child of your current node with the best value
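A minimal recursive minimax sketch; the node interface (is_terminal(), payoff(), children()) is an assumption made for illustration, with payoff() giving the pay-off to the MAX player.

```python
def minimax(node, is_max_player):
    """Minimax value of a node (sketch). Assumes each node offers
    is_terminal(), payoff() and children() -> list of child nodes."""
    if node.is_terminal():
        return node.payoff()
    values = [minimax(child, not is_max_player) for child in node.children()]
    return max(values) if is_max_player else min(values)

def best_move(node):
    """Pick the child of a MAX node with the highest minimax value."""
    return max(node.children(), key=lambda child: minimax(child, False))
```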
Proposition 2.12 describes an algorithm that finds the value of a 2-person, zero-sum game by recursively determining the value of each subgame. We note the following:
There is no reason to restrict the algorithm to two players
If it is non-zero sum we have to run it separately for each player.
For each subtree this algorithm calculates the largest expected pay-off Player i can guarantee for themselves.
This algorithm does not apply to games with imperfect information. It can be adjusted to cope, but the result is less useful: if the uncertainty is the result of chance, take the expected pay-off; if it is the result of another player's move, take the worst case.
Removing the assumption of two-players
Treat all opponents as a single MIN player
This does not produce an equilibrium anymore as it assumes all opponents are working against P1, but it still produces a worst-case analysis for Player 1.
Removing the assumption of no chance
With a chance node v and children v1, ..., vn, the value of the chance node is the expectation over all its children: v = q1v1 + ... + qnvn (sketch below)
This is sometimes called the expectimax
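A one-line illustration of the chance-node value above; the parallel lists of child values and probabilities are assumed inputs.

```python
def chance_node_value(children_values, probabilities):
    """Expectimax value of a chance node: v = q1*v1 + ... + qn*vn."""
    return sum(q * v for q, v in zip(probabilities, children_values))

print(chance_node_value([3, -1], [0.5, 0.5]))  # 1.0
```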
Removing the assumption of perfect information
Uses a method called "Counterfactual Regret Minimisation" - beyond the scope of this course
Win-Loss Trees
(there is an image of this in the notes)
Value of a MAX node
WIN if any of its children are WIN nodes
LOSS if all of its children are LOSS nodes
The value of a MIN node is the opposite
Therefore if J is a MAX node, evaluate its children until either:
a child evaluates to WIN - then label J as WIN and leave the remaining children unevaluated, or
all children have been evaluated and they are all LOSS nodes - then label J as LOSS
And the inverse for MIN nodes
As this does not have to search the entire tree, it can be much faster than minimax
alpha-beta pruning
Each node contains a range [alpha, beta] where
alpha is the maximum lower bound of the value of the node
beta is the minimum upper bound of the value of the node
when beta <= alpha prune the subtree containing that node
Start with the range [-∞, ∞]
MAX nodes change alpha, MIN nodes change beta
The range is passed down the tree unchanged and a child can change the range of its parent
Note that alpha-beta pruning is particularly effective if the first child investigated is the best move for the Player who makes it, as it allows for pruning of the rest.
It is advantageous to try promising moves first, but it is unlikely that the game tree is small enough to search fully, so we use a heuristic that tries to guess the true value of a node (a pruning sketch follows this block)
In a 2-person, zero-sum game, we may use alpha-beta pruning to find equilibrium strategies, as well as the payoff at equilibrium points. If there is a winning strategy for one of the players, then this algorithm will find it.
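A sketch of minimax with alpha-beta pruning, using the same assumed node interface as the minimax sketch above.

```python
import math

def alpha_beta(node, is_max_player, alpha=-math.inf, beta=math.inf):
    """Minimax with alpha-beta pruning (sketch). alpha is the best value MAX
    can already guarantee, beta the best value MIN can already guarantee."""
    if node.is_terminal():
        return node.payoff()
    if is_max_player:
        value = -math.inf
        for child in node.children():
            value = max(value, alpha_beta(child, False, alpha, beta))
            alpha = max(alpha, value)   # MAX nodes raise alpha
            if beta <= alpha:
                break                   # prune: MIN will never allow this line
        return value
    else:
        value = math.inf
        for child in node.children():
            value = min(value, alpha_beta(child, True, alpha, beta))
            beta = min(beta, value)     # MIN nodes lower beta
            if beta <= alpha:
                break                   # prune
        return value
```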
Iterative deepening
searching to a predefined depth means not all moves can be explored, so the order of the search is vital
many programs carry out shallow searches, deepening the level one by one. This is combined with others to make use of the information gained.
if nothing else, we can use this information to decide the order in which moves are searched.
Modified alpha-beta search
We only consider moves that lead to a value of at least alpha on a MAX node and at most beta on a MIN node, to allow for more pruning.
If alpha is above beta, stop the search and report the value back.
This technique is known as aspiration search, and in the best case can allow us to search twice as deeply in the same amount of time.
(Re-read this section of the notes.)
Move ordering
We must first find good moves to fully exploit alpha-beta, but that is the point of alpha-beta, so we can use clues to speed up the process.
If we have previously searched positions, we have some idea of which moves lead to good positions
We can also have some idea of moves that are typically good, e.g. 'killer moves' which end the game.
We can also use Principal Variation Search (PVS), where everything is compared against the principal variation (the first move searched), as this leads to more pruning.
Selective Extension
Game playing programs search to a greater depth whenever
there is a reason to believe that the current value for a position is inaccurate
when the current line of play is particularly important
What to do in large games
Often the game tree is so large it is not possible to perform minimax search, because you cannot reach terminal nodes except towards the end.
We use an evaluation function (or heuristic) to approximate the value of the deepest nodes we can reach if they are not terminal nodes.
Evaluation functions
also known as board evaluation functions or heuristics
Introduce an evaluation function, which is an approximation to V(J). Use it just like the value.
Search the tree to a given depth or until terminal nodes are found.
Use the true pay-off U at terminal nodes, or the evaluation function otherwise, to evaluate the node.
Propagate the value up the minimax tree
Unlike A*, notions of admissible or monotonic do not apply
Approximate payoff reachable from this node
A more positive evaluation means Player 1 is more likely to win; a more negative one favours Player 2
How to find good heuristics
This requires knowledge of the game
You need to make sure the heuristic is correctly normalised against value returned at actual terminal nodes
Often trial and error is used: given two heuristics,
play them against each other multiple times
play them against other heuristics
the one that wins the most is likely better
It is sometimes necessary to combine heuristics and weight them
Method I: hand tweaking - often just by guesswork or by manual tweaking of weights
Method II: stochastic hillclimber - start with a set of weights and then repeat (sketch after this list)
perturb the weights slightly
compare the heuristic using the original weights to that using the perturbed weights by playing the two against each other in multiple games
choose the best of the two
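A sketch of the stochastic hillclimber under stated assumptions: play_match is a hypothetical helper that plays several games between two weight vectors and reports whether the first wins the majority, and the Gaussian step size and iteration count are arbitrary choices.

```python
import random

def stochastic_hillclimb(weights, play_match, iterations=100, step=0.05):
    """Method II sketch: repeatedly perturb the heuristic weights and keep
    whichever weight vector wins a head-to-head match."""
    for _ in range(iterations):
        # Perturb the current weights slightly.
        candidate = [w + random.gauss(0.0, step) for w in weights]
        # Keep the perturbed weights only if they beat the originals.
        if play_match(candidate, weights):
            weights = candidate
    return weights
```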
You can also learn through genetic algorithms and neural networks
If we limit the amount of time an evaluation function has, we force it to make a decision while ignoring possibly important information. Some programs allow a limited further search.
Game-playing Programs
We must implement a way of representing the current position and generating legal moves; this is combined with minimax search using alpha-beta pruning.
Typically you use alpha-beta pruning to search the game tree to a specified depth, use a heuristic to determine an approximate value for each node at that depth, and use that value as if it were the real one
Heuristics assign a score to a given position based on various criteria. For such a program to work well, the evaluation function must be as accurate as possible and the program must search the tree as deeply as possible
The limiting factor is speed, hence the processes of generating the game tree and running the evaluation function must be optimised. The internal representation and doing and undoing moves must also be fast
Hash tables are used to keep track of positions that have occurred in play, as well as bitboards. The advantages of bitboards are:
bit-wise operations are fast
bitboards required more than once only have to be computed once
several moves can be computed at the same time
The disadvantage of bitboards is that it is more complicated to turn a bitboard of possible moves into a list of moves
Further Issues
Only looking at a limited number of moves ahead leads to surprising moves/problems
Winning positions that don't lead to wins - the algorithm needs encouragement to force progress
The horizon effect - the program is happy with bad moves as long as their consequences lie beyond the current search horizon
This can be overcome by adding knowledge to the program, increasing the overall depth of the search or selectively searching deeper
Additional programming components - commercially available game-playing programs use libraries so that some moves can be read off rather than calculated, including opening libraries, well-known sequences and endgames
Week 5: Learning in Games I
When is learning preferred over minimax search?
when good heuristics cannot be found
when the branching factor is very high
when equilibria are hard to compute
games of incomplete information
general-sum games
games with many players
games with chance
Learning through self play
Two independent learning agents play games against each other; starting from naive play, they incrementally improve, forming an arms race
Each player in turn
Plays an action
Observes the new game position and any rewards
Strengthens moves leading to wins; suppresses those leading to losses
By playing many games, learns to become a strong player
This is hard because we learn while playing. Feedback from the outcome of the game reveals that you did something right or wrong, but not what, so you don't know which move was crucial to the outcome
These are characteristics of reinforcement learning
Reinforcement Learning
Learning from rewards
Observe the situation/state
Perform an action
Receive a reward, positive or negative
By doing this repeatedly, we learn to maximise the positive reward and minimise the negative
It is mainly used when the best response is not known, so they must be discovered through trial and error
The reinforcement learning problem
Learning where only the quality of the response or actions can be known but not what the correct actions are.
Alternatively, numerical rewards can be observed
The reinforcement information may be available only after a sequence of actions has been taken.
The environment may include opponents, who might also be learning
RL Concepts
Rewards: Denoted rt at time t. Can be positive, negative or 0 (win, lose, draw/no result)
State: The current position, board state, agent location etc., denoted st
Policy: What to do in any situation, i.e. what action to take in a state: pi(at, st), sometimes with a transition function Env(st+1 | st, at)
Reinforcement learning: Learn an effective policy simply by taking actions and observing rewards.
The terminology differs from game theory: game theory has strategy, pay-off, node/board position and heuristic, while RL has policy, reward, state and function approximation
The exploration-exploitation trade-off
Exploration: find new states or new actions which may lead to higher rewards
Exploitation: perform those actions which have worked in the past
We need learning algorithms that do both.
The epsilon-greedy policy
This balances exploration and exploitation: Most of the time you take the best decision (based on current knowledge) and occasionally take a random one
Epsilon is a small number which dictates the probability of choosing a random action
As the agent learns and improves, it can be beneficial to decrease epsilon over time
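A minimal epsilon-greedy selection sketch; q_values as a dict of current action-value estimates is an assumption made for illustration.

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """With probability epsilon take a random action (explore),
    otherwise take the action with the highest current estimate (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)                         # explore
    return max(actions, key=lambda a: q_values.get(a, 0.0))   # exploit
```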
Application to immediate reward RL
The agents can observe the current state st of the world at time t
It chooses an action at to perform
It receives a reward rt at time t, which is a function of the state and action
This is a "one-shot" game: the agent does not have to play further actions to receive the reward
(There is an example of the Tabular RL approach in the notes)
Time Dependent or Constant Learning Rate?
Constant: Use in a changing environment. A decreasing learning rate loses ability to respond to changes.
Decreasing learning rate: Use in an unchanging environment. Produces a more precise estimate of the average reward.
Tabular learning
The agent is represented with a big table of:
state-action pairs, Q(st, at) that are the expected reward for taking action at from state st at time t (Q-learning)
state values V(st), the expected reward of a state st assuming a policy for taking the action from that state (state-value learning)
(There is a learning algorithm in the notes)
The need for Q-function approximation
There are lots of states and it will take too long to learn. Solutions are:
A hand-coded representation of poker hands which uses a single representation for equivalent hands
Use a supervised learning function approximation, with the input being a representation of a hand and the action and the desired output being the expected reward, using a neural network, SVM, linear or polynomial regression, etc.
In supervised learning, there are examples which are inputs labelled with desired outputs. The goal is to learn to give appropriate outputs for new inputs.
Immediate reward RL
At time t the agent is in state st
It chooses and takes an action at from a set of actions
It then receives a reward rt
The goals are
To learn the expected rewards
To use that to find the action that maximises the expected reward from each state, by exploring and exploiting
Method I - Tabular approach: Maintain a table Q(s, a)
The current estimate of the expected reward for taking action a from state s
This requires that every state action pair be visited multiple times
The learning equation uses a constant learning rate, alpha between 0 and 1: Q(st, at) <- Q(st, at) + alpha[rt - Q(st, at)]
(There is an algorithm for this in the notes)
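A sketch of the tabular update equation above, assuming the Q-table is stored as a defaultdict keyed by (state, action); the names are illustrative.

```python
from collections import defaultdict

def update_immediate_reward(Q, state, action, reward, alpha=0.1):
    """Immediate-reward tabular update:
    Q(s, a) <- Q(s, a) + alpha * (r - Q(s, a))."""
    Q[(state, action)] += alpha * (reward - Q[(state, action)])

Q = defaultdict(float)   # table of state-action value estimates
update_immediate_reward(Q, "some_state", "some_action", reward=1.0)
```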
After training - in use
First we train the agent, and then we use it: observing the state, determining the possible actions and looking in the table for the action which produces the highest reward. This is a "greedy" policy.
Method II - Function approximation: A supervised regression model R(s, a|W)
Uses a set of learning parameters, i.e. weights W = (w1, ..., wa)
This can be generalised to unseen state-action pairs
Function approximator
Enter the current state, and each action (to find the best one)
Take the chosen action and observe the reward
Train the model to produce the reward as output for the given input
Gradient descent learning of the weights
One method to do an optimisation like this is gradient descent
It is an iterative, local improvement algorithm, like hill-climbing, but using information about the most improving direction.
Immediate reward RL is not applicable to game learning
In extensive form games, a sequence of actions is required to receive a reward, but values are required for all decision nodes.
We have to learn with delayed rewards.
Reinforcement learning for games
Instead of playing to the end of the game and backing up the results to the moves made, we can learn to predict expected future rewards using one step look ahead
The value of a state: current plus predicted future reward from that state using a given policy from now on
The value of a move from a state: current plus predicted future reward of making that move from that state, using a given policy from now on
The discounted future reward
Future reward: at time t, the future reward is rt + rt+1 + rt+2 + ...
Discounted future reward: rt + gamma*rt+1 + gamma^2*rt+2 + ... - the longer you wait for a reward, the less it is valued, by a factor of gamma per step (see the written notes)
Learning the future reward
Value of a non-terminal node will be the expected future reward, discounting how long it takes to get it. However, we do not know the future rewards
We estimate the value of the best next state (Q-learning) using our current estimate V(st). Early in learning this will be rubbish.
The future rewards depend on the policy we are using
We will assume a "greedy" policy. No exploration, pure exploitation. During learning we use an exploring strategy like epsilon greedy.
After learning is over and in use, we use a purely exploiting strategy; this is called an off-policy approach
Tabular Q-learning
The Q-table predicts the immediate reward plus the expected (discounted) future reward, assuming the best actions are taken from this point onwards.
The function should be Q(st, at) = immediate reward + estimate of future reward
Q(s, a) converges to
The expected future reward one gets by taking an action a from state s
discounted by the time to get it
assuming the best action is chosen from this point forward
Tabular Q-learning is proven to converge to the correct answer if every state is visited and every action is tried an infinite number of times, i.e. in the limit of learning (update sketch below)
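A sketch of one tabular Q-learning step with the discounted future-reward term, using the same assumed defaultdict Q-table as in the immediate-reward sketch above.

```python
def q_learning_update(Q, s, a, r, s_next, actions_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    Pass an empty `actions_next` for a terminal next state."""
    best_next = max((Q[(s_next, a2)] for a2 in actions_next), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```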
Application to games
Q-learning in a game setting works the same way except
When a player makes a move, the updated state is returned only after the opponent makes a move
When either player gets a reward, the opponent gets minus the reward, which is associated with the last move each made
We will have two learning agents playing games against each other many times and only after the learning phase is over can the agents play against real players
With function approximation
There will be too many nodes in a typical game tree, so we define a parameterised approximation function R(s, a|W), where W is a vector of parameters, then use gradient descent to minimise the prediction error
Problems
Q-learning can still be too slow. A heuristic speed up is often used.
This is called TD-lambda
TD-lambda Learning
TD(0) learning with a look-up table is provably correct, but very slow
Initially only one state which led immediately to a reward has its value updated. Only through many sequences does learning work its way backwards towards the initial states
Idea: when a reward is received, assign credit or blame to the actions taken
The most recent action gets a weight of 1, the one before a weight of lambda x gamma, the one before that a weight of (lambda x gamma)^2, and so on
Eligibility Traces
An efficient way of accounting for states in the learning space
Let e(n) denote the eligibility of node n. This is related to how recently in the sequence node n was visited. (Function in the written notes.)
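A sketch of a TD(lambda)-style state-value update with accumulating eligibility traces, assuming V and e are defaultdict tables over states; the exact form in the written notes may differ.

```python
from collections import defaultdict

def td_lambda_update(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One TD(lambda) step for state values with eligibility traces (sketch)."""
    delta = r + gamma * V[s_next] - V[s]   # TD error for the current transition
    e[s] += 1.0                            # mark the current state as eligible
    for state in list(e):
        V[state] += alpha * delta * e[state]  # credit/blame recently visited states
        e[state] *= gamma * lam               # traces decay by gamma * lambda

V, e = defaultdict(float), defaultdict(float)
td_lambda_update(V, e, s="s0", r=0.0, s_next="s1")
```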
Week 6: Learning in Games II
Value-based reinforcement learning for games
Learns the values of states or state action pairs
Learn to predict future rewards
This couples current decisions to future results
Play multiple games in self-play to learn the true rewards and true future values
The value of a state-action pair
Immediate reward case: the mean of the reward obtained at each state employing each action
Delayed reward case: expected future reward, or expected discounted future reward
Trajectory-based approaches
This doesn't search the entire tree
Instead the learner plays many games against a similar learner
Each game played is a trajectory through the game tree, and with each trajectory, the agent learns about the values of the nodes visited
With function approximation, the agent can also learn about values of nodes not visited if they are similar to nodes which were visited
Function approximation uses supervised learning
Q-learning
The Q-table or Q-function predicts the immediate reward plus the expected (discounted) future reward, assuming the best actions are taken from this point onwards
This uses the greedy policy; it is off-policy because a different policy is used while learning
TD or TD(0) learning
Q-learning is one of a wider class of algorithms called Temporal Difference (TD) learning
The goal is to predict future reward, given a particular policy for choosing subsequent actions
The policy is the choice of action from a given state, at = pi(st)
Other TD algorithms learn to predict the expected future reward using other policies, e.g. epsilon-greedy, giving an on-policy method
TD(0) for state estimates
It is common to learn the value of a state V(s), rather than a state-action pair
Predictor Vpi(st): expected future reward of being in state st assuming policy pi is used in all subsequent moves.
This approach is highly compatible with game learning
V(s) is a board evaluation function, which estimates future wins or losses
Given the position, search over all legal actions to find the one which results in the state s* with the highest value V(s*)
Deep Q-learning
This uses a deep neural network to approximate the Q-function. It is often a convolutional neural network which is very effective at image processing
Sometimes just any neural network is called Deep
CNNs learn local feature detectors, and are weight-constrained to learn the same feature at different parts of the image
Experience Replay
It is an assumption of ML, statistics etc that data is independent and stationary, but in games, subsequent actions are not independent and in video games, subsequent data frames are highly correlated, so learning is less effective.
To mitigate this
Keep a database of state, action, next state and reward
Use the Q-function to choose actions in real time
Do the learning non-sequentially by drawing batches from the database
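A minimal replay-buffer sketch of the scheme above; the class name, capacity and batch size are illustrative choices.

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (state, action, reward, next_state) transitions and sample
    decorrelated mini-batches for learning (experience replay sketch)."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest experience is discarded

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive frames.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```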
Monte Carlo Methods
In Monte Carlo learning, you learn while you play and in between moves
Monte Carlo board evaluation
This is a simple way to generate a heuristic function, often called playouts. To evaluate a board position v, do the following N times:
Play the game randomly from v to the end; the estimate is the average pay-off over the N playouts
This estimates the strength of the board position by the average payoff of the nodes accessible to it
The downside is the time cost of performing the N playouts per board position (sketch below)
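A sketch of Monte Carlo board evaluation by random playouts, reusing the assumed node interface from the minimax sketch earlier; payoff() is taken to be the pay-off to the player of interest.

```python
import random

def monte_carlo_evaluate(position, n_playouts=100):
    """Estimate the value of `position` by random playouts (sketch)."""
    total = 0.0
    for _ in range(n_playouts):
        node = position
        while not node.is_terminal():
            node = random.choice(node.children())  # random self-play to the end
        total += node.payoff()
    return total / n_playouts                      # average pay-off = heuristic value
```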
Monte Carlo Tree Search (MCTS)
A specific realisation used for Go and other games where no heuristic was available
Elements of MCTS
A game tree is built incrementally but asymmetrically
For each iteration, a tree policy is used to find the most urgent node to expand, balancing exploration and exploitation
A simulation is run from the selected node using a default policy
Update the search tree according to the result. This updates the value of the selected node and all its ancestors
Return the best child of the current root node
General approach to find the next move
The current board position is v0
While time remaining
Use Tree Policy(v0) to find the next node to expand vl
Use Default Policy(vl) to simulate from vl to a terminal node T
Backup (vl, T)
Return Move = BestChild(v0)
Four phases of MCTS
Selection: Starting at the root, child selection policy is recursively applied to traverse the game tree, looking for the most important expandable node. A node is expandable if it is non-terminal and has unvisited children.
Expansion: Increase the size of the current partial tree (typically by one node) by expanding nodes
Playout (simulation): Run a simulation from the added node by random self-play
Back-up (back propagation): The statistics of the added node(s) and its ancestors are updated
UCT - preferred selection method
An exploring strategy that always chooses unselected children (or less selected children)
In the long run, children with lower tree values will be chosen less and the best child more often
Each node v stores two quantities: Q(v) - the sum of all payoffs received and N(v) - the count of the number of times node visited
Q(v)/N(v) is an estimate of the value of the node
At parent node v, we would choose child v' which maximises UCT (formula in notes) which is the explore/exploit balance
Upper Confidence Bound (UCB)
The exploration-exploitation strategy is actually called UCB (formula in notes)
We choose the action with the largest UCB value
Ideally you want a strategy which explores a lot when little is known and stops exploring when the optimum is found
The best you can do is explore the best action/child exponentially more than the others.
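A sketch of UCB1-style child selection as used in UCT; each node is assumed to store Q (sum of pay-offs) and N (visit count), and the exploration constant c may differ from the formula in the notes.

```python
import math

def uct_select(parent, children, c=math.sqrt(2)):
    """Choose the child maximising the UCB1-style score
    Q/N + c * sqrt(ln(parent.N) / N), balancing exploitation and exploration."""
    def ucb(child):
        if child.N == 0:
            return math.inf                  # always try unvisited children first
        exploit = child.Q / child.N          # average pay-off so far
        explore = c * math.sqrt(math.log(parent.N) / child.N)
        return exploit + explore
    return max(children, key=ucb)
```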
Ending the search and making a move
When time runs out, the algorithm returns the move to be made. The possibilities are:
Max child (default): choose the child with the highest average reward
Robust child: Choose the child most often visited
Max-Robust Child: Choose the child maximal in both. If none exists, run longer.
Secure child: Choose a child which maximises a lower confidence interval
Comparison
TD learning needs to learn in self play before use and learns board value function or Q-table
MCTS learns while playing the game, uses explore-exploit (UCT) to select nodes to evaluate and uses random playout to evaluate them
Variation
Different methods can be used for the different phases
You could use heuristics for selection or rollouts, but they must be quick to calculate
Policy learning vs Value learning
We can use the value of state-action pair (Q-learning), value of state (V-learning) or policy learning, which learns the policy directly.
The advantages of policy learning are that we can observe the policies of expert players, that it is essential with an infinite state and/or action space, and that you can learn a probabilistic strategy in principle.