
John Newbury       28 October 2008


Cogitations on AI in the Game of Diplomacy: Q-Learning


Here is a brain dump of the full details of my current Cogitations about Q-learning. (Later Cogitations override earlier ones if contradictory.)

Contents

2008-10-27

Q-learning seems to be similar to part of what I had been thinking of doing regarding goal seeking, long before I knew of the term. In many ways it seems somewhat obvious and old-hat, although I appreciate there are subtleties when it comes to proofs of convergence, optimum tuning, etc., as well as many variants.

From here, say, a pithy definition of Q-learning is:

"Q-learning algorithms works by estimating the values of state-action pairs. The value Q(s,a) is defined to be the expected discounted sum of future payoffs obtained by taking action a from state s and following an optimal policy thereafter. Once these values have been learned, the optimal action from any state is the one with the highest Q-value."

Strictly, "value" and "payoff" here means gain in "utility" for the agent; "state" means "game-state" (presumably including higher order states that we have discussed in DipAi); "discounted" means a value less than that when finally realized. A Q-value, then, is just a heuristic value associated with a given action (or transition) from a given state; Q(s,a) being the Q-function that gives the Q-value for a given state and action – once learned it is just a fast look-up – at least in a static environment, which is all that normally seems to be considered.

A discount is presumably included to help the learning (exploration) phase find the shortest (cheapest) path. It is also potentially useful because a realized reward is better than a theoretical one that is based on a less-than-certain partial search due to heuristic pruning – the chance of error multiplying with each ply of such a search. (However, such pruning is not done in the toy examples I have seen, nor explicitly in the standard method.) Discounting may also be appropriate because reaching a goal earlier means actual gains can be "invested" for longer to make more gains: "resources" are released and/or produced earlier, making further gains possible by the time the goal would have been reached by a longer route. But in principle these are already covered, as Q-values apply to the total game-state. So Q-values seem much like minimax values in a game-tree search for chess, say. But such values are not normally discounted – perhaps they should be, to avoid "windy" (drawn-out) wins, and to allow for the accumulation of error at each ply due to heuristic pruning.
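
For reference, the standard one-step update that the above refers to is Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a)), where gamma is the discount. A minimal Python sketch, with invented parameter values, just to show where the discount enters and why it favours shorter routes:

from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9     # learning rate and discount factor (illustrative values)
Q = defaultdict(float)      # Q[(state, action)] -> estimated discounted return

def q_update(state, action, reward, next_state, next_actions):
    # Standard one-step Q-learning backup. Because GAMMA < 1, a reward
    # reached in fewer steps backs up to a larger Q-value than the same
    # reward reached by a longer ("windy") route.
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])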

----------

The standard Q-learning method seems to be more designed for, say, a robot physically searching for a goal (thereby speeding future goal finding), or for a mental search where the location of the goal(s) in game-space is initially unknown. Also, it seems to be designed for a static environment – I am not sure how well it performs, or which proofs, such as convergence, still apply if the environment is changing during or after the searches (as would apply in a Diplomacy multi-turn plan). But the (physical or logical) locations of goals in Diplomacy, as I see them, would tend also to be known, so a search could equally easily proceed from goal to resource. (Unlike Q-learning, I would distinguish agents from resources, since in general an agent may coordinate many resources, such as units and provinces, towards physical goals; and it is difficult to conceive of an agent itself moving through an abstract space, such as "negotiation-space", whereas it may move its resources between different possible agreements.)

I had intended to simply diffuse a gradually reducing "influence" (like a "pheromone trail") from all resources and goals until there was at least a significant overlap. (BBB already uses similar logic to build a table of distances between any pair of provinces or sites (possible locations for each type of unit).) (Parallel goal searching from both ends is a standard AI method established long ago, probably in GPS: General Problem Solver, not Global Position Satellite! This halves the depth and, for large enough b, vastly reduces the tree search, approaching a reduction from b^d to 2*b^(d/2) nodes, where b is the branching factor and d is the depth. However, where the number of nodes is well constrained, as in a search of Diplomacy provinces or sites, a full search is feasible. But some abstract search goal-spaces could be more open-ended, e.g. a goal of finding the mutually optimum SC distribution and a plan for achieving it within a given alliance.) Anyway, when an agent controls a resource (typically a unit or province in a physical plan) in a region where one end can "smell" the other, the optimum route from one to the other can be found trivially if the gradients are appropriate; at the very least it should be trivial to ensure they are monotonic, so that some solution is found (even if not done exactly as in Q-learning). However, since resources and goals generally have a degree of persistence in their location from one turn to another, such pheromone trails could usefully persist too, perhaps exponentially decaying, but being periodically refreshed with more accurate and up-to-date values. (An unexpectedly troublesome unit or valuable site, or a good plan for an alliance, would tend to be so next turn too, or from time to time as units advance and retreat.)
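
A minimal sketch of the kind of diffusion I have in mind, assuming a simple undirected adjacency map of provinces (the map fragment, decay factor and cut-off are invented for illustration; this is not BBB's actual distance-table code):

from collections import deque

# Hypothetical fragment of a province adjacency map (illustrative only;
# sea routes and coasts are ignored here).
ADJ = {
    "PAR": ["BUR", "PIC", "GAS", "BRE"],
    "BUR": ["PAR", "PIC", "BEL", "RUH", "MUN", "MAR", "GAS"],
    "PIC": ["PAR", "BUR", "BEL", "BRE"],
    "GAS": ["PAR", "BUR", "MAR", "BRE", "SPA"],
    "BRE": ["PAR", "PIC", "GAS"],
    "BEL": ["PIC", "BUR", "RUH"],
    "RUH": ["BUR", "BEL", "MUN"],
    "MUN": ["BUR", "RUH"],
    "MAR": ["BUR", "GAS", "SPA"],
    "SPA": ["GAS", "MAR"],
}

def diffuse(source, strength=1.0, decay=0.5, cutoff=0.01):
    # Spread a gradually reducing "influence" (pheromone) outward from a
    # goal or resource: each step away multiplies the strength by `decay`,
    # and diffusion stops once the contribution falls below `cutoff`.
    level = {source: strength}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        passed = level[node] * decay
        if passed < cutoff:
            continue
        for neighbour in ADJ[node]:
            if passed > level.get(neighbour, 0.0):
                level[neighbour] = passed
                frontier.append(neighbour)
    return level

goal_smell = diffuse("MUN")        # pheromone spread from a goal
resource_smell = diffuse("BRE")    # pheromone spread from a resource (a unit, say)
# Where both smells overlap, a route can be found by walking uphill on
# the product of the two gradients.
overlap = {p: goal_smell[p] * resource_smell[p]
           for p in goal_smell if p in resource_smell}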

Unlike standard Q-learning, in a complex game with a dynamic environment, like Diplomacy, I think the presence of multiple routes should tend to increase a Q-value, since such redundancy reduces the chance of opponents being able to block it, or of failure due to a mental block of some heuristic. (Assuming a fixed probability that any given route is later blocked, so that a backup is needed, the value of each additional backup route should decay exponentially, the total therefore approaching a limit. Searching for additional routes should terminate when only a negligible total extra value could be added.) Also, the optimum route to a goal in a complex environment like Diplomacy would generally be heuristic, not certain as seems to be assumed in standard Q-learning. Perhaps the expected optimum route should follow the highest ridge defined by the products of resource and goal pheromone strengths.
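
A rough worked example of that cut-off, assuming, purely for illustration, that each route to the same goal is blocked independently with probability p: the n-th route only adds value when all n-1 earlier routes are blocked but it is not, so its marginal value is V*p^(n-1)*(1-p), which decays geometrically towards a total of V.

def route_values(base_value, p_blocked, max_routes=10, epsilon=0.01):
    # Marginal value of the n-th redundant route, assuming each route is
    # independently blocked with probability p_blocked; stop searching for
    # more routes once the marginal gain drops below epsilon.
    marginals = []
    for n in range(1, max_routes + 1):
        marginal = base_value * (p_blocked ** (n - 1)) * (1 - p_blocked)
        if marginal < epsilon:
            break
        marginals.append(marginal)
    return sum(marginals), marginals

total, marginals = route_values(base_value=1.0, p_blocked=0.3)
# marginals: [0.7, 0.21, 0.063, 0.0189]; total is already about 0.99 of the limit 1.0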

There is also a problem with conflicting goals, e.g., certain resources may be unable to simultaneously achieve goals A and B, which does not seem to be addressed by standard Q-learning. So different types of goals and resources ought to use different pheromones – not just say "there is some goal (or resource) nearby". Maybe each end would emit to its neighbouring nodes in the search-space (e.g., a province, or a phase of a negotiation) one or more types of "droplet", each containing a specific set of pheromones, one pheromone for each property that a partner needs/helps; part would be retained by the recipient node and part shared with its neighbours, and so on, thereby creating a gradient. (Separate droplets may be needed where it is important that a single partner has all the desired properties. The exact mixture of pheromones would indicate the degree of match in different dimensions, which would be weighted.) The total initial strength of a goal's pheromones would be the tangible value of realizing it, the proportions indicating the relative desirability of various types of resource for realizing it. The strength of a resource's pheromones would indicate its ability to achieve goals with the specified properties. Distribution of droplets could continue even if a node had already been visited (maybe up to some limit) – thereby emphasising well-connected nodes – but would terminate when additions were below a certain strength, to avoid wasting time directing resources towards implausibly distant goals. (This may need refining to be sure there is always a monotonic gradient between the two ends, rather than leading to a dead-end. It would also be important to ensure gradients are such that a route between two goals did not appear to have a stronger smell! Perhaps the Q-learning convergence proof may give a hint.) Some nodes may immediately reflect all or part of droplets containing certain pheromones, indicating difficulty in their traversal, such as an army crossing sea, or a strong opposing force (or the pheromone of a nearby one).
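
A minimal sketch of such typed droplets, reusing the kind of adjacency map sketched earlier; the pheromone names, retention fraction and cut-off are all invented for illustration:

def emit_droplet(graph, source, pheromones, retain=0.4, cutoff=0.01):
    # Spread a "droplet" of typed pheromones from a goal or resource.
    # `pheromones` maps a property name (e.g. "army", "fleet", "sc_gain")
    # to a strength; each node keeps `retain` of what arrives and shares
    # the remainder equally with its neighbours, until additions fall
    # below `cutoff` and so are not worth propagating.
    deposits = {}                       # node -> {pheromone type -> strength}
    frontier = [(source, dict(pheromones))]
    while frontier:
        node, arriving = frontier.pop()
        store = deposits.setdefault(node, {})
        for kind, strength in arriving.items():
            store[kind] = store.get(kind, 0.0) + strength * retain
        neighbours = graph.get(node, [])
        if not neighbours:
            continue
        passed = {kind: strength * (1 - retain) / len(neighbours)
                  for kind, strength in arriving.items()}
        if max(passed.values()) >= cutoff:
            for neighbour in neighbours:
                frontier.append((neighbour, passed))
    return deposits

# E.g. a goal worth 1.0 that mostly needs an army and slightly needs a fleet:
# emit_droplet(ADJ, "MUN", {"army": 0.8, "fleet": 0.2})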

To help avoid conflict, and to partition the total problem space to reduce complexity, my inclination is to have a set of plausibly compatible "projects", each having a distinct set of plausibly compatible goals and a distinct set of plausibly available resources to use. (A resource might be, say, a unit or province, or an agreement with an ally, for some range of turns; acquiring certain types of resources might be prerequisite goals.) Each project would try to plan an optimum solution, given its resources, requesting more resources when it no longer seems plausible to find a reasonable solution, or where a bit more would help a lot, and relinquishing resources on request if they gain it little. A grand root meta-project (or a hierarchy of them) would generate new subordinate projects to fill needs and to exploit opportunities, suspend or destroy those found to be impracticable, and coordinate resource allocation. (Hierarchies of BBB "Agents" may be directly applicable here, as "project managers".)
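
A minimal structural sketch of what such projects and a root meta-project might look like (class and method names are invented; the valuation is a stub, since the real work is the planning itself):

class Project:
    # A bundle of plausibly compatible goals plus the resources currently
    # allotted to pursue them (a resource might be a unit, a province, or
    # an agreement with an ally, for some range of turns).
    def __init__(self, name, goals):
        self.name = name
        self.goals = goals
        self.resources = set()

    def marginal_value(self, resource):
        # Stub: a real version would re-plan with and without the resource
        # and return the difference in expected utility.
        return 0.0

    def wanted(self, pool, threshold=0.5):
        # Resources from the free pool that would help this project a lot.
        return {r for r in pool if self.marginal_value(r) > threshold}

    def relinquish(self, threshold=0.1):
        # Give up resources that gain this project little.
        surplus = {r for r in self.resources if self.marginal_value(r) < threshold}
        self.resources -= surplus
        return surplus

class MetaProject:
    # Root coordinator: creates, suspends or destroys subordinate projects
    # and shuffles resources between them and the free pool.
    def __init__(self):
        self.projects = []
        self.free_resources = set()

    def reallocate(self):
        for project in self.projects:
            self.free_resources |= project.relinquish()
        for project in self.projects:
            granted = project.wanted(self.free_resources)
            project.resources |= granted
            self.free_resources -= granted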

Standard Q-learning seems to deal only with a single resource (the agent), though I think there may be many goals. But in Diplomacy, at least, there are many resources too. In a physical plan, agents would typically be "commanders", with goals to arrange their resources, typically units and provinces, into good patterns. In an abstract plan, agents could be, say, "diplomats", with goals to combine and sequence their resources (including units and provinces, but other kinds too, such as assignments of SCs) into agreements and physical operations (the latter managed in detail by commanders), so as to gain and maintain the approval of an alliance. The pheromone trails from goals may be of value to all resources (that seek that type of goal), and vice versa – resources can achieve goals and goals can provide roles for resources.

----------

A potentially interesting development is Nash Q-learning. Apart from being multi-agent and allowing a dynamic environment, the main difference from standard Q-learning is that, rather than optimizing its own utility, each player optimizes towards a joint Nash equilibrium.
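
As I understand it, the update for player i then uses the joint action of all n players: Q_i(s, a_1, ..., a_n) <- (1 - alpha)*Q_i(s, a_1, ..., a_n) + alpha*(r_i + gamma*NashQ_i(s')), where NashQ_i(s') is player i's payoff in a Nash equilibrium of the one-shot game whose payoffs are the players' current Q estimates at s'; each player therefore also has to maintain estimates of the other players' Q-functions.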

