Cogitations on AI in the Game of Diplomacy: Game Theory
Here is a brain dump, in full detail, of my current Cogitations about Game Theory (GT). (Later Cogitations override earlier ones if contradictory.)
[Originally presented in DipAi post #8002. But see Revised Method.]
I have coined the term Objective GT to refer to standard or classical Game Theory (GT), and similar kinds, which assume that it is common knowledge to all players that they all have access to all information, have perfect reasoning power, and perfect implementation skills, so they expect and can deliver the objectively correct solution – even though that may be a mixed strategy. I believe it to be inconsistent, and certainly an inappropriate model for many purposes, including Diplomacy. This is in contrast to Subjective GT.
I cannot accept that Objective GT is the final answer, because it assumes there is common knowledge that each player has perfect knowledge of every player's utility, and common knowledge that each can and will act perfectly rationally given what he knows. This rarely models the real world accurately; nor even Diplomacy. Also, Objective GT gives no clue how to adjust your play in future when there is clear evidence that another player is not playing as Objective GT would predict. Indeed, if I am confident that my opponent will play according to Objective GT, I need not! (Consider a simple binary guessing game.) At least if there is only one Nash equilibrium (NE), Objective GT says that no rational player dare deviate from it. But if that is so, any one of them could safely do so, if all the others were indeed rational, thereby contradicting one of its axioms! Or at least that demonstrates that the axioms are unrealistic! I would go further and call them surreal, as they immediately lead to a contradiction even in its ideal world, or at least to instability in the face of the slightest perturbation! In a more realistic model, it would usually only gradually become clear that there may be irrational bias, but Objective GT does not say how to play to take advantage of such "errors". It merely asserts that taking advantage would be possible, as it is claimed to be a danger if not playing safely (minimax); but it also says that any rational player would play safely, so could not take advantage of any unsafe play by another player. If there were a certain amount of apparent bias that should trigger some change to Objective GT play, then the opponent would know that too, and could take advantage of that play; so Objective GT play would apparently still be best! Either way, an axiom would fail. That is a self-contradiction! Or at least it proves that irrational (or ill-informed) players should be incorporated in the model. Therefore, Objective GT is incomplete too!
I have coined the term Subjective GT to refer to more realistic versions of Game Theory (GT), which do not necessarily assume that all players have access to all information, have perfect reasoning power, or perfect implementation skills, so they should not expect and cannot deliver the objectively correct solution. I believe that GT can only be consistent in such a weakened form, and certainly only then an appropriate model for many purposes, including Diplomacy. This is in contrast to Objective GT.
As a revolutionary approach, I had been considering that perhaps Subjective GT could best be implemented by just doing a deterministic, empirical analysis of opponents' play, without all the complexity of trying to model their (partially random) thought processes. For example, I was arguing, if an opponent has an undue bias towards (or against) certain patterns of play (such as holding rather than moving, or playing in a similar way to last turn) it is sufficient that I have observed and incorporated this fact – I do not need to know, or even postulate, why my opponent tends to play that way.
I still think this approach has merit in simple games, such as Binary Guessing (BinGuess) or Rock-Paper-Scissors (RPS). Indeed, my BinGuess program is effective without trying to model its opponent's thought processes. And although people probably often try to model their opponent's thought processes in either game, they probably rarely actually approximate the truth – the model is probably usually an illusion, the analysis really being only empirical. However, in complex games, like Diplomacy, I now believe such an unstructured approach would be intractable. Basically, the bot would (obviously!) effectively have to learn empirically all the structure of Diplomacy (or an empirical equivalent) before it could begin to find many meaningful patterns in opponents' play. For example, a player surely normally only plays as he does (even if he has undue bias) after analysing how his play is likely to be adjudicated, yet adjudications can depend sensitively on complex interacting rules, the exact board state and expected play by other players. And players' decisions could depend in complex ways on what press is common knowledge to various subsets of players (as we had discussed earlier in DipAi). Any significant bias is likely to be hidden from simple, naive, unstructured empirical measures – at least without astronomical amounts of observed play and processing power. (Maybe what others already thought!) So I have reverted to my original (and common sense!?) conclusion that the mechanics of the game and rational decision processes should be modelled as accurately as is feasible, but that a Monte Carlo simulation would be necessary for much of the decision process, as the game is too complex for an exact analysis (to design or to execute).
The likely decision processes (including likely variations) of other players should be modelled as accurately as possible by built-in code (or at least by initial "suggestions"), rather than just trying to determine them empirically from scratch. In general, as much Diplomacy knowledge as possible should be explicitly encoded up front, relying as little as possible on any learning mechanism – the main purpose of learning being to tune more finely what has been manually encoded (especially in areas where the author is only confident about the general structure, as is usual), and maybe spot a few unanticipated correlations. (I had outlined my ideas on this a while back. Even humans depend on being given good foundations in any subject, mainly just learning the finer details for themselves; a given researcher generally only produces small extensions to existing human knowledge, "standing on the shoulders of giants" rather than working it all out from first principles; biologically we are little advanced from cavemen, relying on formal and informal education more than on raw powers of deduction; a bot does not have geological amounts of space-time for data gathering and thinking!)
As far as Subjective GT is concerned, I would now just treat any apparently undue biases that I observe as evidence that such a player uses different heuristic weights from me. When I simulate "if he knew that I would probably do x then he would probably do y, so I should do z instead, ...." logic, the probable y would directly incorporate any bias that I expect, even if I would not have had that bias in his circumstances. For example, in the extreme case of a HoldBot opponent, I may come to expect him to hold (say by increasing the weight I assume he uses for a hold), even though that is not my view of what is rational in the circumstances. I may observe that another player tends (or players in general tend) to have a bias towards (or against) repeating orders that failed last time in a 50:50 situation (the "gambler's fallacy" or its opposite), even though I believe (and it is surely objectively so!) that the choices should be totally independent.
I think unstable Nash Equilibria (NEs) are probably fundamentally difficult and expensive to find. (Most approaches that I have considered seem likely to erroneously stabilize on a fixed strategy, but with 50% chance of that being 0% or 100% on any given run, no matter what the payoffs!) But I also fear that unstable NEs may be very common – maybe even 50% of mixed strategies in each degree of freedom (not sure). If so, assuming arbitrary fixed strategies could be very poor. Finding a stable NE (in a 2-dimensional case; that is, 2 free choices) seems a bit like finding where a ball will end when released somewhere in a bowl that has a viscous surface – wait (iterate) long enough and it correctly ends at the lowest point. An unstable NE is similar, except it is a dome rather than a bowl, so the ball always moves away from the correct solution at the highest point, ending somewhere on the rim. Even if you start near the correct solution it will diverge, ever more rapidly. Interestingly, the simulation of one would be a time and gravitational reversal of the other. So to find an unstable NE, naively it seems we just need to decide what would happen if we made such reversals. But in practice there may be large errors. Errors tend to be corrected each iteration as a stable NE is approached, but this does not happen for an unstable NE, as it is never approached closer than the initial guess. My tentative proposal to address unstable NEs is first to decide whether a given dimension has a stable or unstable NE. Assuming smooth mathematical surfaces (like bowls and domes, rather than trumpet-shaped spikes), ever-widening adjustments to the best estimate each iteration should indicate a (divergent) unstable NE. In that case we try to estimate the point of divergence (the correct solution), and reset our estimate to that, rather than continuing from where we are, as we would if apparently approaching a stable NE. The better our estimate of the point of divergence, the slower we should initially diverge in future iterations. At least we will progressively probe the search space near to the correct solution, rather than pointlessly probing far away towards the "rim of the dome", so we should potentially be able to progressively improve our estimate of the correct solution. Note that we do not know the exact shape of the dome, so cannot directly calculate the exact solution. Also "noise", due to complex interactions with other dimensions and quantized iteration steps, would introduce errors, sometimes maybe even misclassifying whether or not the NE is stable. But signal to noise should increase with sufficient samples (iterations), correcting errors in due course. The full method would also need to allow for local optima (not necessarily simple bowls or domes), presumably by using simulated annealing. It would all be done as part of the one Monte Carlo process. Of course, the above intuitions and "poetic" description would need to be firmed up. There may be many devils in the detail!
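To make the bowl/dome intuition concrete, here is a throwaway C++ toy of my own (not part of any proposed bot code): simple gradient-following on a quadratic u/p curve converges on a stable zero-gradient point, but from an unstable one it diverges to 0 or 1 – whichever side it happens to start on.

    #include <cstdio>
    #include <initializer_list>

    int main() {
        const double step = 0.1;
        // Stable case: u(p) = -(p-0.5)^2, a peak at 0.5 (the "bowl", inverted).
        // Unstable case: u(p) = +(p-0.5)^2, a trough at 0.5 (the "dome" in cost terms).
        for (int sign : {-1, +1}) {
            double p = 0.51;                              // start near the solution
            for (int i = 0; i < 50; ++i) {
                double dudp = sign * 2.0 * (p - 0.5);     // gradient of u at p
                p += step * dudp;                         // follow the gradient
                if (p < 0.0) p = 0.0;
                if (p > 1.0) p = 1.0;                     // probabilities stay in 0...1
            }
            std::printf("%s NE: ends at p = %.3f\n", sign < 0 ? "Stable" : "Unstable", p);
        }
        return 0;
    }

The stable run ends at 0.500; the unstable run, started a hair's breadth to the right, ends clamped at 1.000 – exactly the "50% chance of 0% or 100%" failure mode described above.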
[Originally presented in DipAi post #8003. But see Revised Method.]
After considering in more detail how I would calculate a new estimate for an unstable Nash Equilibrium (NE) (the apparent divergent point) it seems that a new estimate for a stable NE could be found in exactly the same way, avoiding the need to decide whether the NE is stable or not! As a bonus it should also optimize the rate of convergence of a stable NE, since at each step it could then jump immediately to the expected solution (as near as we can guess to the final solution), without wasting time sampling irrelevant points, typically moving either too slowly or too quickly (overshooting) – effectively making it critically-damped instead.
Each plausible and plausibly-independent choice of play would be represented by a dimension in search space, constrained to be a probability in range 0...1. For given points in all other dimensions, each point on a given dimension has an associated utility for the player concerned. We want to find the point with highest utility where gradient of utility is zero (a stable or unstable NE), using the nearest 0 or 1 limit (a fixed strategy) if out of that range. (We assume a smooth, single-valued, multidimensional surface.)
(So we are now imagining a balloon rising along a u/p (utility/probability) curve in each dimension, rather than a ball falling. The exact curve, in a given dimension, depends in a complex way on everyone's choices in every other dimension, but we assume all curves should mutually stabilize in due course – see below – there must surely be a single overall solution, at least if we assume consistent arbitrary preferences in the case of otherwise exactly equal solutions. The counterintuitive desirability of ever choosing an unstable NE minimum just reflects the fact that if I do not choose one with zero gradient my opponents can take advantage of me – by changing the relevant curve so that I can only find an even worse solution! At the overall asymptotic solution point, the curves in all dimensions should have zero gradient, unless at 0 or 1 because no such point is in range.)
Each possible order for a unit might well plausibly have its own dimension, except that it would be wasteful to include just supports and convoys, which are clearly pointless in isolation. Instead, various plausible groupings (order-subsets) would be combined into a set of dimensions, each of which I call an "operation" (or just "dimension" in context when its specific orders are irrelevant). (Probably all possibilities would be considered, unless impracticably many, in which case a random representative collection would be used.) (Strictly, one order for a unit is constrained by the probabilities having to sum to 1, so does not need its own dimension, but for convenience and symmetry we assign each a notional probability, which would be normalized to give effective probability – see below.) It might similarly be best to combine orders such as A->B and B->C, since we can see directly that the former would necessarily be bad without the latter, and the latter without the former may be bad, due to leaving B uncontested. However, whenever there is an explosion of specific plausible combinations (such as A->B and B->x for many x) it may be necessary or advisable to consider only a random but representative collection. (We probably cannot easily consider x as a single generality, since adjudication is probably only straightforward for specific orders.) Important sequential combinations could also be included; for example, A->B followed by B->C by the same unit in a future turn. An operation may also include orders for a mixture of powers when coordination is important (such as for supports, convoys and agreed combinations) for compatibility checks, but only the orders for the power that owns the operation would be selected for play at any step. (For the coordination to be effective, the (models of, and real) other powers would simultaneously – but independently – have to decide to select them, using their own copies of the operation. This might not happen even if there were an agreement, albeit the utility of each party would then be reduced due to reduced "trust".)
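As a concrete (hedged) illustration, here is roughly the data I imagine an operation carrying, as a C++ sketch; all the names, and the exact fields, are my own illustrative choices rather than a settled design:

    #include <vector>

    enum class Power { AUS, ENG, FRA, GER, ITA, RUS, TUR };

    struct Order {
        Power power;     // power whose unit is ordered (not necessarily the owner)
        int   unit;      // unit id
        int   from, to;  // provinces; to == from represents a hold
        // support/convoy details omitted for brevity
    };

    struct Operation {
        Power owner;                // the power that "owns" (selects) this operation
        std::vector<Order> orders;  // small combinable order-subset, any mixture of powers
        double notionalProb;        // unbounded estimate (see "notional" vs "effective" below)
        double intensityBoost;      // raises selection chance when hard to fit in with others
    };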
The more dimensions, the longer each cycle would be, and the more cycles would likely be needed for convergence to a given accuracy (depending on the lengths of causal sub-cycles), so we do not want to consider too many combinations. That would indicate an advantage in using as few representative dimensions (operations) as possible; for example, few (maybe just one) of each different number of supports that could theoretically be needed, given the available opponents. However, clearly this may miss important combinations; for example, a support that could not, or (after deeper analysis during this procedure) probably would not, be cut. As there is no point in converging precisely to an optimum within a bad set, I would err towards including all combinations, except when pathologically many. (It would be a matter of judgement and compromise – maybe like look-ahead pruning in chess, in which bots, but not humans, are generally best advised to use minimum pruning. But, as demonstrated by chess masters, there could be much to gain if accurate heuristics can be found.)
Each operation would correspond to a small combinable order-subset, as I have proposed previously. As I then explained, it would probably be necessary to derive probabilities during any specific selection of an operation according to "intensity", being like effective probability, but derived from boosted notional probability when not inhibited by already-selected incompatible orders; 0 otherwise. (See below for "notional" and "effective".) If a desired effective probability were not achieved due to such clashes, its intensity would be increased to increase the chance of selection when there is no clash. I.e. when an operation is not inhibited by incompatible orders already selected, its effective probability would be boosted if it tends to be difficult to fit in with others. Such boosts might be fairly stable even while notional probabilities are being rapidly adjusted. (Interactions between orders can be complex, so the relationship between intensity and probability can only be determined empirically in each case.)
So, at each step, an operation for some (modelled) power is selected for inclusion in the full order-set sample, with probability proportional to its intensity. (See below for how to choose the power.) Any incompatible operations are then removed from the sample's candidates. Additional compatible operations are then selected, with probability proportional to their intensities, until all units (of all powers) have an order. (Rarely should we fail to find a suitable operation to provide all required orders, but if so, repeat from the initial unconstrained selection. The intensity of the previous (problematic) initial selection should be reduced by a fraction to avoid a possible pathological loop when only incompatible operations have significant intensity. It must eventually succeed, provided every unit has at least one operation that cannot be incompatible with any other such operation; for example, just a hold.)
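A hedged C++ sketch of this selection loop, building on the Operation struct above; incompatible() is sketched under the incompatibility rules further below, and the unit-coverage test here is a simplified stand-in for a real hook into the board state:

    #include <cstddef>
    #include <random>
    #include <set>
    #include <utility>
    #include <vector>

    bool incompatible(const Operation& a, const Operation& b);  // sketched below, after the rules

    std::mt19937 rng{12345};

    // Pick an index with probability proportional to the given intensities.
    int pickByIntensity(const std::vector<double>& intensity) {
        std::discrete_distribution<int> d(intensity.begin(), intensity.end());
        return d(rng);
    }

    // All units (of all powers) have an order once every (power, unit) pair is
    // covered by the owner's own orders of some chosen operation.
    bool allUnitsOrdered(const std::vector<Operation>& ops,
                         const std::vector<int>& chosen, int totalUnits) {
        std::set<std::pair<Power, int>> ordered;
        for (int c : chosen)
            for (const Order& o : ops[c].orders)
                if (o.power == ops[c].owner)      // only the owner's orders are played
                    ordered.insert({o.power, o.unit});
        return static_cast<int>(ordered.size()) == totalUnits;
    }

    // One step: build a full order-set sample, restarting if we get stuck.
    std::vector<int> sampleFullOrderSet(const std::vector<Operation>& ops,
                                        std::vector<double>& intensity, int totalUnits) {
        for (;;) {
            std::vector<int> chosen{pickByIntensity(intensity)};  // unconstrained initial pick
            bool stuck = false;
            while (!allUnitsOrdered(ops, chosen, totalUnits)) {
                std::vector<double> masked = intensity;           // mask incompatible ops
                for (std::size_t i = 0; i < ops.size(); ++i)
                    for (int c : chosen)
                        if (incompatible(ops[i], ops[c])) { masked[i] = 0.0; break; }
                double total = 0.0;
                for (double m : masked) total += m;
                if (total <= 0.0) { stuck = true; break; }        // nothing compatible left
                chosen.push_back(pickByIntensity(masked));
            }
            if (!stuck) return chosen;
            intensity[chosen[0]] *= 0.9;  // damp the problematic pick: avoids pathological loops
        }
    }

Note that the loop must terminate when it does not restart: any operation compatible with those already chosen necessarily orders at least one new unit (otherwise it would clash on a shared unit), so coverage grows each iteration.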
If we assume the u/p curve in each dimension (at any step) is a quadratic (albeit generally a slightly different one at each cycle) there is just one zero-gradient point. We can use our estimates of utility at the points that we sample to progressively improve our estimates of its 3 coefficients, from which we can trivially calculate its point of zero gradient, which is both our current best estimate of the solution for that dimension and the best point to sample next (subject to adding "thermal" noise then trimming and scaling to within range 0...1 – see below). In fact, we only need to estimate the 2 coefficients of the first derivative – see below. Although the u/p curve (at any step) is unlikely to be exactly quadratic, this approximation can be arbitrarily good for a small enough region near to the solution. All the curves should stabilize, and the regions being sampled should progressively shrink, as the overall solution is converged upon.
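In symbols (just restating the above):

    \[
      u(p) \approx a\,p^2 + b\,p + c
      \quad\Rightarrow\quad
      \frac{du}{dp} \approx 2a\,p + b
      \quad\Rightarrow\quad
      p^{*} = -\frac{b}{2a},
    \]

so only the two coefficients of the (linear) derivative are needed, and p* is then trimmed and scaled into the range 0...1 as described.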
Unfortunately, in practice we cannot compute utility directly from the probabilities in each dimension. We need to generate a set of specific orders, selected according to the currently estimated probabilities (using their intensities), adjudicate these orders, heuristically evaluate the utility, for the power concerned, of the resultant positions, and take the mean over many such samples. (This is the core reason for discarding my later over-simplified empirical approach. If we do not build in the adjudication logic and "conventional" heuristics for position evaluation, the bot would have to learn it all empirically, and from very noisy data. Less significantly, hand-coded C++ should also improve it by a big constant factor.)
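A sketch of the resulting core loop, again in illustrative C++; adjudicate() and evaluate() are hypothetical hooks standing in for the built-in rules engine and "conventional" position heuristics, and sampleFullOrderSet() is the selection loop sketched earlier:

    // Hypothetical hooks for the rules engine and position heuristics.
    struct GameState { /* board position, unit locations, centre ownership, ... */ };
    GameState adjudicate(const GameState& s, const std::vector<Operation>& ops,
                         const std::vector<int>& orderSet);
    double evaluate(const GameState& s, Power me);  // heuristic utility for one power

    // Mean utility over many sampled full order-sets, per the text above.
    double meanUtility(const GameState& s, Power me, int nSamples,
                       const std::vector<Operation>& ops,
                       std::vector<double>& intensity, int totalUnits) {
        double total = 0.0;
        for (int i = 0; i < nSamples; ++i) {
            std::vector<int> orderSet = sampleFullOrderSet(ops, intensity, totalUnits);
            total += evaluate(adjudicate(s, ops, orderSet), me);
        }
        return total / nSamples;
    }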
The first derivative of the assumed quadratic u/p curve, du/dp, is a straight line; the currently estimated solution for its probability being where this intersects the probability axis (du/dp = 0). To estimate the coefficients of the du/dp line we need as many samples as possible (at least 2 distinct points, but best to defer processing while there are only a very few, as they could lead to pathologically erratic behaviour). Conveniently, the ideal "thermal" scatter will already be applied to avoid being trapped in local optima – see below. Use the formula for best fit (least mean square deviation) to a straight line, but generally with different weights for each point – see below.
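One standard reading of this, assuming gradient samples g_i at probabilities p_i (obtained, say, from differences of sampled utilities – my assumption, not specified above) with weights w_i, is the weighted least-squares fit of the line du/dp ≈ m p + k:

    \[
      m \;=\; \frac{\sum_i w_i \,\sum_i w_i p_i g_i \;-\; \sum_i w_i p_i \,\sum_i w_i g_i}
                   {\sum_i w_i \,\sum_i w_i p_i^2 \;-\; \bigl(\sum_i w_i p_i\bigr)^2},
      \qquad
      k \;=\; \frac{\sum_i w_i g_i \;-\; m \sum_i w_i p_i}{\sum_i w_i},
    \]
    \[
      p^{*} \;=\; -\,\frac{k}{m}.
    \]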
Older data should generally have lower weights to give a moving average (since the u/p curve and its derivative may be changing). (We assume simple derivatives, even though partial derivatives are strictly appropriate, since other dimensions are simultaneously variable. But cross-terms involving other dimensions should approach zero as all dimensions approach the constant solution.) Points with higher utility should also have higher weight to bias the solution towards higher u/p optima if more than one is visible at the current "temperature" – see below. Both types of bias should be proportional to temperature, which is a measure of the overall scale of what is in view. (When more than one u/p local optimum is visible, du/dp is not a straight line, but a curve that multiply-intersects the probability axis (once for each local optimum). In that case the concept of a best-fit straight line is meaningless, but appropriate weights should make it less so. We bias towards the local optimum that we prefer, others going out of view as temperature falls – though others may then be resolved as finer detail is exposed. Beware of cooling too quickly or all local optima may disappear from view without warning, even though an apparently best-fit straight line would always be found! However, the above biases would at least tend to make its estimated solution (and hence future sampling) move in the appropriate direction – as it would if there were no optimum in the range of a valid effective probability. Systematic movement in one direction could be used to indicate that temperature ought to be increased.)
A further reason for using a moving, rather than all-time, average for the coefficients for the du/dp line is that the u/p curves may be changing rapidly in the earlier cycles, while far from the solution. So data should be weighted in inverse proportion to "temperature". (The curve in one dimension may be affected by the currently selected points in other dimensions; for example, reflecting the fact that one player would adjust his strategies (represented by current probabilities in the dimensions he controls) in the light of those of the other players; or due to correlations; for example, increasing probability of an operation represented by a point in one dimension may change the utility represented by another dimension.)
Note that the above estimated probabilities are "notional" – they could have any real value. When selecting an operation, only "effective" probabilities are meaningful: all must be in the range 0...1 and those competing in any selection must sum to 1. (I have not generally specified "notional" or "effective" except where not clear from context.) To obtain effective probabilities, add (different) thermal noise to each, treat negative values as zero, and then divide all by the sum of such adjusted values. (Negative values represent dominated operations that are so bad compared to alternatives that they are off the scale!) So a negative notional probability need only have a very rough estimate – just sufficiently accurate to avoid thermal noise giving it significant chance of having positive effective probability. (Dimensions with sufficiently negative notional probability would eventually be eliminated from the problem-space – they must remain under review while thermal noise would reasonably often be big enough to make their effective probabilities positive, but when it is not they do not affect other u/p curves, which would help the overall convergence rate. Thermal noise could also often make a small positive notional probability have zero effective probability, but even high noise only has a 2^-n chance of all being in the right direction to possibly zero them all.) A mixed strategy is implied only if more than one effective probability is positive. Even a positive notional probability need not be estimated very accurately to be worthwhile – any improvement on always assuming an equal-probability mixed strategy, let alone a fixed strategy, is a gain. However, any further improvement may give an edge in an arms race of ability.
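A minimal C++ sketch of this notional-to-effective conversion (the uniform fall-back for the 2^-n all-clamped case is my own choice, not specified above):

    #include <cstddef>
    #include <random>
    #include <vector>

    // Convert notional probabilities to effective ones: add Gaussian "thermal"
    // noise, clamp negatives (dominated, off-the-scale operations) to zero,
    // then normalize so that competing probabilities sum to 1.
    std::vector<double> effectiveProbs(const std::vector<double>& notional,
                                       double temperature, std::mt19937& rng) {
        std::normal_distribution<double> noise(0.0, temperature);
        std::vector<double> eff(notional.size());
        double sum = 0.0;
        for (std::size_t i = 0; i < notional.size(); ++i) {
            double v = notional[i] + noise(rng);   // independent noise per dimension
            eff[i] = (v > 0.0) ? v : 0.0;
            sum += eff[i];
        }
        if (sum > 0.0)
            for (double& e : eff) e /= sum;
        else                                        // rare: everything clamped to zero
            eff.assign(eff.size(), 1.0 / eff.size());
        return eff;
    }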
Best to consider different dimensions each step, rather than optimizing one before a change. (There is no point in fully optimizing A when a later optimization of B may change the shape of A's u/p curve!) Also best not to have a fixed sequence of dimensions, since any chains of influence in the reverse direction would be pathologically slow – one link each cycle through all the dimensions, compared to one link each step in the forward direction. This allows causal "leaps" to occur, which should reduce the mean length – analogous to apparently being able to deliver a letter to anyone on earth, with unknown address, in about 6 steps, if each forwarder makes his best attempt to reduce the (logical) distance. Minimizing causal chain lengths should maximize convergence rate. All this can be achieved by random choice of (modelled) power at each step. (All levels of model – see below – are probably equally important: I only need the top level for making my actual play, but the accuracy of each level depends on accurate lower levels as firm foundations.)
Adjudication and evaluation would be done incrementally. Consider only orders or units that have changed this step, calculating the effects of removing the old then inserting the new. Most effects should be reasonably local.
To avoid getting stuck in a local optimum, "thermal" noise would be added by adding a random element to any notional probability when producing the effective probability for sampling. (A Gaussian distribution would probably be best, having a bias to small deviations but some chance of arbitrarily large ones.) The "temperature" (standard deviation) would gradually be reduced ("simulated annealing") to enable the exact optimum to be approached within a good candidate local optimum. The slower the rate of "cooling" (the more time we have available) the better the candidate will likely be; that is, the closer to the global optimum. (Best to avoid prematurely "jumping to conclusions"!) (This assumes that peaks and troughs themselves lie on simple smooth curves, in a fractal hierarchy – like a fractal landscape, or small waves superimposed on larger ones. When temperature is large, only the larger-scale structures are visible and we explore widely; when lower, smaller-scale structures can be resolved, and we stay within a small region, which can be explored in depth.) Temperature would be increased again, to allow fresh annealing, in dimensions possibly affected by new press, significant discoveries in "game-space" or lack of any visible optimum. (By "discoveries in game-space" I mean things like discovering that only the occupation of a certain province would provide a reasonable set of operations next turn, such that our heuristics could not directly predict it this turn. In that case, the heuristic weight for that occupancy this turn should be increased, and the temperature of logically "nearby" (strongly causally connected) operations increased for local re-annealing. Appropriately, dimensions with the highest temperatures (having most uncertainty in estimated utility and hence probability solution) and reasonably high notional probability will tend to be sampled more frequently (50% chance of a big bias in the right direction), thereby tending to correct gross errors before continuing with ongoing (maybe never-ending!) optimization.)
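The cooling/re-heating policy might be as trivial as the following C++ fragment; the constants are pure placeholders for the ad hoc parameters that (as discussed below) would have to be learned:

    #include <algorithm>

    double temperature = 1.0;          // per dimension in the full method
    const double coolRate = 0.999;     // slower cooling -> better candidate optimum

    void onStep()     { temperature *= coolRate; }                    // simulated annealing
    void onNewPress() { temperature = std::max(temperature, 0.5); }   // re-anneal affected dims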
In general, I would expect the model only to stabilize into an orbit of finite size, possibly chaotic, wending its way through many dimensions with each cycle. But I believe (hope!) this orbit should approach a point (a definite and accurate solution) as temperature (mean adjustment size) approaches zero. Provided the orbit is small enough, any snapshot should be a reasonably accurate estimate of the asymptotic solution; for example, if we have run out of time and must issue our orders. A moving average should be even better.
The rate of convergence would also probably become progressively slower, eventually extremely so, as the optima would probably generally become less sharply defined, so giving less "feedback". For example, in a binary guessing game, at the solution, the gradient in each dimension is constant (zero), which provides no feedback! This would lead to much random walking – harmless but unproductive. The solutions for more complex games or sub-games (as in the double RPS game that we have discussed) probably do not have a flat u/p curve for all valid effective probabilities; it would be sufficient to constrain the random walk to within the flat range. When the mean improvement in accuracy per step becomes too low we should devote our resources to other types of analysis; for example, strategic look-ahead without detailed operations and adjudication. The GT procedure may need continuing when temperature-increasing events occur (see above).
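A worked instance of that flat-gradient problem, taking ±1 payoffs for binary guessing (the payoff values are my own illustrative choice): once the opponent guesses 50:50, my utility is flat in my own probability p, so samples give no feedback at all:

    \[
      u(p) \;=\; p\Bigl[\tfrac12(+1) + \tfrac12(-1)\Bigr]
           \;+\; (1-p)\Bigl[\tfrac12(-1) + \tfrac12(+1)\Bigr]
           \;=\; 0
      \quad\Rightarrow\quad
      \frac{du}{dp} \equiv 0 .
    \]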
The optimum rates for updates, such as moving averages or cooling rate, could change during the convergence procedure, or could be different for different dimensions or at different stages of the game. But it is not obvious to me, from theory, what they should be. I assume they are controlled by intrinsically ad hoc parameters of DipSpace that would have to be learned empirically, over many games, and during a game. Hopefully not critical: they should only affect speed of convergence (efficiency of the method) by a constant factor, and not the asymptotic solution.
Strictly, all the above only applies if we assume all knowledge is common to all powers. The main knowledge (the rules, board state, plays to date and much "common sense" about Dip) will be common to all. However, when press is sent, the sender and all recipients will share additional common knowledge. (Fortunately, unless I use FRM "hearsay" press, I only need to consider "direct" press that I have sent or received – there is probably no point in speculating about other specific press, and such speculations would not be common knowledge anyway. And some press might be too trivial or irrelevant to worry about here.)
In general, any step in the procedure should use data specific to the model "mind" of the power concerned, which should include a current sample of a full order-set for all powers, each operation that includes any of his orders, his current estimates of the coefficients of their du/dp lines (and hence optimum probability), intensity boosts, and the heuristic weights that I expect he would use (including any observed bias from what I would use). Such a modelled mind could even be a model of my mind in that of another modelled power, generally different from my model of my mind! But, as stated, I am unlikely to need to represent more than one level of mind below mine. (A bottom-level model always assumes its models are identical to itself – as happens in the single, single-level model if no press (of significance) is used and no significant unexpected bias is detected (or looked for!) – so its pointers to its models just point to itself, thereby stopping an infinite regress.) Nevertheless, an extra model ought to be simulated for every distinct set of powers that I converse with (in any significant way) – maybe typically 6 extra. Not only is there then less CPU power to converge each model, but because lower-level models affect higher ones, each higher one will tend to converge more slowly (fortunately typically only one). However, each model need not be totally distinct: the same set of heuristic weights can be pointed to where there is no obviously different bias; the same order-subset data can be pointed to by all powers that have an order in it. More importantly, the same data can be pointed to unless I have seen appropriate press relating to it (as happens when my models for ENG and FRA have the same beliefs about a given operation, unless I have discussed something that would affect a component order with just one of them). Any reduction in distinct dimension data provides more updates (since it would be updated when simulating any model that shares it), and convergence should be faster with fewer degrees of freedom.
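A C++ sketch of the self-referencing trick that stops the regress (field names are illustrative; the shared-pointer economies described above are omitted):

    // Each modelled "mind" holds its models of every power's mind. A
    // bottom-level model assumes its models are identical to itself, so its
    // pointers simply point back at itself, stopping the infinite regress.
    struct MindModel {
        MindModel* modelOf[7];  // this mind's model of each power's mind
        // heuristic weights, du/dp coefficients, intensity boosts, etc. would
        // be pointed to here, shared wherever no distinguishing press exists
    };

    void initBottomLevel(MindModel& m) {
        for (int p = 0; p < 7; ++p) m.modelOf[p] = &m;  // self-reference
    }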
I do not claim to have proven that the above would work (at all, tractably, efficiently or accurately), but it all seems plausible to me. In any case, I do not currently have any plausible alternative ideas! It is somewhat complex and may well not get close to the theoretical optimum in practice, but that probably just reflects intrinsic complexities of Diplomacy. Still worth our best attempt with available resources, rather than ignoring the potential of mixed strategies or searching for unexpectedly good, but difficult-to-find, combinations. Some of it has to be done anyway; for example, generating compatible operations, elimination of bad (maybe dominated) ones, after adjudication and heuristic evaluation. In any case, perfection is not essential, since even if perfect in itself, its effective accuracy would be limited by the accuracy of its heuristics! And even if the method fails in some circumstances (say, where my assumptions do not apply), it might still prove to be a worthwhile heuristic in itself.
---------------
As there are some non-obvious subtleties, I define "incompatible operations" (an operation comprising one or more orders, for any mixture of powers) such that they would never be allowed to be in the same full order-set for all powers. Basically, to avoid wasting time (only), the full order-set should be such that it could all plausibly be issued by non-idiot players. (Otherwise we would have to wait for operations to be eliminated due to their poor average resultant utility. Generally, it is best to use algorithmic, rather than empirical, methods when possible.) So all orders must at least all be legal!
First, operations can only be incompatible (for the logic here) if "owned" by the same power (even though they may sometimes also contain orders for other powers). However, all orders within the operations owned by a given power must be compatible, whatever power owns the units being ordered. (There is no point in having an operation that contains no orders for its owner. Another power may also own an operation comprising the same order-subset, but it would contain his own independent probability, and so on, for it, and would be selected independently.)
Second, operations owned by a given power are obviously incompatible if they contain more than one order for the same unit.
Third, and less obvious, is that operations owned by a given power are always considered incompatible if, between them, they contain multiple orders for movement into the same destination (by any power); such multiple moves are allowed only within a single operation, in which case there must be equal support for each move (within that operation). That is, "accidental" standoffs by separate operations are considered to be incompatible (choose one or another or none); only "deliberate" standoffs are allowed within operations owned by a given power (whether or not all orders involved are for the same power), provided each order required to cause the standoff is specified within a single, balanced, operation. (Like supports and convoys, alternative plausible combinations that try to specify the same movement order require distinct alternative operations, and multiple combinations to or from the same place are incompatible.)
Fourth, heuristic checks could be used to eliminate further combinations that, a priori, would be highly likely to be bad if played together. For example, for strategic reasons (without regard to specific combinations that can be adjudicated), it might be considered probably unacceptably poor to move too many units into, or out of, a certain region; or to build too disparate a ratio of armies to fleets. Although each compatibility test would slow the operation-selection phase, a reasonably accurate heuristic test here would almost certainly be faster overall (at least increasing time at a lower rate with problem size) than allowing the full method to determine this empirically. (But best not to eliminate possibilities unless sure enough, or we may cause significant blind spots!) For better speed, pre-calculated data, such as flags or codes (indicating relevant "kind") or counts, and so on, could be used.
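Pulling the rules above together, a hedged C++ sketch of the pairwise test (using the Order/Operation sketch from earlier; the fourth, heuristic, rule is omitted, and within-operation "deliberate" standoffs are untouched since this only compares distinct operations):

    // True if two operations may never appear in the same full order-set.
    bool incompatible(const Operation& a, const Operation& b) {
        if (a.owner != b.owner) return false;  // first rule: only same-owner ops clash
        for (const Order& x : a.orders)
            for (const Order& y : b.orders) {
                if (x.power == y.power && x.unit == y.unit)
                    return true;               // second rule: one order per unit
                bool xMoves = (x.to != x.from), yMoves = (y.to != y.from);
                if (xMoves && yMoves && x.to == y.to)
                    return true;               // third rule: "accidental" standoff
            }
        return false;
    }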
I would not consider lack of some specific order (or member of some set) to be a reason to reject the full order-set. Nor would I now allow an operation to specify the absence of any order(s) (even though I did suggest this long ago). To represent inhibitions, such as NOT(XDO(x)), or the less specific DMZ(y), say, (whether or not appearing in press) _absolutely_, just totally eliminate operations that contain inhibited orders from those operations owned by the power concerned. (Where a quick static check is possible, it would be wasteful ever to add totally implausible operations to the list of possibilities for selection.) To represent operations that should only tend to be inhibited, they just need low probability – to which they should converge if the conflict tends to produce low utility (for instance, by loss of "trust" due to making an apparent threat, or breach of an explicit agreement).
The reason for including and checking compatibility of orders for multiple powers in a given operation (even though, if selected, only the owner's orders would be added to the full order-set) is a bit subtle. Basically, a (model of a) power tries to find the optimum combination of all orders, for all powers, that he would like to achieve. But there is no point hoping that any power might cooperate (with or without press) by issuing any orders that he would consider to be mutually incompatible (as defined above), or similarly incompatible with those of other powers. An idiot (such as a pure RandBot that is not a ConsBot) might do so, but then we cannot expect any likely cooperation!
In general, if combinations of orders that are plausibly only reasonable when issued together are only contained in separate operations, their separate, possibly necessarily high (but < 100%) probabilities could also give significant probability of issuing just one. If playing just one is not even plausible, only the combined operation should be included, to avoid having to wait for the separate probabilities to converge to zero due to their systematically tending to produce lower utility than alternatives.
A further example of a plausibly useful multi-order operation would be A->B and C->D, such that the 2 units would try to move in parallel formation, say – or not at all – to try to maintain an unbroken line.
[See DipAi post #8125 from Abram Demski.]
The link http://arxiv.org/abs/cond-mat/0402508 found by Abram Demski is exactly what I have been looking for! I do recommend those interested in GT to at least skim the paper, but you could start with http://www.mcs.utulsa.edu/~stephane/presentation/colloquium04-coin.pdf, which has a much simpler potted, bullet-pointed guide to Product Distribution (PD).
For me, it is one of the best written and clearest papers on a deep subject that I have seen for a long time – if ever. Minimal jargon (none pretentious) that is not at least briefly explained in terms that a well-informed person from outside the specific domain could understand. The math, especially towards the end, was beyond me to follow in detail, but it can be ignored on a first pass, and I think I could convert it to code equivalent, if need be, with a bit of a refresher/look-up on the Web. In any case, as Einstein said, explanations should be "as simple as possible – but no simpler!" – so if that is how the correct math is, so be it! It certainly sounds plausible to me, and indeed, amazingly, seems to be a more precise version of much of what I had thought and espoused here. I certainly could not attempt to refute what the author (Wolpert) says (apart from a few quibbles) – rather, until someone can show good reason to believe otherwise, I shall assume his proofs are valid. Furthermore, I shall not attempt to compete with it in my cogitations – only build on it.
With hindsight, I guess the "surface" in which we have talked about finding the optimum is the Lagrangian of the game-state. (I never really got the hang of Lagrangians – or Hamiltonians – when a Physics undergraduate, though I did recently when working my way through Roger Penrose's tome: The Road to Reality!) Knowing it is a Lagrangian gives a pointer to how to interpret the math accurately, but for many purposes we can ignore that, and just use intuition to get a feel for the problem, as before. To follow (his, and physics free energy) convention then, I will reverse sign and go back to defining the problem as minimizing, rather than maximizing – compare a ball rolling to the lowest point on the floor, rather than a balloon rising to the highest point on the ceiling. The Lagrangian then represents cost, rather than utility, where cost = -utility. Each player is trying to minimize his own cost (= maximize his own utility). Cost can also be considered to be "opportunity cost" for missing the chance to increase utility. So "cost" is a good word and concept, provided we remember we mean "negative-utility" – as defined in what is normally called "Utility" (not "Cost") theory.
Furthermore, as the paper says, the minimum (cost) Lagrangian is also at maximum entropy (which ties in with what happens in physical systems too – entropy of a closed system never decreases, but evolves with time towards maximum entropy; that is, maximum disorder – all possible states having equal probability – the Boltzmann distribution). (It is known as the maxent principle for short.) Wolpert also reminds us that, in the absence of information to the contrary, smooth surface, maximum entropy states should be assumed. It is like, as is readily observed, the fact that soap bubbles tend to remove any initial bumps (as they minimize the "free energy" of the surface). Or that air tends to fill a room uniformly in temperature and pressure, rather than remaining, let alone clumping in one corner, say, or retaining, let alone spontaneously creating, hot spots. Such surfaces are simpler to describe (need less information) than more complex, lumpy ones. It is a form of Occam's Razor.
"Bounded rationality" (rather than "absolute rationality") seems to be correct for real world (non-trivial) cases (including Diplomacy), and seems to be the key reason for real-world difficulties with Nash's paradigm. It is what I was trying to get at with my Subjective versus Objective GT.
Perhaps we could say that the degree of "irrationality" can be equated to "temperature" (and analogues that I have discussed; for example, in simulated annealing temperature is gradually reduced, meaning irrationality would be gradually reduced). Specifically, the term -T*S(q) should be minimized as part of the Lagrangian (alongside expected cost), where T is temperature, and S(q) is Shannon entropy of the state distribution q (Shannon entropy here being negative; that is, minus the bits of information, rather than a (positive) measure of physical disorder).
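For reference, the maxent Lagrangian as I read it in Wolpert's paper, using the conventional (positive) sign for S; I present this as my reading, not a quotation:

    \[
      \mathcal{L}_i(q_i) \;=\; \mathbb{E}_{q}\bigl[g_i\bigr] \;-\; T\,S(q_i)
      \qquad \text{(expected cost, penalized by entropy)}
    \]
    \[
      q_i(x_i) \;\propto\; \exp\!\Bigl(-\tfrac{1}{T}\,\mathbb{E}_{q_{-i}}\bigl[\,g_i \mid x_i\,\bigr]\Bigr)
      \qquad \text{(its minimizer: a Boltzmann distribution)}
    \]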
I had not heard of the term Product Distribution (PD) theory, and it is a (presumably) important aspect that was not adequately explained in the paper to an outsider like me. (It is an unfortunate choice of phrase, as a natural interpretation – commonly found by Google – means "distributing commercial products to shops" – or even "the distribution of chemical products during a chemical reaction"! "Boltzmann" is a good keyword to add to a search.)
However, although most of the math presented by Wolpert can presumably represent the problem space exactly, he mainly just hints at the (well known) difficulties in obtaining accurate data, and obtaining accurate solutions in feasible times. No doubt good standard numerical approximation methods are available in the literature – though he does say there are debates on what is the best approach in various areas. And I would expect that it would be difficult to formulate the exact Lagrangian for a messy problem-space like DipSpace (far more complex than, say, the physics of an ideal gas, in which each particle can be represented by just 6 real variables: position and velocity in 3 dimensions). (The vast number of gas particles does not complicate the formulation – it "only" complicates finding specific solutions, requiring the approximations of statistical mechanics.)
So I expect that most of what I and others have discussed in a hand-waving way will still be of value – as practical approximations of the accurate theory, which some of us (I anyway) only understood intuitively. But I do like to have an accurate theory as a conceptual longstop, to ground speculative methods in reality – even if the accurate methods are never worth using or coding exactly. And note that Boltzmann entropy is only an approximation to Gibbs entropy, in which it is assumed that the distribution of each particle is independent of all the others (good for gases, due to wide separations, but much less so in liquids and solids). NB: There are many (related) types of entropy – a prefix ought always to be included if not implicit from context!
In II/F (p.5) Wolpert discusses a case where a player does not care only about minimizing his cost (that is, maximizing his utility), and then goes on to state that this violates the axioms of utility theory. Quite! I am not prepared to consider such a violation. To me (and in the conventional technical interpretation of utility), trying to maximize (what he believes to be) his utility effectively just is (defines) what a given free agent is trying to do – even if he does not know what is good for him and tends to be self-destructive! Wolpert suggests that an alternative goal might be to reduce risk, but I would just wrap that up with his utility. For example, for some players, lower risk may lower their (actual or perceived) expected cost, and hence raise expected utility. A player in an apparently weak position should tend to prefer more risk, as he will otherwise probably lose anyway; a player in an apparently strong position should tend to reduce risk, as he will probably win if he just plays steady and sure. (It is good that maxent Lagrangians allow prior knowledge to be defined in many ways – probably any conceivable way – but it is surely surreal and pointless, at least, to define it in ways that, by definition, could never occur, even in principle!)
In II/I (p.6) and elsewhere Wolpert states, in effect, that, whereas Nash Equilibria (NEs) can occur at the boundary of game-space, representing pure strategies, bounded rationality solutions can only occur in the interior; that is, with at least some infinitesimal probability of any pure strategy solution. (Such subtleties can be critical in many strict proofs, even if not in practical problems with noise or fuzziness.) In effect this is a trivial statement: any real player has some finite chance of doing anything, including making an arbitrarily bad mistake (if only due to a cosmic ray hitting a critical neuron or transistor)! But surely, in a discrete valued, deterministic model, where all powers have limited computation power (hence bounded rationality), all can find pure strategy solutions (on the boundary) in trivial cases.
On reflection, I think my previously proposed method for finding Nash Equilibria (NEs), at least unstable ones, is somewhat naive. The math may be valid, but its assumptions are probably not. It relies on each dimension of the (inverted Lagrangian) surface approaching a quadratic (parabola) when sufficiently close to an optimum (the global optimum if approached slowly enough by simulated annealing). There are at least three problems with this assumption.
First, when there are multiple local optima (as is likely to be common, especially with something as complex as DipSpace), it seems that the method would, surely (if unstable NEs can indeed be found), be as likely to converge on a local minimum as a local maximum.
Second, and more important, noise is likely to disrupt any underlying parabola, not only disrupting consistent "forces" towards the optimum, but potentially making the estimate extremely unstable. For example, if the three points that I proposed to measure to define the parabola happen to become collinear (or nearly so), due to noise, the estimated optimum would go to infinity (or nearly so). If this ever happened (and it would probably recur), the mean of many measurements of the optimum would become infinite (or unduly large). (There are not three specific points that can be averaged instead. There is likely to be significant noise because it is somewhat hit or miss whether or not a specific order succeeds, so associated gains or losses tend to be all or nothing.)
Third, contrary to what I have previously thought, a fixed strategy (probability 0 or 1) is not best thought of as just a special case of the general mixed one. It would only be a special case of a mixed strategy if, mathematically, it were infinitesimally different from 0 or 1. In general, a fixed strategy occurs when a strategy is dominated by at least one competitor strategy (so should have probability 0), or dominates all other competitors (so should have probability 1). In general, the dominance tends to be sufficient to maintain the extreme probabilities even in the presence of some noise. Recognising this distinction avoids consideration of perverse notional probabilities outside the range 0...1, which otherwise tend to occur in the most natural formulae until ad hoc trimming is applied.
I now think a heuristic approach must be used, using empirical measures of how best to move within DipSpace. An analogy is balancing – our body when walking, an inverted broom on our hand, or an unstable military aircraft. The brain's cerebellum in the first two cases, or the aircraft's computer in the last, stores a large number of ad hoc rules-of-thumb or parameters for how to react to any given situation (at least after sufficient practice or tuning), provided the situation is within feasible limits (for instance, the flight envelope). The rules or parameters would represent a non-linear function that cannot be deduced from theory, but should vary reasonably smoothly if enough are used, so interpolation can be used for intermediate situations. (As usual, extrapolation is more problematic, so should be avoided if possible.) Each rule specifies the best guess, from previous experience, of what muscle or control setting would be most likely to produce the most appropriate acceleration towards the required position, or to a known "waypoint" on the way to such a position. Sometimes, especially during practice, while there are insufficient or inadequately tuned rules or parameters, balance is lost and the system falls over – in which case it is restarted from a reasonably stable position.
In Diplomacy terms, this means looking up (and interpolating) the best guess as to what adjustment to probabilities (or "intensity" equivalent, described previously) is appropriate, given the various observed parameters of the game-state (including bias of opponents, say), and given what has worked best before in similar game-states. In principle, the final solution could be looked up (as I had proposed in DipAi even earlier), but, on reflection, I concluded that this would probably take far too many rules to be effective. (This would happen because it could not incorporate any theoretical model, so would be purely heuristic and empirical.) Instead, the look-up would merely be for an incremental change (analogous to a force), which, together with later incremental changes, would approach the desired solution. If a given rule or parameter were not quite right, or even disruptive, its error would tend to be corrected by the rule or parameter applicable to where (in DipSpace) it arrives. But even after much learning it would probably often fall over (from a mixed to a fixed strategy) when unstable in a Monte Carlo (MC) simulation. The goal would be to tune response as accurately as possible, in as much of DipSpace as possible, to maximize the time between falls. The MC would then be restarted from the best solution so far (approaching zero "energy" gradient, or "force", as a local optimum is approached). The best solution would tend to be where mean acceleration (or force) is minimum. Moving mean values should be accumulated. (Not all-time means, as opponents would be adapting too, thereby moving the optimum location for a given player in DipSpace.)
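As a toy C++ illustration of the look-up-and-interpolate idea (a 1-dimensional key, and all names, are gross simplifications of my own for the many observed game-state parameters):

    #include <iterator>
    #include <map>

    // Rules map an observed game-state parameter to a best-guess incremental
    // probability adjustment (a "force"); interpolate between stored rules,
    // and clamp at the ends to avoid extrapolation.
    double lookupAdjustment(const std::map<double, double>& rules, double key) {
        if (rules.empty()) return 0.0;                          // no experience yet
        auto hi = rules.lower_bound(key);
        if (hi == rules.end())   return std::prev(hi)->second;  // clamp high
        if (hi == rules.begin()) return hi->second;             // clamp low
        auto lo = std::prev(hi);
        double t = (key - lo->first) / (hi->first - lo->first);
        return lo->second + t * (hi->second - lo->second);      // linear interpolation
    }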
So, as previously proposed, the method should always try to reposition as close to a local (hopefully near the global) optimum as possible during a MC. But we should be resigned to the corrections being heuristic, determined empirically. Furthermore, radically different rules may be best in different regions of DipSpace. There would probably be as much ad hoc skill needed here as in other heuristic areas, such as static position evaluation. Nevertheless, firm theoretical foundations are still important, even though on the one hand a perfect algorithm may remain unknown, and on the other hand it may only be practicable to use heuristic approximations to the best theoretical ideal that is known.
When the required solution is stable it should be easy to find anyway, but the above method should still help approach it most rapidly (ideally analogously to critical damping, but once again probably too complex for an analytic method (or approximation)). It still seems that we need not choose between methods for stable and unstable solutions. Also, it seems likely that the solution could often be stable in some dimensions, but not others (maybe typically a 50% chance of each per dimension).
Even humans have mental blocks and make mistakes! But even a low-accuracy heuristic could be a great improvement on always assuming 100%, or even 50:50; small regular gains can accumulate exponentially over time, like compound interest.
I have not yet fully reformulated my approach – very much back to the drawing board! Whatever I decide, I shall test my final method in an idealised model game before trying it in the complexities of Diplomacy.