Tournaments for AI in the Game of Diplomacy: Tournament #3
Tournament #3 finished on 18 November 2005. The same bots and Server settings were used as in Tournament #2. A tournament of 2000 games (compared with 1000 previously) was once again directed by the (ad hoc) DeepLoamSea Tournament Director Tools (DTDT), but using a different method of selecting bots for each game (Slow Knockout). For convenience, moving averages of additional values (Strength and Tenacity) were recorded for each bot playing in each game. See Bots for details of the players.
The Strength of a player is defined here to be a measure of its success at scoring well, with minimal sharing. This is taken to be its ability to be a leader; that is, to have at least as many supply centres as any of the other players when a game is terminated. As previously, games were terminated when there was an outright solo win, or when the number of supply centres owned by each player had not changed for y years (where y = 1 in this case). The Strength of a player in a given game is 0 if it is not a leader; otherwise it is number_of_players / number_of_leaders. For p players (7 in the STANDARD variant used), there are therefore p units of Strength to be shared between the leaders; each gets 1 unit if they all finish equal.
The Tenacity of a player is defined here to be a measure of its success at surviving games, with minimal sharing; that is, having at least one supply centre when the game is terminated. Analogously to Strength, the Tenacity of a player in a given game is 0 if it is not a survivor, otherwise it is number_of_players / number_of_survivors. For p players, there are therefore p units of Tenacity to be shared between the survivors; each gets 1 unit if they all finish equal.
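As an illustration of these two definitions, here is a minimal sketch in Python (the function and variable names are hypothetical, not from DTDT):

    def per_game_scores(centre_counts):
        # Per-game Strength and Tenacity from final supply-centre counts.
        # A sketch of the definitions above, not the actual DTDT code.
        p = len(centre_counts)                 # 7 in the STANDARD variant
        top = max(centre_counts)
        leaders = [c == top and c > 0 for c in centre_counts]
        survivors = [c > 0 for c in centre_counts]
        n_leaders, n_survivors = sum(leaders), sum(survivors)
        strength = [p / n_leaders if led else 0.0 for led in leaders]
        tenacity = [p / n_survivors if sur else 0.0 for sur in survivors]
        return strength, tenacity

    # Example: a solo winner with 18 centres among 7 players.
    s, t = per_game_scores([18, 6, 4, 3, 2, 1, 0])
    # s[0] == 7.0 (sole leader); t gives 7/6 to each of the six survivors.

Both score vectors sum to p (7 here), consistent with there being p units of Strength and of Tenacity to share.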
Moving averages for the Strength and Tenacity of each playing bot were calculated by:
new_average = alpha * new_value + (1 - alpha) * old_average
where alpha is the relative weight given to the new_value (the Strength or Tenacity while playing in this game). Here, alpha was set to 1% in both cases. So all players of a given game first lost 1% of their own values, then an absolute 7% (alpha * p) was shared between the leaders or survivors, as appropriate. If all were to draw every time, all their Strengths and Tenacities would approach 1. NB: the process is zero-sum with respect to gains, as these are absolute, but not with respect to losses, as these depend on each bot's current value; when losing, strong players lose more than weak ones in a given game. Averages of Strength and Tenacity were initialised to 1, which is the value all bots would approach (from any starting value) if they were equally good.
The above formula was applied for each instance of a given bot in a given game; the order of evaluation is unimportant.
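A minimal sketch of this update, with alpha = 1% as used here (hypothetical names, not DTDT code):

    ALPHA = 0.01   # relative weight given to the new value (1% here)

    def update_average(old_average, new_value, alpha=ALPHA):
        # Exponential moving average, as in the formula above.
        return alpha * new_value + (1 - alpha) * old_average

    # A non-leader (new_value = 0) loses 1% of its current Strength;
    # a sole leader of 7 players (new_value = 7) at 1.0 gains an absolute 7%.
    assert abs(update_average(1.0, 0.0) - 0.99) < 1e-12
    assert abs(update_average(1.0, 7.0) - 1.06) < 1e-12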
When choosing a bot for a game, the probability of a given bot being chosen was proportional to the latest moving average value of its Strength. As the moving average Strength values diverge, and the bots' relative merits become apparent, the weaker ones tend to be chosen less often than the stronger ones. I call this method Slow Knockout; the rate of knockout being proportional to alpha.
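Selection proportional to Strength can be sketched as follows; this is a minimal illustration assuming a strength mapping from bot name to its latest moving average, not the actual DTDT code:

    import random

    def choose_bot(strength):
        # Pick one bot with probability proportional to its latest
        # moving-average Strength: the Slow Knockout selection rule.
        bots = list(strength)
        weights = [strength[b] for b in bots]
        return random.choices(bots, weights=weights, k=1)[0]

    # Example: a bot whose Strength has decayed to 0.1 is chosen
    # about 14 times less often than one still at 1.4.
    pick = choose_bot({"DumbBot": 1.4, "RandomBot": 0.1})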
The idea is to simulate evolution by survival of the fittest (highest Strength in this case). The relative density (probability of playing, in this case) of various competing species (bots) depends on their relative fitness. If too unfit, a species may go extinct (here we only approach extinction), but a less fit species may be able to coexist, at low density, with fitter ones. Or, one species may decline, only to rebound when another competitor has been sufficiently suppressed by a third one.
Makeweights, such as random bots, and less obvious cases, should be automatically suppressed, thereby giving more time for evaluating the more intelligent bots and giving these more serious competition. But we allow for the possibility that any bot can re-emerge from obscurity if it can thrive in a newly formed community. It is possible that all but one bot would get ever smaller probabilities of competing, in which case we approach a solo win. However, a so-called "evolutionary stable state" may emerge, comprising more than one bot with fixed relative probabilities, generally with periodic and/or chaotic oscillations superimposed; hopefully discernible above the random noise.
See graphs in Tournament #3: Strength by Game and Tournament #3: Tenacity by Game. For an easy overall comparison, see the four graphs from both Tournaments #3 and #4 – click on any graph there for an expanded view. In the key, the same symbol assignments are used; the word Man'chi and version numbers have been omitted; RandomBot means Man'chi RandBot. Symbols for all bots are visible at least in parts of all the above graphs if viewed at full resolution – the less obvious ones are near the bottom. Note that the graph range for Strength is over twice that for Tenacity. The style is intended to show the general trends and variability, rather than precise values.
The following tables show various statistics more precisely: Tables 1a-d for Strength and 2a-d for Tenacity; (a) Last, (b) Average, (c) Minimum, (d) Maximum. (All the values before the first game were exactly 1, but this is excluded from the statistics.)
The Results were comparable to those of Tournament #2 (comparing Strength with winning and Tenacity with surviving), and likewise the general Conclusions of Tournament #2 are also applicable here and are not reiterated. There was a little reordering where values were close.
As might be expected, Strength produced wider separation between bots than Tenacity. From the graphs there was much apparently random movement, but trends can be seen. Bear in mind the differences in vertical scale.
Considering the Strength graphs: all values tended to decay during most of the Tournament. The values for poor bots decay because they are poor, very smoothly in the poorest cases. The better ones initially gain at the expense of the poorer ones, then decay as, on average, they tend to face ever stiffer competition. It is interesting that near the end some of the poorer ones seem to rise systematically for a while; it may be a random walk, but it is plausible that they have found a temporary niche combination of other bots in which they can moderately thrive. A more sophisticated analysis, and probably many more games, would be needed to determine whether the latter mechanism really has a significant effect. In any case, even if the effect is minimal with the current set of bots, it might well be important for some future set.
Considering the Tenacity graphs: two bots (RandBot and Man'chi RandBot) separated very distinctly from the others very early on, and much more than their Strengths ever did. (NB: The latter is labelled as RandomBot, and is more like ConsBot, in that it avoids impossible moves, such as supporting a move that it does not order. Although the Server rejects such moves, the effect on David RandBot might not be identical.) Also, these two bots seem to reach a stable position (about 0.4), well above zero Tenacity (whereas they gradually approached zero Strength). So here they seem able to coexist with the other bots, at a reduced probability of playing, but still about a third of that of the best.
Another difference apparent from the graphs is that Tenacity seemed to stabilise much more rapidly than Strength, especially for the less random bots, which rapidly occupied a narrow band (especially considering the expanded vertical scale used for Tenacity). However, the rate of stabilisation may be illusory; for instance, DumbBot is still on a downward trend, so it may not be appropriate to reduce the number of games in a tournament based on Tenacity. There was some tendency for bots to maintain their ranking level, but there was, nevertheless, much crossover; more so for Tenacity than for Strength, probably because the values were less separated. So the final ranking would have a bigger random element when using Tenacity than when using Strength. In both cases, using an average would be advisable to minimise randomness. However, it would probably be more meaningful to exclude the earlier games from the average, to find the ranking in the long-term, steady-state situation.
Overall, the results and differences between Strength and Tenacity are consistent with casual observation: random (or even very weak) bots very rarely win (but I have seen a random bot win in some combinations), yet random bots are often not eliminated before a player wins outright (or play becomes too repetitive to be worth continuing).
While a bot is not a leader or a survivor, the moving averages of its Strength or Tenacity, respectively, decay exponentially with respect to plays; they are frozen when not playing, but move even faster when playing several powers in a given game. Since the probability, and hence rate, of being assigned a power depends on the current moving average of Strength, when losing, the exponential decay rate of both values with respect to tournament games must itself decay exponentially with plays. (And the rate of decay of that decay, and so on at any level, similarly decays.) So the decay rate with respect to games should be slower than exponential, and becomes progressively slower still. (The effect is small, though; it is not obvious from the graphs.) This means that the method is less effective at suppressing weak bots than might be assumed. On the other hand, by avoiding over-zealous suppression, any bot that initially performed poorly can more rapidly rebound if suitable conditions (a favourable distribution of frequently chosen bots) occur later. (We need some "shrews" to survive for when the "dinosaurs" may die off.) In particular, a dramatic reorganisation may be appropriate if a new bot is added, or an existing (advanced) bot learns a new trick. Nevertheless, if a set of bots plays for a very large number of games before a new bot is added (or one has a "eureka" moment), it might be advisable not to allow moving averages (and hence probabilities) to fall below moderately small values; otherwise a comparable number of additional games may be needed to stabilise on values that are applicable to the new situation. However, this may not be a significant problem if the set stabilises with many bots having moving averages that are not too far below the highest.
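A rough numerical sketch of this slower-than-exponential decay, under simplifying assumptions (one power per game, the bot in question never leads, and a pool of opponents whose total Strength stays fixed at 6):

    ALPHA = 0.01

    def strength_after_games(n_games, s0=1.0, others_total=6.0, alpha=ALPHA):
        # Expected Strength of a perpetual loser after n_games tournament
        # games. It plays each game with probability s / (s + others_total),
        # and only then does its average decay by a factor (1 - alpha).
        # A toy model of the argument above, not DTDT code.
        s = s0
        for _ in range(n_games):
            p_play = s / (s + others_total)
            s *= 1 - alpha * p_play   # expected multiplicative update
        return s

    # Pure exponential decay over 2000 plays would give 0.99**2000,
    # about 2e-9; decay per game is far slower:
    print(strength_after_games(2000))   # roughly 0.24 under these assumptions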
If only a few bots come to dominate, they will increasingly tend to play multiple copies in a given game. Each is then increasingly competing against itself to a degree; they may or may not work effectively with their clones, but if the effect is neutral then the score for each game will approach 1, becoming exactly 1 when only one type of bot is playing. The moving average of the leader would then approach 1, and those of the other bots would become progressively frozen. To avoid such pointless games, it would normally be useful to inhibit more than a certain number of clones of a given bot. An exception could be when the tournament is intended to facilitate the champion bot tuning itself prior to playing unknown players (bot or human).
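One way to inhibit clones, extending the earlier selection sketch (the max_clones cap is hypothetical, not a DTDT feature):

    import random

    def choose_powers(strength, n_powers=7, max_clones=2):
        # Fill all powers of one game, capping how many copies of any
        # single bot may appear; needs at least
        # ceil(n_powers / max_clones) distinct bots.
        chosen = []
        for _ in range(n_powers):
            eligible = [b for b in strength if chosen.count(b) < max_clones]
            weights = [strength[b] for b in eligible]
            chosen.append(random.choices(eligible, weights=weights, k=1)[0])
        return chosen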
It might be more meaningful to use logs of Strength and Tenacity, rather than absolute values. This would better show up differences between poor bots (though they are normally of little interest). It would also give stability to the values for poor bots; as it is, however close they approach zero, a spurious win or draw raises their Strength or Tenacity to between 1% and 7% (absolute). It might also prove better, on practical and/or theoretical grounds, for each game to be completely zero-sum in some sense, which is the case for logs as the decay rate, alpha, approaches zero. To make logs exactly zero-sum (although it would have made little difference here, and noise would have swamped the effect), the sum of the log values should be constant. Strictly, then, the absolute values for all p players in a game should be divided by alog(beta), and those for the w leaders multiplied by alog(p*beta/w), where alog(beta) = 1/(1-alpha), so that beta is approximately alpha for small alpha. In log terms, this subtracts beta from every player and adds p*beta/w to each leader. What was effectively done here subtracts alpha (equivalent, for small alpha) but adds an absolute p*alpha/w, which is not exactly equivalent when w < p.
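In symbols (a sketch of this bookkeeping, writing s_i for a player's absolute value and beta = -log(1 - alpha)):

    \[
      \log s_i' =
      \begin{cases}
        \log s_i - \beta + \dfrac{p\beta}{w} & \text{if player } i \text{ is a leader,}\\
        \log s_i - \beta & \text{otherwise,}
      \end{cases}
      \qquad
      \sum_{i=1}^{p} \Delta\log s_i = -p\beta + w \cdot \frac{p\beta}{w} = 0,
    \]
    \[
      \text{with } \beta = -\log(1-\alpha) \approx \alpha \text{ for small } \alpha .
    \]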
Using the above log variation, where, as here, the moving average of Strength was used to control choice of bots for each power, this value for the strongest bot would tend to rise until there was a negligible chance of any other bot playing in a game. As such an exclusive state approached, all moving averages would change progressively more slowly; they would not move at all when there was only one bot playing all powers. To avoid running progressively more such pointless games, poorer bots would have to be included by some mechanism. Maybe the current method is, after all, the best way to do this.
The score could be made zero-sum without using logs, provided the absolute adjustment was independent of the players' current values. This may be conceptually simpler, but, depending on the adjustment rate, low values could jump even more wildly (even becoming negative), rather than exponentially decaying, and/or large values could change unreasonably slowly. In any case, presenting results would probably be more useful if logarithmic, even though there would then be less apparent difference between the leaders.
The value of alpha used is somewhat arbitrary. The chosen value (1%) reduces the values for all players in a game by 1% of their current values, followed by 1% to 7% absolute gains for the leaders or survivors, which may have been a reasonable compromise. Smaller values would reduce random effects in proportion, which would be useful for identifying real trends for a given bot as the distribution of bots changed. On the other hand, for a given number of games, the distribution of bots would then change correspondingly less, so would be less likely to reach a long-term equilibrium. Perhaps it would be more efficient to use a large alpha initially, to rapidly reach approximate equilibrium, then gradually reduce it to reduce random effects; a form of simulated annealing.
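Such a schedule might look like the following sketch (the constants are purely illustrative):

    def annealed_alpha(game_number, alpha_start=0.05, alpha_end=0.005,
                       half_life=500.0):
        # Decay alpha geometrically from alpha_start towards alpha_end:
        # coarse, fast adaptation early; low-noise estimates later.
        decay = 0.5 ** (game_number / half_life)
        return alpha_end + (alpha_start - alpha_end) * decay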
There does appear to be an evolutionary "stable" state that includes most of the bots, albeit with much noise or chaotic oscillation; no obvious periodic oscillation.
On balance, this Slow Knockout method, or some variation, seems useful; probably better than trying to give each bot an equal number of plays (as done in Tournaments #1 and #2). It is also probably better than a (conventional) Total Knockout method, especially for a never-ending tournament for developing and tuning a bot or selecting its better variants (which was my original impetus for developing DTDT for running tournaments).