Tournaments for AI in the Game of Diplomacy: Tournament #7
Tournament #7 finished on 3 March 2006. The bots and Server settings were as in Tournament #6, except that KissMyBot v2.0 was added. Once again, a slow-knockout of 2000 games was used, but this time the probability of a bot playing was made proportional to the moving average of its Fitness (rather than its Strength). Fitness is a measure of a bot's ability to achieve high scores in the (tentative) DAIDE Standard Scoring Scheme; it is a moving average that is proportional to recent score+1 values, decaying to 90% after each losing game played. See Bots for details of the players.
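As a rough sketch of this update rule (my reading of it, not the server's actual code): treating Fitness as an exponential moving average of score+1 values, with a decay factor of 0.9, reproduces the behaviour described, since a loss (score -1) contributes nothing and simply scales the previous value down to 90%.

```python
DECAY = 0.9  # fraction of Fitness retained after each game, per the description above

def update_fitness(fitness, score, decay=DECAY):
    """One possible form of the Fitness update: an exponential moving
    average of (score + 1).  After a loss (score = -1) the new term is
    zero, so Fitness simply decays to 90% of its previous value."""
    return decay * fitness + (1 - decay) * (score + 1)
```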
In the scoring scheme used, a solo winner gets M-1 points; each member of a draw gets M/N-1 points; other players get -1 point; where M is the number of powers and N is the number of players in the draw. In other words, each player pays one point into a pot; winners (solo or draw members) share the pot equally. If there is no solo and no draw is agreed by the surviving players, then the game is terminated once the supply-centre counts have remained unchanged for kill years; all survivors are then considered to be in a draw. If a game is terminated for any other reason, it is void and each contribution to the pot is returned. It is a zero-sum game.
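For concreteness, here is a minimal sketch of the scheme just described (the function and argument names are mine, not part of any DAIDE specification):

```python
def daide_standard_scores(powers, solo=None, draw=None):
    """Score one finished game under the (tentative) DAIDE Standard Scoring
    Scheme described above: each of the M powers pays 1 point into a pot,
    which is taken whole by a solo winner or shared equally by the N draw
    members.  Returns an empty dict for a void game (stakes returned)."""
    M = len(powers)
    if solo is not None:
        return {p: (M - 1 if p == solo else -1) for p in powers}
    if draw:
        N = len(draw)
        return {p: (M / N - 1 if p in draw else -1) for p in powers}
    return {}

# Example: in the STANDARD variant (M = 7), a solo is worth 6 points and a
# two-way draw is worth 7/2 - 1 = 2.5 points to each of its members.
```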
[Tables of Plays, Solo %, Leader % and Survivor % for each bot]
In the above tables, Plays is the total number of plays by the given bot, where each instance of that bot in a game counts as one play; it shows the effect of the slow-knockout. Solo %, Leader % and Survivor % are the percentages of plays by the given bot in which, at the end of the game, it owned more than half the supply centres, owned at least as many supply centres as any other power, or owned at least one supply centre, respectively. Each table is in descending order of its numeric field.
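A sketch of how those three columns might be computed for a single play (the names, and the map total of 34 supply centres for the STANDARD variant, are my assumptions):

```python
def play_flags(final_centres, power, map_total=34):
    """Classify one play for the Solo %, Leader % and Survivor % columns.
    `final_centres` maps each power to its supply-centre count at the end
    of the game; `map_total` is the number of centres on the map (34 in
    the STANDARD variant)."""
    own = final_centres[power]
    return {
        "solo": own > map_total / 2,                   # more than half the supply centres
        "leader": own == max(final_centres.values()),  # at least as many as any other power
        "survivor": own >= 1,                          # still owns at least one centre
    }
```

Each percentage is then 100 times the number of plays with the corresponding flag set, divided by the bot's total Plays.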
Fitness Rank
Bot | Final Fitness |
---|---|
KissMyBot2 | 1.506 |
Project20M | 1.200 |
KissMyBot1 | 1.075 |
AngryBot | 0.990 |
HaAI | 0.482 |
DumbBot | 0.393 |
DiploBot | 0.267 |
AttackBot | 0.181 |
ChargeBot | 0.097 |
HoldBot | 0.076 |
DefenceBot | 0.065 |
RevengeBot | 0.064 |
RandomBot | 0.062 |
ParanoidBot | 0.059 |
RandBot | 0.055 |
The above table shows the final Fitness values of each bot, in descending order of Fitness.
Score Summary for All Games
Bot | Plays | Mean Score | SD of Mean |
---|---|---|---|
KissMyBot2 | 2660 | 0.654 | 0.058 |
Project20M | 2396 | 0.458 | 0.058 |
KissMyBot1 | 2100 | 0.281 | 0.059 |
AngryBot | 1711 | 0.087 | 0.061 |
HaAI | 892 | -0.384 | 0.066 |
DiploBot | 840 | -0.453 | 0.065 |
DumbBot | 726 | -0.543 | 0.064 |
AttackBot | 477 | -0.758 | 0.056 |
ChargeBot | 353 | -0.881 | 0.048 |
RevengeBot | 342 | -0.931 | 0.036 |
DefenceBot | 317 | -0.943 | 0.032 |
ParanoidBot | 312 | -0.972 | 0.010 |
HoldBot | 309 | -0.972 | 0.011 |
RandomBot | 277 | -1.000 | 0.000 |
RandBot | 288 | -1.000 | 0.000 |
[Score Summary tables for each successive 500-game quarter]
The above tables summarise, for each bot, the number of plays, the mean (tentative DAIDE-Standard) score (for those plays) and standard deviation of the mean. The first table is for all 2000 games; the following ones are for successive 500-game quarters of the tournament. All are in descending order of mean score.
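The standard deviation of the mean (the standard error) can be computed per bot roughly as follows; this is a sketch, as whether the tournament used the population or the sample variance is not stated:

```python
import math

def mean_and_sd_of_mean(scores):
    """Mean score over a bot's plays, and the standard deviation of that
    mean, estimated as the RMS deviation divided by sqrt(number of plays)."""
    n = len(scores)
    mean = sum(scores) / n
    rms_dev = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    return mean, rms_dev / math.sqrt(n)
```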
The mean number of years per game was 27.1, but the distribution was very skew. Across the 2000 games, a total of 54264 years were played. Only 18 games (0.9%) were draws (all due to termination by the server according to its kill value); 47 games (2.35%) lasted over 100 years, representing 16.4% of the total years played. The longest game lasted 1528 years, representing 2.8% of the total years played. The table below shows the final counts in this marathon game.
Bot | Power | Centres |
---|---|---|
Man'chi AngryBot | FRA | 0 |
DumbBot | RUS | 0 |
KissMyBot2 | TUR | 18 |
Project20M | ITA | 2 |
Project20M | GER | 0 |
Project20M | ENG | 14 |
Project20M | AUS | 0 |
The same raw results were logged as in Tournament #6. At least with the new (tentative DAIDE-Standard) scoring scheme and high kill value (100), KissMyBot v2.0 can, with reasonable certainty, be declared the champion, ahead of Project20M in second place, KissMyBot v1.0 in third place and Man'chi AngryBot in fourth place; all others had negative mean scores. But note, from the standard deviations in the Score Summaries, that the abilities of adjacently ranked bots are not very clearly separated. Apart from randomness, reasonable variations in tournament arrangements could arguably make a difference to the ranking, as shown by KissMyBot v1.0 appearing, on balance, to be champion in Tournament #5, and Project20M having the highest Survivor % and the highest score in the first quarter of the current Tournament.
It would be more meaningful always to include some measure of error or reliability of any tournament results (between bots and/or humans, in any type of game or sport). Standard deviations are convenient, but note that, as defined, the scores do not (and cannot) have a normal (Gaussian) distribution: individual scores are quantized and have a highly skew distribution, within a limited range. With the current bots, draws were rare, so almost all scores were -1 or 6.
If all bots played randomly and there were no draws, then, in the STANDARD variant, there would be an average of 6 losses (-1 point each) for every solo (6 points), giving an RMS deviation of sqrt(6). The mean would be zero. For 2000 games, the standard deviation of the mean would be about sqrt(6/2000) = 0.055. This is close to the value obtained experimentally for the better bots. Most of their games were against their clones or near equals, so scoring was near random.
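Checking that arithmetic (a seven-player game with one solo at +6 and six losses at -1 each):

```python
import math

p_solo = 1 / 7                                       # random play, no draws
mean = p_solo * 6 + (1 - p_solo) * (-1)              # = 0
rms = math.sqrt(p_solo * 6 ** 2 + (1 - p_solo) * 1)  # = sqrt(6) ~ 2.449
sd_of_mean = rms / math.sqrt(2000)                   # ~ 0.055
```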
The very weak bots had small standard deviations because their scores were normally -1; this was epitomised by the random bots that always lost, and so had zero deviation. (Standard deviation would also have approached or reached zero if a bot had approached or reached 100% solos.)
If a bot won and lost 50% of the time, with no draws, its standard deviation would be maximal, as there would be maximum uncertainty in an individual result. All the deviations from the mean, and hence the RMS deviation, would then be 3.5. For 2000 games, the standard deviation of the mean should be about 3.5/sqrt(2000) = 0.078. The more mediocre bots had the highest standard deviations, but much less than this maximum. Compared to the better bots, they would have spent a larger proportion of their games playing against the weaker bots, in which they would have had higher proportions of solos and higher standard deviations. This effect can be seen in the quarter-tournament summaries: standard deviations in comparable circumstances should be twice as large in these as for the whole tournament, but, in fact, the values for the better bots in the first quarter tend to be larger still, presumably due to better pickings being available then, before the frequency of plays by weak bots had decayed away.
Standard deviations would tend to be smaller with a higher proportion of draws, since scores would then be less extreme. In any case, RMS values of individual scores are constrained to lie between 0 and 3.5, being about 2.45 for all-random plays that lead to solos.
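And for the other figures quoted above:

```python
import math

# 50% solos, 50% losses: scores of 6 and -1, mean 2.5, every deviation 3.5.
max_rms_dev = 3.5
sd_of_mean_full = max_rms_dev / math.sqrt(2000)  # ~ 0.078 for the whole tournament
sd_of_mean_quarter = max_rms_dev / math.sqrt(500)  # twice as large for a 500-game quarter
all_random_rms = math.sqrt(6)                    # ~ 2.45, within the 0 to 3.5 range
```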
For the current set of bots, 2000 games was probably about the right number for ranking the better bots. For example, the 500 games in the first quarter would have indicated a different champion. But there is still some room for doubt with the full set of games, and the ranking is even less certain for the weaker bots. The extreme case is illustrated by the random bots, which never won at all; but even Man'chi ParanoidBot and HoldBot appear to be equally good in the all-game summary, whereas they surely do not have identical abilities. Even the relative rankings of DumbBot and DiploBot vary from quarter to quarter, which is unsurprising, given their standard deviations. The poorer resolution of the weaker bots is probably mainly because they played fewer games, owing to the slow-knockout method. This is intentional, as it is generally more interesting to resolve the ranking of the leaders; you cannot have it both ways when playing a given number of games. Unfortunately, where the only factor of merit is soloing, there is little information available about bots that rarely solo.
Even with a large kill value (100), once again, the proportion of time spent processing pathologically long games was not excessive. Even so, in due course, it could become impracticable to run so many games in total; only a much smaller number would be viable if human players were to be included. For example, if each bot had used one second of computer time for each movement turn, each year in the STANDARD variant would take 14 seconds, ignoring other turns and overheads. So this tournament of 54264 simulated years would have taken 8.8 days of real time to run. And such thinking times may be unduly short for more advanced bots (compare with the arguably simpler game of chess). So tournaments would then normally have to have fewer games, thereby giving larger standard deviations of the means. If rankings are to be meaningful in such cases, we will have to hope that the players' abilities are more widely dispersed. (This is probably the case with humans and may become the case when bots use press.)
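The time estimate works out as follows (one second per bot per movement turn, two movement turns per year, seven powers):

```python
seconds_per_year = 7 * 2 * 1             # 7 powers x 2 movement turns x 1 s each = 14 s
total_seconds = 54264 * seconds_per_year
days = total_seconds / (60 * 60 * 24)    # ~ 8.8 days of real time
```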
But the mean scores are limited to the range -1 to 6, so the abilities of only a limited number of bots can be clearly distinguished, for given standard deviations. And as the number of games falls, the standard deviations of the means increase, thereby reducing the number of bot abilities that can be resolved.
However, mean score may not be the best way to express relative ability. It ought to be possible to say that one bot is a hundred times as good as another. How many times better is a bot that scores 0.5 than one scoring -0.5? Adding one to all scores would make such ratios meaningful, and a logarithmic scale, that is, log(score+1) or log(Fitness), might be more useful still. In particular, standard deviations might be better measured on such a logarithmic scale; that is, percentage, rather than absolute, deviations of DAIDE score are probably more appropriate, though a normal distribution should have no upper bound either. (This tournament could be reanalysed in such a manner, but more consideration would be appropriate, and it is probably not worth delaying publication regarding the new champion.)
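As a sketch of such a mapping (illustrative only; note that a bot that always loses, with mean score -1, maps to minus infinity):

```python
import math

def log_merit(mean_score):
    """Logarithmic measure of merit, log(score + 1), as suggested above.
    A bot with mean score 0.5 maps to log(1.5) and one with -0.5 to
    log(0.5), so their factor-of-3 ratio of (score + 1) values becomes a
    fixed difference on this scale.  Mean score -1 maps to -infinity."""
    return math.log(mean_score + 1)
```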
Final Fitness values had a range of 27.4 to 1. Fitness, too, is a measure of relative ability. Unlike mean score, however, it may not have reached a steady state. Values decay exponentially towards their proper values; if a bot never wins, its Fitness will decay exponentially towards zero, but would take an infinite number of games to arrive there. However, no matter how close to zero its Fitness is, a bot that has just won a game will have a certain minimum Fitness, so a snapshot of Fitness is a noisy measure; it could, though, be smoothed by averaging over more games.
The rate of choosing a bot to play is proportional to its Fitness. The number of plays has a smaller range than final Fitness, however, because average Fitness had a smaller range than its final value. Statistically, the number of plays is proportional to the average of Fitness over all plays. Being averages, these are more stable, and so potentially also good measures of relative ability, but they take even longer to reach steady state, as values decay harmonically rather than exponentially.
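A sketch of the selection rule as I understand it (the server's actual sampling details, such as whether clones are drawn with or without replacement, are not given here):

```python
import random

def choose_players(fitness, num_powers=7):
    """Pick the bots for the next game with probability proportional to
    their current Fitness values.  Samples with replacement here, so the
    same bot may appear as several clones in one game."""
    bots = list(fitness)
    weights = [fitness[b] for b in bots]
    return random.choices(bots, weights=weights, k=num_powers)
```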
Evaluation times for new bots or new versions could also be reduced by continuing a previous tournament, but with the new bots now included, initially with high Fitness (to give them a high probability of playing), or even by forcing each game to include one or more clones of the new bots alongside one or more clones of the rated ones. Indeed, maximum information would be gained by having half new and half old, and this would enable press-bots to show their worth while there are few other press-bots to talk to. In this way, time would not be wasted re-evaluating bots that are already rated.
Note that Diplomacy might really need more than a scalar scoring scheme. Abilities might not be transitive; that is, bot A may tend to beat bot B; B may tend to beat C; C may tend to beat A. Even if not so extreme, no doubt some players play better with some combinations of players than others; for example, HoldBot would never incur the wrath of RevengeBot, as it would never attack it. It is even possible, in principle, that an intrinsically weak bot might find a symbiotic niche in conjunction with some other bot. For example, a bot that is generally strong, but weak against ENG, might support a bot that is generally weak, but good at stalemating a potential ENG solo. Perhaps unlikely, but it would make an interesting discovery, so it may be better not to eliminate any bot totally (like keeping bio-diversity).
To follow the consensus of DAIDE members, and because little difference would have resulted with the current bots, the (tentative) DAIDE-standard scoring scheme and related Fitness values are now my preferred measures and I will probably use these alone for all my future Tournaments, without further comment on their merits or otherwise. However, to better represent the range and separations of abilities, a mapping function may be applied to mean values, perhaps as outlined above.