
Tournament #7

John Newbury       17 July 2012


Tournaments for AI in the Game of Diplomacy: Tournament #7


Method

Tournament #7 finished on 3 March 2006. The bots and server settings were as in Tournament #6, except that KissMyBot v2.0 was added. Once again, a slow knockout of 2000 games was used, but this time the probability of a bot playing was made proportional to the moving average of its Fitness (rather than its Strength). Fitness is a measure of a bot's ability to achieve high scores in the (tentative) DAIDE Standard Scoring Scheme; it is a moving average that is proportional to recent score+1 values, decaying to 90% after each losing game played. See Bots for details of the players.
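
As a rough illustration of these mechanics, the sketch below (Python) updates Fitness as an exponential moving average of score+1 values and picks players with probability proportional to Fitness; the 0.9/0.1 blend factor is an assumption consistent with the stated 90% decay after a loss, not the tournament's documented formula.

    import random

    DECAY = 0.9  # Fitness retained after a losing game (where score + 1 = 0)

    def update_fitness(fitness, score):
        # Moving average of recent (score + 1) values; the 0.9/0.1 blend is an
        # assumed factor, chosen so that a loss (score = -1) decays Fitness to 90%.
        return DECAY * fitness + (1.0 - DECAY) * (score + 1.0)

    def choose_players(fitness_by_bot, n_powers=7):
        # Probability of a bot playing is proportional to its current Fitness;
        # clones of the same bot may fill several powers in one game.
        bots = list(fitness_by_bot)
        weights = [fitness_by_bot[b] for b in bots]
        return random.choices(bots, weights=weights, k=n_powers)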

In the scoring scheme used, a solo winner gets M-1 points; each member of a draw gets M/N-1 points; all other players get -1 point; where M is the number of powers and N is the number of players in the draw. In other words, each player pays one point into a pot, and the winners (the soloist or the draw members) share the pot equally. If there is no solo and no draw is agreed by the surviving players, the game is terminated after the count of supply centres owned by each power has remained unchanged for kill years; all survivors are then considered to be in a draw. If a game is terminated for any other reason, it is void and each contribution to the pot is returned. The scheme is zero-sum.
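
A minimal sketch of this scoring rule, assuming the game's outcome (solo winner or draw membership) is already known:

    def daide_scores(powers, solo=None, draw=None):
        # Each of the M players pays one point into a pot: a solo winner takes it
        # all (scoring M - 1); N draw members share it equally (scoring M/N - 1);
        # everyone else scores -1; a void game returns every contribution.
        m = len(powers)
        if solo is not None:
            return {p: (m - 1 if p == solo else -1) for p in powers}
        if draw:
            n = len(draw)
            return {p: (m / n - 1 if p in draw else -1) for p in powers}
        return {p: 0 for p in powers}

    # Example: a solo in the 7-power STANDARD variant scores 6, all others -1.
    print(daide_scores(["AUS", "ENG", "FRA", "GER", "ITA", "RUS", "TUR"], solo="TUR"))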

Results

Plays and Percentages

Plays Rank
Bot     Plays    
KissMyBot v2.0 2660
Project20M v 0.1 2396
KissMyBot v1.0 2100
Man'chi AngryBot 7 1711
HaAI 0.64 Vanilla 892
DiploBot v1.1 840
DumbBot 4 726
Man'chi AttackBot 7 477
Man'chi ChargeBot 7 353
Man'chi RevengeBot 7 342
Man'chi DefenceBot 7 317
Man'chi ParanoidBot 7 312
David's HoldBot 2 309
RandBot 2 288
Man'chi RandBot 7 277
Solo Rank
Bot     Solo %    
KissMyBot v2.0 23.61
Project20M v 0.1 20.58
KissMyBot v1.0 18.24
Man'chi AngryBot 7 15.31
HaAI 0.64 Vanilla 8.74
DiploBot v1.1 7.74
DumbBot 4 6.47
Man'chi AttackBot 7 3.14
Man'chi ChargeBot 7 1.70
Man'chi RevengeBot 7 0.88
Man'chi DefenceBot 7 0.63
Man'chi ParanoidBot 7 0.00
David's HoldBot 2 0.00
Man'chi RandBot 7 0.00
RandBot 2 0.00
 
Leader Rank
Bot  Leader % 
KissMyBot v2.0 23.61
Project20M v 0.1 20.95
KissMyBot v1.0 18.29
Man'chi AngryBot 7 15.66
HaAI 0.64 Vanilla 8.86
DiploBot v1.1 7.74
DumbBot 4 6.61
Man'chi AttackBot 7 3.77
Man'chi ChargeBot 7 1.70
Man'chi RevengeBot 7 0.88
Man'chi DefenceBot 7 0.63
Man'chi RandBot 7 0.00
Man'chi ParanoidBot 7 0.00
RandBot 2 0.00
David's HoldBot 2 0.00
 
Survivor Rank
Bot  Survivor % 
Project20M v 0.1 72.12
KissMyBot v2.0 65.83
KissMyBot v1.0 64.10
Man'chi ParanoidBot 7 62.82
Man'chi AngryBot 7 61.43
HaAI 0.64 Vanilla 61.10
Man'chi DefenceBot 7 60.88
Man'chi AttackBot 7 53.04
DiploBot v1.1 52.14
Man'chi RevengeBot 7 51.46
David's HoldBot 2 48.87
Man'chi ChargeBot 7 43.06
DumbBot 4 41.05
Man'chi RandBot 7 15.52
RandBot 2 13.89

In the above tables, Plays is the total number of plays by the given bot, where each instance of a given bot in a game counts as one play; it shows the effect of the slow knockout. Solo %, Leader % and Survivor % are the percentages of the given bot's plays in which, at the end of the game, it owned more than half the supply centres, owned at least as many supply centres as any other power, or owned at least one supply centre, respectively. Each table is in descending order of its numeric field.
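
A sketch of how a single play might be classified from the final supply-centre counts (the STANDARD map has 34 centres, so a solo needs 18 or more):

    def classify_play(final_centres, power, total_centres=34):
        # final_centres maps each power to its supply-centre count at game end.
        own = final_centres[power]
        return {
            "solo": own > total_centres / 2,               # more than half: 18+ of 34
            "leader": own >= max(final_centres.values()),  # no power owns more
            "survivor": own >= 1,                          # still owns a centre
        }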

Fitness

Fitness Rank
Bot  Final Fitness 
KissMyBot2 1.506
Project20M 1.200
KissMyBot1 1.075
AngryBot 0.990
HaAI 0.482
DumbBot 0.393
DiploBot 0.267
AttackBot 0.181
ChargeBot 0.097
HoldBot 0.076
DefenceBot 0.065
RevengeBot 0.064
RandomBot 0.062
ParanoidBot 0.059
RandBot 0.055

The above table shows the final Fitness values of each bot, in descending order of Fitness.

Score Summaries

Score Summary for All Games
Bot Plays Mean Score SD of Mean
KissMyBot2 2660 0.654 0.058
Project20M 2396 0.458 0.058
KissMyBot1 2100 0.281 0.059
AngryBot 1711 0.087 0.061
HaAI 892 -0.384 0.066
DiploBot 840 -0.453 0.065
DumbBot 726 -0.543 0.064
AttackBot 477 -0.758 0.056
ChargeBot 353 -0.881 0.048
RevengeBot 342 -0.931 0.036
DefenceBot 317 -0.943 0.032
ParanoidBot 312 -0.972 0.010
HoldBot 309 -0.972 0.011
RandomBot 277 -1.000 0.000
RandBot 288 -1.000 0.000
 
Score Summary for 1st Quarter of Games
Bot Plays Mean Score SD of Mean
Project20M 436 0.983 0.151
KissMyBot2 460 0.917 0.146
KissMyBot1 390 0.529 0.146
AngryBot 339 0.378 0.151
HaAI 270 -0.166 0.138
DumbBot 234 -0.247 0.142
DiploBot 228 -0.263 0.143
AttackBot 175 -0.693 0.104
ChargeBot 140 -0.800 0.099
DefenceBot 145 -0.887 0.069
RevengeBot 145 -0.895 0.068
ParanoidBot 150 -0.960 0.018
HoldBot 118 -0.988 0.012
RandBot 145 -1.000 0.000
RandomBot 125 -1.000 0.000
 
Score Summary for 2nd Quarter of Games
Bot Plays Mean Score SD of Mean
KissMyBot2 644 0.587 0.116
Project20M 564 0.424 0.118
KissMyBot1 534 0.316 0.118
AngryBot 455 0.138 0.121
HaAI 247 -0.230 0.139
DiploBot 237 -0.316 0.135
DumbBot 169 -0.669 0.115
AttackBot 103 -0.758 0.118
RevengeBot 80 -0.895 0.090
ChargeBot 98 -0.929 0.071
HoldBot 85 -0.953 0.027
ParanoidBot 67 -0.979 0.021
RandBot 76 -1.000 0.000
DefenceBot 70 -1.000 0.000
RandomBot 71 -1.000 0.000
 
Score Summary for 3rd Quarter of Games
Bot Plays Mean Score SD of Mean
KissMyBot2 777 0.697 0.108
KissMyBot1 599 0.288 0.111
Project20M 632 0.279 0.107
AngryBot 431 -0.075 0.113
DiploBot 227 -0.562 0.112
HaAI 193 -0.703 0.101
AttackBot 117 -0.821 0.103
DumbBot 141 -0.839 0.086
HoldBot 65 -0.978 0.022
ParanoidBot 57 -0.980 0.020
ChargeBot 51 -1.000 0.000
RandBot 35 -1.000 0.000
DefenceBot 53 -1.000 0.000
RandomBot 52 -1.000 0.000
RevengeBot 70 -1.000 0.000
 
Score Summary for 4th Quarter of Games
Bot Plays Mean Score SD of Mean
KissMyBot2 779 0.511 0.103
Project20M 764 0.331 0.099
KissMyBot1 577 0.073 0.105
AngryBot 486 -0.020 0.110
DumbBot 182 -0.577 0.124
HaAI 182 -0.577 0.124
DiploBot 148 -0.795 0.094
AttackBot 82 -0.808 0.121
ChargeBot 64 -0.891 0.109
HoldBot 41 -0.957 0.043
DefenceBot 49 -0.964 0.036
RandBot 32 -1.000 0.000
ParanoidBot 38 -1.000 0.000
RandomBot 29 -1.000 0.000
RevengeBot 47 -1.000 0.000

The above tables summarise, for each bot, the number of plays, the mean (tentative DAIDE-Standard) score over those plays, and the standard deviation of that mean. The first table is for all 2000 games; the following ones are for the successive 500-game quarters of the tournament. All are in descending order of mean score.
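
A sketch of these summary statistics; whether the original analysis divided by n or n-1 when estimating the variance is not stated, so the sample variance below is an assumption:

    import math

    def summarise(scores):
        # scores: the DAIDE-Standard scores from all of one bot's plays.
        n = len(scores)
        mean = sum(scores) / n
        variance = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
        return mean, math.sqrt(variance / n)  # (mean score, SD of the mean)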

Game Lengths

The mean number of years per game was 27.1, but the distribution was very skewed. Across the 2000 games, a total of 54264 years were played. Only 18 games (0.9%) were draws (all due to termination by the server according to its kill value); 47 games (2.35%) lasted over 100 years, representing 16.4% of the total years played. The longest game lasted 1528 years, representing 2.8% of the total years played. The table below shows the final counts in this marathon game.

Bot  Power   Centres 
Man'chi AngryBot FRA 0
DumbBot RUS 0
KissMyBot2 TUR 18
Project20M ITA 2
Project20M GER 0
Project20M ENG 14
Project20M AUS 0

Conclusions

The same raw results were logged as in Tournament #6. At least with the new (tentative DAIDE-Standard) scoring scheme and high kill value (100), KissMyBot v2.0 can, with reasonable certainty, be declared the champion, ahead of Project20M in second place, KissMyBot v1.0 in third place and Man'chi AngryBot in fourth place, all others having negative mean scores. But note, from the standard deviations in the Score Summaries, that the abilities of adjacently ranked bots are not very clearly separated. Apart from randomness, arguably reasonable variations in tournament arrangements could make a difference to the ranking, as shown by KissMyBot v1.0 appearing, on balance, to be champion in Tournament #5, and by Project20M having the highest Survivor % and the highest score in the first quarter of the current tournament.

It would be more meaningful always to include some measure of error or reliability with any tournament results (between bots and/or humans, in any type of game or sport). Standard deviations are convenient, but note that, as defined, the scores do not (and cannot) have a normal (Gaussian) distribution: individual scores are quantized and have a highly skewed distribution within a limited range. With the current bots, draws were rare, so almost all scores were -1 or 6.

If all bots played randomly and there were no draws, then, in the STANDARD variant, there would be an average of 6 losses (-1 point each) for every solo (6 points), giving an RMS deviation of sqrt(6). The mean would be zero. For 2000 games, the standard deviation of the mean would be about sqrt(6/2000) = 0.055. This is close to the value obtained experimentally for the better bots. Most of their games were against their clones or near equals, so scoring was near random.
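
The arithmetic, spelt out:

    import math

    # Random play, no draws, 7 powers: score 6 with probability 1/7, -1 with 6/7.
    rms = math.sqrt((1 / 7) * 6**2 + (6 / 7) * (-1)**2)  # sqrt(6), about 2.449
    sd_of_mean = math.sqrt(6 / 2000)                      # about 0.055 for 2000 games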

The very weak bots had small standard deviations because their scores were normally -1; this was epitomised by the random bots that always lost, and so had zero deviation. (Standard deviation would also have approached or reached zero if a bot had approached or reached 100% solos.)

If a bot won and lost 50% of the time, with no draws, its standard deviation would be at a maximum, as there would be maximum uncertainty in an individual result. All the deviations from the mean, and hence the RMS deviation, would then be 3.5. For 2000 games, the standard deviation of the mean should be about 3.5/sqrt(2000) = 0.078. The more mediocre bots had the highest standard deviations, but much less than this maximum. Compared to the better bots, they would have spent a larger proportion of their games playing against the weaker bots, against which they would have had higher proportions of solos and higher standard deviations. This effect can be seen in the quarter-tournament summaries: standard deviations in comparable circumstances should be twice as large in these as for the whole tournament, but instead, the values tend to be still larger for the better bots in the first quarter, presumably due to the better pickings available then, before the frequency of plays by weak bots had decayed away.
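
Again, spelt out (the 50% solo rate is a hypothetical extreme, not an observed figure):

    import math

    # Scores of 6 and -1 equally often: mean 2.5, so every deviation is 3.5.
    sd_of_mean_max = 3.5 / math.sqrt(2000)  # about 0.078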

Standard deviations would tend to be smaller with a higher proportion of draws, since scores would then be less extreme. In any case, RMS deviations of individual scores are constrained to lie between 0 and 3.5, being about 2.45 for all-random play that leads to solos.

For the current set of bots, 2000 games was probably about the right number for ranking the better bots. For example, the 500 games in the first quarter alone would have indicated a different champion. But there is still some room for doubt with the full set of games, and the ranking is even less certain for the weaker bots. The extreme case is illustrated by the random bots, which never won at all; but even Man'chi ParanoidBot and HoldBot appear to be equally good in the all-game summary, whereas they surely do not have identical abilities. Even the relative rankings of DumbBot and DiploBot vary from quarter to quarter, which is unsurprising, given their standard deviations. The poorer resolution of the weaker bots is probably mainly because they played far fewer games, owing to the slow-knockout method. This is intentional, as it is generally more interesting to resolve the ranking of the leaders; you cannot have it both ways when playing a given number of games. Unfortunately, where the only factor of merit is soloing, there is little information available about bots that rarely solo.

Even with a large kill value (100), once again, the proportion of time spent processing pathologically long games was not excessive. Even so, in due course, it could become impracticable to run so many games in total; only a much smaller number would be viable if human players were to be included. For example, if each bot had used one second of computer time for each movement turn, each year in the STANDARD variant would take 14 seconds, ignoring other turns and overheads. So this tournament of 54264 simulated years would have taken 8.8 days of real time to run. And such thinking times may be unduly short for more advanced bots (compare with the arguably simpler game of chess). So tournaments would then normally have to have fewer games, thereby giving larger standard deviations of the means. If rankings are to be meaningful in such cases, we will have to hope that the bots' abilities are more widely dispersed. (This is probably the case with humans, and may become the case when bots use press.)
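
The timing estimate, spelt out (one second per power per movement turn is the illustrative figure assumed above):

    SECONDS_PER_YEAR = 7 * 2 * 1                    # 7 powers, 2 movement turns a year
    total_days = 54264 * SECONDS_PER_YEAR / 86400   # about 8.8 days of real time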

But the mean scores are limited to the range -1 to 6, so the abilities of only a limited number of bots can be clearly distinguished for given standard deviations. And as the number of games falls, the standard deviations of the means increase, thereby reducing the number of bot abilities that can be resolved.

However, mean score may not be the best way to express relative ability. It ought to be possible to say that one bot is a hundred times as good as another. How many times better is a bot that scores 0.5 compared with one scoring -0.5? Adding one to all scores would make such a comparison meaningful, and a logarithmic scale, that is log(score+1) or log(Fitness), might be more useful still. In particular, perhaps standard deviations would be better measured on such a logarithmic scale; that is, percentage, rather than absolute, deviations of DAIDE score are probably more appropriate, though a normal distribution should have no upper bound either. (This tournament could be reanalysed in such a manner, but more consideration would be appropriate, and it is probably not worth delaying publication of the new champion for it.)
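
For example, under such a shift-and-log mapping:

    import math

    # How many times better is a bot with mean score 0.5 than one with -0.5?
    ratio = (0.5 + 1) / (-0.5 + 1)                 # 3.0 after adding one to each score
    gap = math.log(0.5 + 1) - math.log(-0.5 + 1)   # about 1.10 on a natural-log scale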

Final Fitness values had a range of 27.4 to 1. Fitness, too, is a measure of relative ability. Unlike mean score, however, it may not have reached a steady state. Values decay exponentially towards their proper values; if a bot never wins, its Fitness will decay exponentially towards zero, but it would take an infinite number of games to get there. However, no matter how close to zero its Fitness is, a bot that has just won a game will have a certain minimum Fitness, so a snapshot of Fitness is a noisy measure; it could, though, be smoothed by averaging over more games.

The rate at which a bot is chosen to play is proportional to its Fitness. The number of plays has a smaller range than final Fitness, however, because average Fitness had a smaller range than the final values did. Statistically, the number of plays is proportional to the average of Fitness over all plays. Being averages, these are more stable, and so potentially also good measures of relative ability, but they take even longer to reach a steady state, as their values decay harmonically rather than exponentially.

Evaluation times for new bots or new versions could also be reduced by continuing a previous tournament, but with the new bots now included, initially with high Fitness (to give them a high probability of playing), or even by forcing each game to include one or more clones of the new bots alongside one or more clones of the already-rated ones. Indeed, maximum information would be gained by having half new and half old, and this would enable press-bots to show their worth while there are few other press-bots to talk to. In this way, time would not be wasted re-evaluating bots that are already rated.

Note that Diplomacy might really need more than a scalar scoring scheme. Abilities might not be transitive; that is, bot A may tend to beat bot B; B may tend to beat C; C may tend to beat A. Even if not so extreme, no doubt some players play better with some combinations of players than others; for example, HoldBot would never incur the wrath of RevengeBot, as it would never attack it. It is even possible, in principle, that an intrinsically weak bot might find a symbiotic niche in conjunction with some other bot. For example, a bot that is generally strong, but weak against ENG, might support a bot that is generally weak, but good at stalemating a potential ENG solo. Perhaps unlikely, but it would make an interesting discovery, so it may be better not to eliminate any bot totally (like keeping bio-diversity).

To follow the consensus of DAIDE members, and because little difference would have resulted with the current bots, the (tentative) DAIDE-standard scoring scheme and related Fitness values are now my preferred measures and I will probably use these alone for all my future Tournaments, without further comment on their merits or otherwise. However, to better represent the range and separations of abilities, a mapping function may be applied to mean values, perhaps as outlined above.

