Tournaments for AI in the Game of Diplomacy: Tournament #7
Tournament #7 finished on 3 March 2006. The bots and Server settings were as in Tournament #6, except that KissMyBot v2.0 was added. Once again, a slow-knockout of 2000 games was used, but this time the probability of a bot playing was made proportional to the moving average of its Fitness (rather than its Strength). Fitness is a measure of a bot's ability to achieve high scores in the (tentative) DAIDE Standard Scoring Scheme; it is a moving average that is proportional to recent score+1 values, decaying to 90% after each losing game played. See Bots for details of the players.
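As a rough sketch of this update rule (my reading of it, not the server's actual code): treating Fitness as an exponential moving average of score+1 values, with a decay factor of 0.9, reproduces the behaviour described, since a loss (score -1) contributes nothing and simply scales the previous value down to 90%.

```python
DECAY = 0.9  # fraction of Fitness retained after each game, per the description above

def update_fitness(fitness, score, decay=DECAY):
    """One possible form of the Fitness update: an exponential moving
    average of (score + 1).  After a loss (score = -1) the new term is
    zero, so Fitness simply decays to 90% of its previous value."""
    return decay * fitness + (1 - decay) * (score + 1)
```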
In the scoring scheme used, a solo winner gets M-1 points; each member of a draw gets M/N-1 points; other players get -1 point; where M is the number of powers and N is the number of players in the draw. In other words, each player pays one point into a pot; winners (solo or draw members) share the pot equally. If there is no solo and no draw is agreed by the surviving players, then the game is terminated once the supply-centre counts have remained unchanged for kill years; all survivors are then considered to be in a draw. If a game is terminated for any other reason, it is void and each contribution to the pot is returned. It is a zero-sum game.
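For concreteness, here is a minimal sketch of the scheme just described (the function and argument names are mine, not part of any DAIDE specification):

```python
def daide_standard_scores(powers, solo=None, draw=None):
    """Score one finished game under the (tentative) DAIDE Standard Scoring
    Scheme described above: each of the M powers pays 1 point into a pot,
    which is taken whole by a solo winner or shared equally by the N draw
    members.  Returns an empty dict for a void game (stakes returned)."""
    M = len(powers)
    if solo is not None:
        return {p: (M - 1 if p == solo else -1) for p in powers}
    if draw:
        N = len(draw)
        return {p: (M / N - 1 if p in draw else -1) for p in powers}
    return {}

# Example: in the STANDARD variant (M = 7), a solo is worth 6 points and a
# two-way draw is worth 7/2 - 1 = 2.5 points to each of its members.
```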
[Tables of Plays, Solo %, Leader % and Survivor % for each bot]
In the above tables, Plays is the total number of plays by the given bot, where each instance of that bot in a game counts as one play; it shows the effect of the slow-knockout. Solo %, Leader % and Survivor % are the percentages of plays by the given bot in which, at the end of the game, it owned more than half the supply centres, owned at least as many supply centres as any other power, or owned at least one supply centre, respectively. Each table is in descending order of its numeric field.
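A sketch of how those three columns might be computed for a single play (the names, and the map total of 34 supply centres for the STANDARD variant, are my assumptions):

```python
def play_flags(final_centres, power, map_total=34):
    """Classify one play for the Solo %, Leader % and Survivor % columns.
    `final_centres` maps each power to its supply-centre count at the end
    of the game; `map_total` is the number of centres on the map (34 in
    the STANDARD variant)."""
    own = final_centres[power]
    return {
        "solo": own > map_total / 2,                   # more than half the supply centres
        "leader": own == max(final_centres.values()),  # at least as many as any other power
        "survivor": own >= 1,                          # still owns at least one centre
    }
```

Each percentage is then 100 times the number of plays with the corresponding flag set, divided by the bot's total Plays.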
Fitness Rank
Bot | Final Fitness |
---|---|
KissMyBot2 | 1.506 |
Project20M | 1.200 |
KissMyBot1 | 1.075 |
AngryBot | 0.990 |
HaAI | 0.482 |
DumbBot | 0.393 |
DiploBot | 0.267 |
AttackBot | 0.181 |
ChargeBot | 0.097 |
HoldBot | 0.076 |
DefenceBot | 0.065 |
RevengeBot | 0.064 |
RandomBot | 0.062 |
ParanoidBot | 0.059 |
RandBot | 0.055 |
The above table shows the final Fitness values of each bot, in descending order of Fitness.
Score Summary for All Games
Bot | Plays | Mean Score | SD of Mean |
---|---|---|---|
KissMyBot2 | 2660 | 0.654 | 0.058 |
Project20M | 2396 | 0.458 | 0.058 |
KissMyBot1 | 2100 | 0.281 | 0.059 |
AngryBot | 1711 | 0.087 | 0.061 |
HaAI | 892 | -0.384 | 0.066 |
DiploBot | 840 | -0.453 | 0.065 |
DumbBot | 726 | -0.543 | 0.064 |
AttackBot | 477 | -0.758 | 0.056 |
ChargeBot | 353 | -0.881 | 0.048 |
RevengeBot | 342 | -0.931 | 0.036 |
DefenceBot | 317 | -0.943 | 0.032 |
ParanoidBot | 312 | -0.972 | 0.010 |
HoldBot | 309 | -0.972 | 0.011 |
RandomBot | 277 | -1.000 | 0.000 |
RandBot | 288 | -1.000 | 0.000 |
[Score Summary tables for each successive 500-game quarter]
The above tables summarise, for each bot, the number of plays, the mean (tentative DAIDE-Standard) score (for those plays) and standard deviation of the mean. The first table is for all 2000 games; the following ones are for successive 500-game quarters of the tournament. All are in descending order of mean score.
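The standard deviation of the mean (the standard error) can be computed per bot roughly as follows; this is a sketch, as whether the tournament used the population or the sample variance is not stated:

```python
import math

def mean_and_sd_of_mean(scores):
    """Mean score over a bot's plays, and the standard deviation of that
    mean, estimated as the RMS deviation divided by sqrt(number of plays)."""
    n = len(scores)
    mean = sum(scores) / n
    rms_dev = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    return mean, rms_dev / math.sqrt(n)
```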
The mean number of years per game was 27.1, but the distribution was very skew. Across the 2000 games, a total of 54264 years were played. Only 18 games (0.9%) were draws (all due to termination by the server according to its kill value); 47 games (2.35%) lasted over 100 years, representing 16.4% of the total years played. The longest game lasted 1528 years, representing 2.8% of the total years played. The table below shows the final counts in this marathon game.
Bot | Power | Centres |
---|---|---|
Man'chi AngryBot | FRA | 0 |
DumbBot | RUS | 0 |
KissMyBot2 | TUR | 18 |
Project20M | ITA | 2 |
Project20M | GER | 0 |
Project20M | ENG | 14 |
Project20M | AUS | 0 |
The same raw results were logged as in Tournament #6. At least with the new (tentative DAIDE-Standard) scoring scheme and high kill value (100), KissMyBot v2.0 can, with reasonable certainty, be declared the champion, ahead of Project20M in second place, KissMyBot v1.0 in third place and Man'chi AngryBot in fourth place; all others had negative mean scores. But note, from the standard deviations in the Score Summaries, that the abilities of adjacently ranked bots are not very clearly separated. Apart from randomness, reasonable variations in tournament arrangements could arguably make a difference to the ranking, as shown by KissMyBot v1.0 appearing, on balance, to be champion in Tournament #5, and Project20M having the highest Survivor % and the highest score in the first quarter of the current Tournament.
It would be more meaningful always to include some measure of error or reliability of any tournament results (between bots and/or humans, in any type of game or sport). Standard deviations are convenient, but note that, as defined, the scores do not (and cannot) have a normal (Gaussian) distribution: individual scores are quantized and have a highly skew distribution, within a limited range. With the current bots, draws were rare, so almost all scores were -1 or 6.
If all bots played randomly and there were no draws, then, in the STANDARD variant, there would be an average of 6 losses (-1 point each) for every solo (6 points), giving an RMS deviation of sqrt(6). The mean would be zero. For 2000 games, the standard deviation of the mean would be about sqrt(6/2000) = 0.055. This is close to the value obtained experimentally for the better bots. Most of their games were against their clones or near equals, so scoring was near random.
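Checking that arithmetic (a seven-player game with one solo at +6 and six losses at -1 each):

```python
import math

p_solo = 1 / 7                                       # random play, no draws
mean = p_solo * 6 + (1 - p_solo) * (-1)              # = 0
rms = math.sqrt(p_solo * 6 ** 2 + (1 - p_solo) * 1)  # = sqrt(6) ~ 2.449
sd_of_mean = rms / math.sqrt(2000)                   # ~ 0.055
```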
The very weak bots had small standard deviations because their scores were normally -1; this was epitomised by the random bots that always lost, and so had zero deviation. (Standard deviation would also have approached or reached zero if a bot had approached or reached 100% solos.)
If a bot won and lost 50% of the time, with no draws, its standard deviation would be maximal, as there would be maximum uncertainty in an individual result. All the deviations from the mean, and hence the RMS deviation, would then be 3.5. For 2000 games, the standard deviation of the mean should be about 3.5/sqrt(2000) = 0.078. The more mediocre bots had the highest standard deviations, but much less than this maximum. Compared to the better bots, they would have spent a larger proportion of their games playing against the weaker bots, in which they would have had higher proportions of solos and higher standard deviations. This effect can be seen in the quarter-tournament summaries: standard deviations in comparable circumstances should be twice as large in these as for the whole tournament, but, in fact, the values for the better bots in the first quarter tend to be larger still, presumably due to better pickings being available then, before the frequency of plays by weak bots had decayed away.
Standard deviations would tend to be smaller with a higher proportion of draws, since scores would then be less extreme. In any case, RMS values of individual scores are constrained to lie between 0 and 3.5, being about 2.45 for all-random plays that lead to solos.
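And for the other figures quoted above:

```python
import math

# 50% solos, 50% losses: scores of 6 and -1, mean 2.5, every deviation 3.5.
max_rms_dev = 3.5
sd_of_mean_full = max_rms_dev / math.sqrt(2000)  # ~ 0.078 for the whole tournament
sd_of_mean_quarter = max_rms_dev / math.sqrt(500)  # twice as large for a 500-game quarter
all_random_rms = math.sqrt(6)                    # ~ 2.45, within the 0 to 3.5 range
```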
For the current set of bots, 2000 games was probably about the right number for ranking the better bots. For example, the 500 games in the first quarter would have indicated a different champion. But there is still some room for doubt with the full set of games, and the ranking is even less certain for the weaker bots. The extreme case is illustrated by the random bots, which never won at all; but even Man'chi ParanoidBot and HoldBot appear to be equally good in the all-game summary, whereas they surely do not have identical abilities. Even the relative rankings of DumbBot and DiploBot vary from quarter to quarter, which is unsurprising, given their standard deviations. The poorer resolution of the weaker bots is probably mainly because they played fewer games, owing to the slow-knockout method. This is intentional, as it is generally more interesting to resolve the ranking of the leaders; you cannot have it both ways when playing a given number of games. Unfortunately, where the only factor of merit is soloing, there is little information available about bots that rarely solo.
Even with a large kill value (100), once again, the proportion of time spent processing pathologically long games was not excessive. Even so, in due course, it could become impracticable to run so many games in total; only a much smaller number would be viable if human players were to be included. For example, if each bot had used one second of computer time for each movement turn, each year in the STANDARD variant would take 14 seconds, ignoring other turns and overheads. So this tournament of 54264 simulated years would have taken 8.8 days of real time to run. And such thinking times may be unduly short for more advanced bots (compare with the arguably simpler game of chess). So tournaments would then normally have to have fewer games, thereby giving larger standard deviations of the means. If rankings are to be meaningful in such cases, we will have to hope that the players' abilities are more widely dispersed. (This is probably the case with humans and may become the case when bots use press.)
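The time estimate works out as follows (one second per bot per movement turn, two movement turns per year, seven powers):

```python
seconds_per_year = 7 * 2 * 1             # 7 powers x 2 movement turns x 1 s each = 14 s
total_seconds = 54264 * seconds_per_year
days = total_seconds / (60 * 60 * 24)    # ~ 8.8 days of real time
```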
But the mean scores are limited to the range -1 to 6, so the abilities of only a limited number of bots can be clearly distinguished, for given standard deviations. And as the number of games falls, the standard deviations of the means increase, thereby reducing the number of bot abilities that can be resolved.
However, mean score may not be the best way to express relative ability. It ought to be possible to say that one bot is a hundred times as good as another. How many times better is a bot that scores 0.5 than one scoring -0.5? Adding one to all scores would make such ratios meaningful, and a logarithmic scale, that is, log(score+1) or log(Fitness), might be more useful still. In particular, standard deviations might be better measured on such a logarithmic scale; that is, percentage, rather than absolute, deviations of DAIDE score are probably more appropriate, though a normal distribution should have no upper bound either. (This tournament could be reanalysed in such a manner, but more consideration would be appropriate, and it is probably not worth delaying publication regarding the new champion.)
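As a sketch of such a mapping (illustrative only; note that a bot that always loses, with mean score -1, maps to minus infinity):

```python
import math

def log_merit(mean_score):
    """Logarithmic measure of merit, log(score + 1), as suggested above.
    A bot with mean score 0.5 maps to log(1.5) and one with -0.5 to
    log(0.5), so their factor-of-3 ratio of (score + 1) values becomes a
    fixed difference on this scale.  Mean score -1 maps to -infinity."""
    return math.log(mean_score + 1)
```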
Final Fitness values had a range of 27.4 to 1. Fitness, too, is a measure of relative ability. Unlike mean score, however, it may not have reached a steady state. Values decay exponentially towards their proper values; if a bot never wins, its Fitness will decay exponentially towards zero, but would take an infinite number of games to arrive there. However, no matter how close to zero its Fitness is, a bot that has just won a game will have a certain minimum Fitness, so a snapshot of Fitness is a noisy measure; it could, though, be smoothed by averaging over more games.
The rate of choosing a bot to play is proportional to its Fitness. The number of plays has a smaller range than final Fitness, however, because average Fitness had a smaller range than its final value. Statistically, the number of plays is proportional to the average of Fitness over all plays. Being averages, these are more stable, and so potentially also good measures of relative ability, but they take even longer to reach steady state, as values decay harmonically rather than exponentially.
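A sketch of the selection rule as I understand it (the server's actual sampling details, such as whether clones are drawn with or without replacement, are not given here):

```python
import random

def choose_players(fitness, num_powers=7):
    """Pick the bots for the next game with probability proportional to
    their current Fitness values.  Samples with replacement here, so the
    same bot may appear as several clones in one game."""
    bots = list(fitness)
    weights = [fitness[b] for b in bots]
    return random.choices(bots, weights=weights, k=num_powers)
```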
Evaluation times for new bots or new versions could also be reduced by continuing a previous tournament, but with the new bots now included, initially with high Fitness (to give them a high probability of playing), or even by forcing each game to include one or more clones of the new bots alongside one or more clones of the rated ones. Indeed, maximum information would be gained by having half new and half old, and this would enable press-bots to show their worth while there are few other press-bots to talk to. In this way, time would not be wasted re-evaluating bots that are already rated.
Note that Diplomacy might really need more than a scalar scoring scheme. Abilities might not be transitive; that is, bot A may tend to beat bot B; B may tend to beat C; C may tend to beat A. Even if not so extreme, no doubt some players play better with some combinations of players than others; for example, HoldBot would never incur the wrath of RevengeBot, as it would never attack it. It is even possible, in principle, that an intrinsically weak bot might find a symbiotic niche in conjunction with some other bot. For example, a bot that is generally strong, but weak against ENG, might support a bot that is generally weak, but good at stalemating a potential ENG solo. Perhaps unlikely, but it would make an interesting discovery, so it may be better not to eliminate any bot totally (like keeping bio-diversity).
To follow the consensus of DAIDE members, and because little difference would have resulted with the current bots, the (tentative) DAIDE-standard scoring scheme and related Fitness values are now my preferred measures and I will probably use these alone for all my future Tournaments, without further comment on their merits or otherwise. However, to better represent the range and separations of abilities, a mapping function may be applied to mean values, perhaps as outlined above.