Tournaments for AI in the Game of Diplomacy: Tournament #9
Tournament #9 finished on 13 May 2006. The bots and Server settings were as in Tournament #8, except:

- BlabBot was upgraded to 2.0, which used press as intended – the first and only bot capable of any press at the time.
- KissMyBot was upgraded to 3.01, including the ability to handle (albeit not use) the new press syntax.
- HoldBot, RandBot and DumbBot were updated to handle (albeit not use) the new press syntax; their version numbers are unchanged.
- To maximize the effects of press, all games comprised 3 of one bot-type and 4 of another.
- The Server kill value was reduced from 100 to 10, because games between a few combinations of bots otherwise failed to terminate in a feasible time.
BlabBot 2.0 was later found to have serious bugs, albeit they would not have affected the tournament. The bugs were fixed in 2.1. As 2.1 would have played exactly the same as 2.0 in the tournament, BlabBot's version is renamed as 2.1 in the SAGA database, thereby facilitating future DEMO trials and analyses.
See Bots for details of the players.
Once again, a slow-knockout of 2000 games was used, with the probability of a bot playing made proportional to its current Fitness. Fitness is a measure of a bot's ability to achieve high scores in the (now agreed) DAIDE Standard Scoring Scheme; it is a moving average proportional to recent score+1 values, decaying to 90% after each losing game played.
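As a rough illustration of those mechanics, here is a minimal sketch (in Python, with invented names; the actual tournament code is not shown on this page) of a Fitness update under one plausible reading of the rule above, together with selection of a bot-type with probability proportional to its current Fitness:

```python
import random

DECAY = 0.9  # the previous average decays to 90% after each losing game played

def update_fitness(fitness, score, lost):
    """One plausible reading of the Fitness rule: an exponential moving
    average of (score + 1), with the decay applied after losing games."""
    if lost:
        return DECAY * fitness + (1.0 - DECAY) * (score + 1.0)
    return fitness  # how winning games are folded in is not specified above

def pick_bot_type(fitnesses):
    """Choose a bot-type with probability proportional to its current Fitness."""
    names = list(fitnesses)
    return random.choices(names, weights=[fitnesses[n] for n in names], k=1)[0]

# A bot with Fitness 1.0 that loses with score -1 drops to 0.9:
print(update_fitness(1.0, -1.0, lost=True))
print(pick_bot_type({"DumbBot": 0.826, "KissMyBot3.01": 1.762}))
```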
Unlike previously, only two types of bot competed in any given game: 3 of one and 4 of the other. This was done to enhance the effect of being able to use press (used successfully for the first time in these Tournaments, by BlabBot). Without such a change, while press-bots are rare there would be little advantage or incentive to use press: the effect of press would currently be negligible, requiring two or more clones of BlabBot in a given game, and even then only if they were playing powers that can interact significantly.
The Server kill value was reduced because, in preliminary tests, a few combinations of bot otherwise failed to terminate in a feasible time (see Conclusions).
Where poor results for a bot could possibly have been due to it being unable to cope with press (from BlabBot, which caused catastrophic problems with at least four bots in Tournament #8), a sample Server-log was checked to be sure that all bots were successfully submitting valid orders for at least several years. (All seemed well. Less catastrophic effects would have gone undetected, but, ultimately, the measure of a bot's ability depends on its lack of bugs.)
[Tables: Plays, Solo %, Leader % and Survivor % for each bot]
In the above tables, Plays is the total number of plays by the given bot, where each instance of a bot in a game counts as a play; it shows the effect of the slow-knockout. Solo %, Leader % and Survivor % are the percentages of plays by the given bot in which, at the end of the game, it owned more than half the supply centres, owned at least as many supply centres as any other power, or owned at least one supply centre, respectively. Each table is in descending order of its numeric field.
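For clarity, the three percentages follow directly from the supply-centre counts at the end of a game; a minimal sketch (illustrative only, with invented names):

```python
def classify_play(own_centres, all_centres):
    """Classify one play from the final supply-centre counts.

    own_centres -- final count for the bot being scored
    all_centres -- final counts for all seven powers, including this one
    """
    total = sum(all_centres)                   # 34 on the standard map
    solo = own_centres * 2 > total             # more than half the supply centres
    leader = own_centres >= max(all_centres)   # at least as many as any other power
    survivor = own_centres >= 1                # owns at least one supply centre
    return solo, leader, survivor

# 18 of the 34 centres is a solo, and therefore also a leader and a survivor:
print(classify_play(18, [18, 6, 5, 3, 2, 0, 0]))   # (True, True, True)
```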
Fitness Rank | |
---|---|
Bot | Final Fitness |
KissMyBot3.01 | 1.762 |
Project20M | 1.328 |
BlabBot2 | 1.249 |
DiploBot | 1.028 |
Man'chi AngryBot | 0.994 |
Man'chi AttackBot | 0.892 |
DumbBot | 0.826 |
Man'chi ParanoidBot | 0.759 |
HaAI | 0.739 |
Man'chi DefenceBot | 0.689 |
Man'chi RevengeBot | 0.483 |
Man'chi ChargeBot | 0.398 |
HoldBot | 0.305 |
Man'chi RandBot | 0.202 |
RandBot | 0.145 |
The above table shows the final Fitness values of each bot, in descending order of Fitness.
Score Summary for All Games | |||
---|---|---|---|
Bot | Plays | Mean Score | SD of Mean |
KissMyBot3.01 | 1713 | 0.640 | 0.071 |
BlabBot2 | 1635 | 0.466 | 0.061 |
Project20M | 1330 | 0.287 | 0.068 |
Man'chi AngryBot | 1433 | 0.170 | 0.059 |
DiploBot | 964 | -0.041 | 0.074 |
HaAI | 1055 | -0.122 | 0.067 |
Man'chi AttackBot | 920 | -0.153 | 0.056 |
DumbBot | 821 | -0.211 | 0.069 |
Man'chi DefenceBot | 854 | -0.295 | 0.038 |
Man'chi ParanoidBot | 810 | -0.330 | 0.028 |
Man'chi RevengeBot | 729 | -0.420 | 0.055 |
Man'chi ChargeBot | 632 | -0.481 | 0.068 |
HoldBot | 435 | -0.698 | 0.024 |
Man'chi RandBot | 328 | -0.814 | 0.057 |
RandBot | 341 | -0.881 | 0.038 |
[Tables: Score Summary for each 500-Game Quarter]
The above tables summarise, for each bot, the number of plays, the mean (DAIDE-Standard) score (for those plays) and standard deviation of the mean. The first table is for all 2000 games; the following ones are for successive 500-game quarters of the tournament. All are in descending order of mean score.
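The "SD of Mean" column is the standard deviation of the mean (the standard error), not the standard deviation of the individual scores; a minimal sketch of how each summary row could be computed from a bot's per-play scores (illustrative only):

```python
import statistics

def score_summary(scores):
    """Return (Plays, Mean Score, SD of Mean) for one bot's per-play scores."""
    plays = len(scores)
    mean = statistics.fmean(scores)
    sd_of_mean = statistics.stdev(scores) / plays ** 0.5   # sample SD / sqrt(n)
    return plays, mean, sd_of_mean
```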
Crosstab of Mean Scores of Each Pairing of Bots | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id | Bot | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 19 | 20 |
2 | DumbBot | | -1 | -1.5 | -1.186 | -1.092 | 3.313 | 0.778 | 1.194 | 2.778 | 3.36 | 3.5 | 2.709 | -2.59 | -3.063 | -3.18
3 | HaAI | 1 | | -0.684 | -1.638 | 0.383 | 2.222 | 1.965 | 1.262 | 2.833 | 3.375 | 3.2 | 3.8 | -2.28 | -2.667 | -1.109
4 | DiploBot | 1.5 | 0.684 | | -1.423 | 2.167 | 1.286 | 1.918 | 1.917 | 2.489 | 3.25 | 3.8 | 3.143 | -1.562 | -3.205 | -2.863
5 | Man'chi AngryBot | 1.186 | 1.638 | 1.423 | | 0.558 | 2.7 | 1.001 | 1.094 | 1.042 | 3.5 | 3.5 | 3.161 | -0.069 | -1.76 | -1.109
6 | Man'chi AttackBot | 1.092 | -0.383 | -2.167 | -0.558 | | 1.45 | -0.11 | 0 | 0.605 | 2.694 | 3.055 | 1.227 | -0.689 | -3.139 | -1.81
7 | Man'chi ChargeBot | -3.313 | -2.222 | -1.286 | -2.7 | -1.45 | | 0.208 | 0.496 | 0.667 | 2.4 | 2.667 | 3.5 | -3.327 | -3.2 | -3.058
8 | Man'chi DefenceBot | -0.778 | -1.965 | -1.918 | -1.001 | 0.11 | -0.208 | | 0 | 0.275 | 4 | 1.959 | 0 | -1.826 | -2.538 | -1.767
9 | Man'chi ParanoidBot | -1.194 | -1.262 | -1.917 | -1.094 | 0 | -0.496 | 0 | | 0.292 | 0.225 | 3.75 | 0 | -0.594 | -3.123 | -1.779
10 | Man'chi RevengeBot | -2.778 | -2.833 | -2.489 | -1.042 | -0.605 | -0.667 | -0.275 | -0.292 | | 3 | 2.889 | 1.375 | -2.412 | -2.846 | -2.476
11 | Man'chi RandBot | -3.36 | -3.375 | -3.25 | -3.5 | -2.694 | -2.4 | -4 | -0.225 | -3 | | 3 | 3 | -3.36 | -3.462 | -3.285
12 | RandBot | -3.5 | -3.2 | -3.8 | -3.5 | -3.055 | -2.667 | -1.959 | -3.75 | -2.889 | -3 | | 3 | -3.267 | -3.8 | -2.859
13 | HoldBot | -2.709 | -3.8 | -3.143 | -3.161 | -1.227 | -3.5 | 0 | 0 | -1.375 | -3 | -3 | | -2.592 | -3.625 | -2.967
14 | Project20M | 2.59 | 2.28 | 1.562 | 0.069 | 0.689 | 3.327 | 1.826 | 0.594 | 2.412 | 3.36 | 3.267 | 2.592 | | -1.944 | 0.837
19 | KissMyBot3.01 | 3.063 | 2.667 | 3.205 | 1.76 | 3.139 | 3.2 | 2.538 | 3.123 | 2.846 | 3.462 | 3.8 | 3.625 | 1.944 | | -0.933
20 | BlabBot2 | 3.18 | 1.109 | 2.863 | 1.109 | 1.81 | 3.058 | 1.767 | 1.779 | 2.476 | 3.285 | 2.859 | 2.967 | -0.837 | 0.933 |
The above table shows the mean score for each pairing of bots. For compactness, the internal id of each bot is used for the column headings; the correspondence between id and bot name is shown in the first two columns, and the blank diagonal cells correspond to a bot-type paired with itself. A positive score means the row bot beat the column bot.
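The exact definition of a cell is not stated above; the antisymmetry of the table suggests it is the mean, over games containing both bot-types, of the difference between the two types' mean scores in each such game. A sketch under that assumption (the data layout is invented for illustration):

```python
from collections import defaultdict
from statistics import fmean

def crosstab(games):
    """games: one dict per game, mapping each of its two bot-type names to the
    list of scores of that type's powers in the game.
    Returns {(row_bot, col_bot): mean of (row mean score - col mean score)}."""
    diffs = defaultdict(list)
    for game in games:
        (a, a_scores), (b, b_scores) = game.items()
        diff = fmean(a_scores) - fmean(b_scores)
        diffs[(a, b)].append(diff)
        diffs[(b, a)].append(-diff)   # antisymmetric by construction
    return {pair: fmean(values) for pair, values in diffs.items()}
```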
The mean number of years per game was 24.7, but the distribution was very skewed. In the 2000 games, a total of 49414 years were played. 494 games (24.7%) were draws (all due to termination by the Server according to its kill value); 17 games (0.85%) lasted over 100 years, representing 4.5% of the total years played. The longest game lasted 187 years, representing 0.38% of the total years played (7.6 times the mean of 24.7).
The same raw results were logged as in Tournament #7, and the previous conclusions generally also apply here. The same form of presentation is used here to help compare the effect of the new selection rules (only two bot-types per game). In that comparison, there were various changes in nearby rankings, but none surprising, given their standard deviations. There is a much reduced range of Plays across the bots, probably due to the change of scheme. Fewer bots (just one, HoldBot) had zero Leader % and Solo %. This was probably due to all but the weakest bot now having a chance to play against only the weakest bot. (In this respect, HoldBot is weakest, as it can never lead, let alone solo.)
Notwithstanding random effects, it is also more than plausible that the new selection rules systematically changed ratings and rankings. Firstly, a given game generally had a more significant effect on the scores of its players, albeit with fewer adjustments. Typically, the winning bot-type now gained 3 or 4 points, rather than 6 points, and 1 bot-type now lost 3 or 4 points, rather than 6 bot-types each losing one point. So there would be a smaller number of proportionately larger changes, thereby increasing the randomness in a given number of games, as can be seen from the larger standard deviations this time. (The expected RMS deviation is step_size*sqrt(number_of_steps).)
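A quick simulation (illustrative numbers only, not tournament data) shows that step_size*sqrt(number_of_steps) scaling: the same number of games with steps of 3.5 rather than 1 gives roughly 3.5 times the RMS scatter.

```python
import random

def rms_deviation(step_size, n_steps, trials=10000):
    """Empirical RMS of the sum of n_steps random +/- step_size adjustments."""
    total = 0.0
    for _ in range(trials):
        walk = sum(random.choice((-step_size, step_size)) for _ in range(n_steps))
        total += walk * walk
    return (total / trials) ** 0.5

# Roughly 10 and 35 respectively, i.e. step_size * sqrt(100):
print(rms_deviation(1.0, 100), rms_deviation(3.5, 100))
```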
Secondly, when several clones of a bot-type compete in a given game, the bot-type partly fights itself, so cancelling part of its ability. In this case, multiple clones are the norm. Better bots are now systematically included in games with better bots (themselves); likewise, weaker bots systematically compete with weaker bots.
Thirdly, there now tends to be much less variety of play in a given game. The competitors now probably tend to play a more "pure", simpler strategy, which may well have a significant effect. (Perhaps like comparing blended and single malt whiskies.)
Fourthly, the smaller kill value probably affects bot-types differently.
Having a simpler mix of strategies is probably the main reason it was necessary to significantly reduce the Server's kill value. The problem was that, in preliminary tests with kill=100 (which had been used successfully in Tournament #7), certain pairs of bot-types sometimes became almost stalemated. If literally stalemated, with no change in the supply-centre counts, the Server would end the game after kill years. But in occasional games, a pair would reach an almost-stable position, yet occasionally make a change in their counts before returning to the almost-stable position, thereby inhibiting the Server from terminating the game. This was always a danger, but it was evident, from casual observation, that it was now a serious problem. Whether some games would ever terminate was unknowable, but in a few cases it seemed unlikely to happen in any feasible time. When, previously, there had been more bot-types in a game, there were probably always some that behaved erratically enough to prevent such stalemates. However, given the abnormally large percentage of draws (24.7%, compared with 0.9% in Tournament #7), a larger kill value (or an alternative termination method) is clearly desirable. Progressively larger values will be tried in future Tournaments.
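For readers unfamiliar with the kill option, the behaviour described above amounts roughly to the following (a simplified sketch of the rule, not the Server's actual code; max_years is an illustrative cut-off, not a Server setting): the game is ended as a draw once the supply-centre counts have been static for kill consecutive years, so any occasional change resets that count and keeps the game alive.

```python
def run_game(initial_counts, advance_year, kill=10, max_years=1000):
    """Advance years until a solo, or until counts are static for `kill` years."""
    counts = list(initial_counts)
    static_years = 0
    for year in range(1, max_years + 1):
        new_counts = advance_year(counts)            # per-power counts after this year
        if max(new_counts) * 2 > sum(new_counts):    # some power has soloed
            return "solo", year
        static_years = static_years + 1 if new_counts == counts else 0
        if static_years >= kill:
            return "draw", year                      # killed by the Server
        counts = new_counts
    return "unterminated", max_years                 # the failure mode described above
```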
Well done to Jason van Hal, once again, for his yet further improved KissMyBot (now 3.01). It clearly won the Tournament – not only did it come top in most measures, it came top in Mean Score for all games and for every Quarter of games – Score being currently the "official" measure.
It is interesting that, despite most of KissMyBot's measures being excellent, and its Solo % being superb, its Survivor % was, literally, the most mediocre. This might indicate that it tends to be a risk-taker – all or nothing. This was not apparent with earlier versions of KissMyBot. Unless the style of play of the latest version has changed substantially, the effect is more likely due to the fact that, in this Tournament, each bot systematically plays against more of its clones than before. KissMyBot tends to play against much better bots than the others (its clones), and these competitors are especially good at soloing, which will tend to eliminate a higher-than-average number of the opposition (including its clones).
But probably the most interesting observation, albeit not unexpected, is the extent to which BlabBot does so much better than DumbBot, merely by the addition of a simple press heuristic (see BlabBot for an outline). Indeed, BlabBot can plausibly be considered to have come second with this Tournament scheme and versions of bots used. It came top for Survivor % and Leader %, but was evidently much less able than KissMyBot to convert from leader to solo. It is also plausible that, in an extended Tournament, it would be third, behind Project20M, given their changing rankings in the Quarters of the Games, the closeness in the last Quarter, considering the standard deviations, and the fact that BlabBot was third in Final Fitness. But in any case, the effect of its simple press heuristic was impressive.
But it must be emphasised that this was surely only possible for BlabBot with a special tournament scheme, contrived to be especially friendly towards lone press-bots. With the previous schemes, BlabBot would probably appear only marginally better than DumbBot, since typically few, if any, other players would comprehend what it was saying.
Anyway, if even such a simple press heuristic can so vastly improve a bot's performance – upgrading a middling bot to near-champion – it is clear that the use of press is likely to be a fertile source of improvements for bots in general. Similar improvements could probably be achieved by adding a similar press strategy to any of the other bots. However, the heuristic used is extremely fragile: it relies on total trust between bots with a peace agreement. It would presumably fare much worse than DumbBot against any bot that negotiated peace but did not honour it. Against such a liar, BlabBot would leave itself defenceless, never attempting to exploit the supposed friend, nor even attempting to detect its lie.
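As a caricature of why such total trust is fragile (this is not BlabBot's actual code; see BlabBot for its real outline), a trusting filter over an otherwise DumbBot-like target list might look like this, with no check that the "friend" is honouring the peace:

```python
def filter_targets(candidate_targets, owner_of, peace_partners):
    """Drop every target owned by a power with which peace has been agreed.

    candidate_targets -- provinces the underlying strategy would like to attack
    owner_of          -- mapping from province to the power occupying it
    peace_partners    -- powers whose peace agreements we trust unconditionally
    """
    # Total trust: never attack a 'friend', never verify that the friend
    # is keeping the peace, and never try to exploit it.
    return [p for p in candidate_targets if owner_of.get(p) not in peace_partners]
```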
Like any other press strategy, BlabBot's heuristic would be neutral if too few other bots could understand it, and would be ineffective if too many used the same strategy. Its effect would only be significant in certain types of tournament, such as this one. But, hopefully, an "arms-race" of press sophistication will now ensue (especially after release of BlabBot source). Eventually it should no longer be necessary to tailor tournaments to encourage use of press.
In the Crosstab of Mean Scores of Each Pairing of Bots, beware that, although the numbers of games are not presented here, there were necessarily relatively few games between specific pairs of bots. This was especially so between the low-scoring bots (sometimes only one game, as hinted at by the round values), due to the Slow-Knockout method, so, individually, their values may not all be reliable indicators. But the table does at least give a hint about which bots can directly beat which others, and hence about the compatibility of the various playing styles. Of particular interest are the top three, KissMyBot, BlabBot and Project20M (which should have the most reliable values). It can be seen that KissMyBot beats all but BlabBot; BlabBot beats all but Project20M; and Project20M beats all but KissMyBot. It is cyclic – their scores are intransitive; there is no strict ranking order for them when playing pair-wise, albeit KissMyBot beats Project20M more decisively than KissMyBot is beaten by BlabBot, and so on.
The probability of a bot being chosen to play is proportional to its current Fitness (albeit, in this Tournament, bot plays are grouped together), which would be proportional to the current moving average of score plus one. So, as the Tournament progresses, the proportions of the various competitors to a given bot vary, gradually increasing the (overall and moving) mean Fitness (and score) of the competitors to any given bot. (That is not to say that each bot will find the difficulty of playing each other bot exactly proportional to the ratio of their Fitnesses.) Therefore, the (overall and moving) mean Fitness of all bots will tend to decrease over time, until the trend is lost in noise. In steady state, the (overall and moving) mean Fitness will become proportional to the (overall or moving) mean score plus one. (Eventually, the total number of plays would also become proportional to mean score plus one, but the fact that there is still a wide disparity here does not mean that score is not close to steady state. However, the Quarter Game Summaries indicate that ratings and rankings have reasonably, but not totally, stabilized.)