Diplomacy
Tournaments

Tournament #9

John Newbury       17 July 2012


Tournaments for AI in the Game of Diplomacy: Tournament #9


Method

Tournament #9 finished on 13 May 2006. The bots and Server settings were as in Tournament #8 except

BlabBot 2.0 was later found to have serious bugs, albeit they would not have affected the tournament. The bugs were fixed in 2.1. As 2.1 would have played exactly the same as 2.0 in the tournament, BlabBot's version is renamed as 2.1 in the SAGA database, thereby facilitating future DEMO trials and analyses.

See Bots for details of the players.

Once again, a slow-knockout of 2000 games was used, with probability of a bot playing made proportional to the moving average of its Fitness. Fitness is a measure of a bot's ability to achieve high scores in the (now agreed) DAIDE Standard Scoring Scheme; it is a moving average that is proportional to recent score+1 values, decaying to 90% after each losing game played.

Unlike previously, only two types of bot competed in any given game: 3 of one and 4 of the other. This was done to enhance the effect of being able to use press (used successfully for the first time in these Tournaments, by BlabBot). Without such a change, while press-bots are rare, there would be little advantage or incentive to use press. At present, under the previous scheme, the effect of press would be negligible: it would require two or more clones of BlabBot in a given game, and even then only if they were playing powers that can interact significantly.
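The selection scheme described above can be sketched roughly as follows. This is a minimal illustration, not the Server's or the tournament manager's actual code: the function names, the weight `GAIN` given to the newest score, the rule that the two bot-types must differ, and the random choice of which type fields 3 clones are all assumptions. Only the 90% decay after a losing game (score -1) and the 3-plus-4 line-up come from the text. The sample Fitness values are the Final Fitness figures reported later on this page.

```python
import random

DECAY = 0.9          # from the text: Fitness decays to 90% after a losing game
GAIN = 1.0 - DECAY   # assumption: weight given to the newest (score + 1) value


def update_fitness(fitness, score):
    """One moving-average step; a loss (score = -1) leaves DECAY * fitness."""
    return DECAY * fitness + GAIN * (score + 1.0)


def pick_line_up(fitness_by_bot, rng):
    """Pick two distinct bot-types with probability proportional to Fitness,
    then field 3 clones of one and 4 of the other (which is which is random)."""
    bots = list(fitness_by_bot)
    weights = [fitness_by_bot[b] for b in bots]
    first = rng.choices(bots, weights=weights, k=1)[0]

    # Assumption: the second type is drawn from the remaining types only.
    rest = [b for b in bots if b != first]
    rest_weights = [fitness_by_bot[b] for b in rest]
    second = rng.choices(rest, weights=rest_weights, k=1)[0]

    counts = [3, 4]
    rng.shuffle(counts)
    return {first: counts[0], second: counts[1]}


fitness = {"KissMyBot": 1.762, "Project20M": 1.328, "BlabBot": 1.249}
line_up = pick_line_up(fitness, random.Random(0))
after_loss = update_fitness(1.0, -1.0)
```

With `GAIN = 1 - DECAY`, a bot whose score is steady would settle at a Fitness equal to its mean score plus one, which is in line with the steady-state remark in the Conclusions; a different constant of proportionality would only rescale all Fitness values.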

The Server kill value was reduced because, in preliminary tests, a few combinations of bot otherwise failed to terminate in a feasible time (see Conclusions).

Where poor results for a bot could possibly have been due to its being unable to cope with press (from BlabBot, which caused catastrophic problems with at least four bots in Tournament #8), a sample Server-log was checked to be sure that all bots were successfully submitting valid orders for at least several years. (All seemed well. Less catastrophic effects would have gone undetected, but, ultimately, the measure of a bot's ability depends on its lack of bugs.)

Results

Plays and Percentages

Plays Rank
Bot     Plays    
KissMyBot v3.01 1713
BlabBot 2.0 1635
Man'chi AngryBot 7 1433
Project20M v 0.1 1330
HaAI 0.64 Vanilla 1055
DiploBot v1.1 964
Man'chi AttackBot 7 920
Man'chi DefenceBot 7 854
DumbBot 4 821
Man'chi ParanoidBot 7 810
Man'chi RevengeBot 7 729
Man'chi ChargeBot 7 632
HoldBot 2 435
RandBot 2 341
Man'chi RandBot 7 328

Solo Rank
Bot     Solo %    
KissMyBot v3.01 23.06
BlabBot 2.0 16.09
Project20M v 0.1 15.26
DiploBot v1.1 12.34
Man'chi AngryBot 7 12.00
HaAI 0.64 Vanilla 10.81
DumbBot 4 8.89
Man'chi AttackBot 7 6.41
Man'chi ChargeBot 7 6.17
Man'chi RevengeBot 7 4.53
Man'chi DefenceBot 7 2.22
Man'chi RandBot 7 2.13
RandBot 2 0.88
Man'chi ParanoidBot 7 0.86
HoldBot 2 0.00
 
Leader Rank
Bot  Leader % 
BlabBot 2.0 23.79
KissMyBot v3.01 23.35
Project20M v 0.1 20.15
Man'chi AngryBot 7 18.84
DiploBot v1.1 14.11
HaAI 0.64 Vanilla 13.55
Man'chi AttackBot 7 13.26
DumbBot 4 12.42
Man'chi DefenceBot 7 11.12
Man'chi RevengeBot 7 8.92
Man'chi ChargeBot 7 7.12
Man'chi ParanoidBot 7 6.42
Man'chi RandBot 7 2.44
RandBot 2 0.88
HoldBot 2 0.00
 
Survivor Rank
Bot  Survivor % 
BlabBot 2.0 83.67
Man'chi ParanoidBot 7 82.59
Man'chi DefenceBot 7 78.10
Project20M v 0.1 77.07
Man'chi AngryBot 7 71.11
Man'chi AttackBot 7 70.65
HaAI 0.64 Vanilla 70.33
KissMyBot v3.01 66.20
Man'chi RevengeBot 7 59.67
DiploBot v1.1 58.92
HoldBot 2 56.55
DumbBot 4 55.42
Man'chi ChargeBot 7 52.85
RandBot 2 35.78
Man'chi RandBot 7 31.10

In the above tables, Plays is the total number of plays by the given bot, where each instance of a given bot in a game counts as a play; it shows the effect of slow-knockout; Solo %, Leader % and Survivor % are the percentage of plays by the given bot in which, at the end of the game, it owned more than half the supply centres, owned at least as many supply centres as any other power, or owned at least one supply centre, respectively. Each table is in descending order of the numeric field.
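The three end-of-game categories defined above can be expressed as a short sketch. The function name and the dictionary layout are illustrative assumptions; the thresholds follow the definitions in the text, with 34 being the number of supply centres on the standard map. The final position used in the example is made up.

```python
TOTAL_CENTRES = 34  # supply centres on the standard Diplomacy map


def outcome_flags(centres, power):
    """Classify one play from the final supply-centre counts of all powers."""
    own = centres[power]
    best_other = max(n for p, n in centres.items() if p != power)
    return {
        "solo": own > TOTAL_CENTRES / 2,   # more than half the centres
        "leader": own >= best_other,       # at least as many as any other power
        "survivor": own >= 1,              # at least one centre
    }


# Hypothetical final position (counts sum to 34).
final = {"AUS": 0, "ENG": 10, "FRA": 10, "GER": 4, "ITA": 0, "RUS": 6, "TUR": 4}
eng = outcome_flags(final, "ENG")
aus = outcome_flags(final, "AUS")
```

Here England and France would both count as leaders (a tie for first place), neither would count as a solo, and Austria would count as neither leader nor survivor. Each percentage in the tables is then the number of plays with the flag set, divided by Plays, times 100.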

Fitness

Fitness Rank
Bot  Final Fitness 
KissMyBot3.01 1.762
Project20M 1.328
BlabBot2 1.249
DiploBot 1.028
Man'chi AngryBot 0.994
Man'chi AttackBot 0.892
DumbBot 0.826
Man'chi ParanoidBot 0.759
HaAI 0.739
Man'chi DefenceBot 0.689
Man'chi RevengeBot 0.483
Man'chi ChargeBot 0.398
HoldBot 0.305
Man'chi RandBot 0.202
RandBot 0.145

The above table shows the final Fitness values of each bot, in descending order of Fitness.

Score Summaries

Score Summary for All Games
Bot Plays Mean Score SD of Mean
KissMyBot3.01 1713 0.640 0.071
BlabBot2 1635 0.466 0.061
Project20M 1330 0.287 0.068
Man'chi AngryBot 1433 0.170 0.059
DiploBot 964 -0.041 0.074
HaAI 1055 -0.122 0.067
Man'chi AttackBot 920 -0.153 0.056
DumbBot 821 -0.211 0.069
Man'chi DefenceBot 854 -0.295 0.038
Man'chi ParanoidBot 810 -0.330 0.028
Man'chi RevengeBot 729 -0.420 0.055
Man'chi ChargeBot 632 -0.481 0.068
HoldBot 435 -0.698 0.024
Man'chi RandBot 328 -0.814 0.057
RandBot 341 -0.881 0.038

Score Summary for 1st Quarter of Games
Bot Plays Mean Score SD of Mean
KissMyBot3.01 360 0.800 0.160
BlabBot2 328 0.527 0.143
Man'chi AngryBot 365 0.354 0.127
Project20M 249 0.196 0.152
DiploBot 222 0.096 0.164
DumbBot 242 0.021 0.147
Man'chi AttackBot 243 -0.107 0.113
HaAI 235 -0.130 0.140
Man'chi DefenceBot 209 -0.256 0.078
Man'chi ParanoidBot 206 -0.262 0.064
Man'chi ChargeBot 181 -0.349 0.140
Man'chi RevengeBot 214 -0.351 0.115
HoldBot 168 -0.709 0.037
Man'chi RandBot 146 -0.825 0.084
RandBot 132 -0.935 0.027

Score Summary for 2nd Quarter of Games
Bot Plays Mean Score SD of Mean
KissMyBot3.01 461 0.553 0.134
Project20M 405 0.458 0.129
BlabBot2 453 0.431 0.115
Man'chi AngryBot 337 0.223 0.126
HaAI 242 -0.134 0.139
Man'chi AttackBot 223 -0.249 0.102
Man'chi DefenceBot 204 -0.253 0.086
DiploBot 207 -0.287 0.137
DumbBot 198 -0.342 0.128
Man'chi ParanoidBot 224 -0.404 0.061
Man'chi ChargeBot 158 -0.499 0.131
Man'chi RevengeBot 116 -0.581 0.095
HoldBot 105 -0.681 0.051
Man'chi RandBot 79 -0.783 0.127
RandBot 88 -0.842 0.086
 
Score Summary for 3rd Quarter of Games
Bot Plays Mean Score SD of Mean
KissMyBot3.01 456 0.608 0.137
BlabBot2 415 0.577 0.128
Man'chi AngryBot 351 0.111 0.117
Project20M 316 0.108 0.130
DiploBot 259 -0.026 0.145
HaAI 330 -0.054 0.124
Man'chi AttackBot 215 -0.188 0.113
Man'chi RevengeBot 231 -0.318 0.115
Man'chi DefenceBot 231 -0.369 0.067
Man'chi ChargeBot 163 -0.403 0.145
Man'chi ParanoidBot 173 -0.434 0.043
DumbBot 179 -0.514 0.110
HoldBot 72 -0.658 0.060
RandBot 68 -0.774 0.146
Man'chi RandBot 41 -0.829 0.171
 
Score Summary for 4th Quarter of Games
Bot Plays Mean Score SD of Mean
KissMyBot3.01 436 0.632 0.141
BlabBot2 439 0.351 0.111
Project20M 360 0.313 0.131
DiploBot 276 0.019 0.141
Man'chi AngryBot 380 0.002 0.101
Man'chi AttackBot 239 -0.077 0.118
DumbBot 202 -0.092 0.149
HaAI 248 -0.195 0.131
Man'chi ParanoidBot 207 -0.230 0.046
Man'chi DefenceBot 210 -0.295 0.071
Man'chi RevengeBot 168 -0.536 0.080
HoldBot 90 -0.732 0.049
Man'chi ChargeBot 130 -0.740 0.108
Man'chi RandBot 62 -0.819 0.118
RandBot 53 -0.947 0.037

The above tables summarise, for each bot, the number of plays, the mean (DAIDE-Standard) score (for those plays) and standard deviation of the mean. The first table is for all 2000 games; the following ones are for successive 500-game quarters of the tournament. All are in descending order of mean score.
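For readers unfamiliar with the "SD of Mean" column, the sketch below shows one common way to compute a mean and the standard deviation of that mean (the sample standard deviation divided by the square root of the number of plays). It is an assumption that the figures above were computed this way; the function name and the sample scores are purely illustrative.

```python
import math


def mean_and_sd_of_mean(scores):
    """Return (mean, standard deviation of the mean) for a list of scores."""
    n = len(scores)
    mean = sum(scores) / n
    # Unbiased sample variance (divides by n - 1).
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)


# Hypothetical plays by one bot: one solo (+6) and six losses (-1 each).
mean, sd_of_mean = mean_and_sd_of_mean([6, -1, -1, -1, -1, -1, -1])
```

If this is indeed the method used, KissMyBot's SD of Mean of 0.071 over 1713 plays would correspond to a per-play standard deviation of about 2.9.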

Crosstab of Mean Scores of Each Pairing of Bots
Id Bot 2 3 4 5 6 7 8 9 10 11 12 13 14 19 20
2 DumbBot - -1 -1.5 -1.186 -1.092 3.313 0.778 1.194 2.778 3.36 3.5 2.709 -2.59 -3.063 -3.18
3 HaAI 1 - -0.684 -1.638 0.383 2.222 1.965 1.262 2.833 3.375 3.2 3.8 -2.28 -2.667 -1.109
4 DiploBot 1.5 0.684 - -1.423 2.167 1.286 1.918 1.917 2.489 3.25 3.8 3.143 -1.562 -3.205 -2.863
5 Man'chi AngryBot 1.186 1.638 1.423 - 0.558 2.7 1.001 1.094 1.042 3.5 3.5 3.161 -0.069 -1.76 -1.109
6 Man'chi AttackBot 1.092 -0.383 -2.167 -0.558 - 1.45 -0.11 0 0.605 2.694 3.055 1.227 -0.689 -3.139 -1.81
7 Man'chi ChargeBot -3.313 -2.222 -1.286 -2.7 -1.45 - 0.208 0.496 0.667 2.4 2.667 3.5 -3.327 -3.2 -3.058
8 Man'chi DefenceBot -0.778 -1.965 -1.918 -1.001 0.11 -0.208 - 0 0.275 4 1.959 0 -1.826 -2.538 -1.767
9 Man'chi ParanoidBot -1.194 -1.262 -1.917 -1.094 0 -0.496 0 - 0.292 0.225 3.75 0 -0.594 -3.123 -1.779
10 Man'chi RevengeBot -2.778 -2.833 -2.489 -1.042 -0.605 -0.667 -0.275 -0.292 - 3 2.889 1.375 -2.412 -2.846 -2.476
11 Man'chi RandBot -3.36 -3.375 -3.25 -3.5 -2.694 -2.4 -4 -0.225 -3 - 3 3 -3.36 -3.462 -3.285
12 RandBot -3.5 -3.2 -3.8 -3.5 -3.055 -2.667 -1.959 -3.75 -2.889 -3 - 3 -3.267 -3.8 -2.859
13 HoldBot -2.709 -3.8 -3.143 -3.161 -1.227 -3.5 0 0 -1.375 -3 -3 - -2.592 -3.625 -2.967
14 Project20M 2.59 2.28 1.562 0.069 0.689 3.327 1.826 0.594 2.412 3.36 3.267 2.592 - -1.944 0.837
19 KissMyBot3.01 3.063 2.667 3.205 1.76 3.139 3.2 2.538 3.123 2.846 3.462 3.8 3.625 1.944 - -0.933
20 BlabBot2 3.18 1.109 2.863 1.109 1.81 3.058 1.767 1.779 2.476 3.285 2.859 2.967 -0.837 0.933 -

The above table shows the mean score for each pairing of bot. For compactness, the internal id of each bot is used for headings; the correspondence of id and name of bot being shown in the first two columns. A positive score means the row bot beat the column bot.

Game Lengths

The mean number of years per game was 24.7, but the distribution was very skewed. In the 2000 games, a total of 49414 years were played. 494 games (24.7%) were draws (all due to termination by the server according to its kill value); 17 games (0.85%) lasted over 100 years, representing 4.5% of the total years played. The longest game lasted 187 years, representing 0.38% of the total years played (7.6 times the mean of 24.7).
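The derived figures in this paragraph can be reproduced from the raw counts with a few lines of arithmetic. The variable names are illustrative; the inputs are the counts quoted above.

```python
games = 2000
total_years = 49414
draws = 494
long_games = 17       # games lasting over 100 years
longest = 187         # years in the longest game

mean_years = total_years / games                     # mean years per game
draw_pct = 100 * draws / games                       # percentage of draws
long_pct = 100 * long_games / games                  # percentage of long games
longest_share_pct = 100 * longest / total_years      # longest game's share of all years
longest_vs_mean = longest / mean_years               # longest game relative to the mean
```

Rounded to the precision used in the text, these give 24.7 years, 24.7%, 0.85%, 0.38% and 7.6 respectively. That the mean game length and the draw percentage both round to 24.7 is a coincidence of the data.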

Conclusions

The same raw results were logged as in Tournament #7. The previous conclusions generally also apply here. The same form of presentation is used here to help compare the effect of the new selection rules (only two bot-types per game). In that comparison, there were various changes in nearby rankings, but none surprising, given their standard deviations. There is a much reduced range of Plays for given bots. This was probably due to the change of scheme. Fewer bots (one, HoldBot) had zero Leader % and Solo %. This was probably due to all but the weakest bot now having a chance to play against only the weakest bot. (In this respect, HoldBot is weakest, as it can never lead, let alone solo.)

Notwithstanding random effects, it is also more than plausible that the new selection rules systematically changed ratings and rankings. Firstly, a given game generally had a more significant effect on the scores of its players, albeit with fewer adjustments. Typically, the winning bot-type now gained 3 or 4 points, rather than 6 points; with 1 bot-type now losing 3 or 4 points, rather than 6 bot-types losing one point. So there would be a smaller number of proportionately larger changes, thereby increasing the randomness in a given number of games, as can be seen by the larger Standard Deviations this time. (The expected RMS deviation is step_size*sqrt(number_of_steps).)
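The "3 or 4 points" figure follows directly from the solo scores mentioned above (+6 for the soloing power, -1 for each of the other six), as the sketch below works through. The function names are illustrative, and the step sizes and step counts passed to `expected_rms` are made-up numbers chosen only to show the effect of fewer, larger steps.

```python
import math

SOLO_WIN = 6   # points gained by the soloing power (from the text)
LOSS = -1      # points for each of the six other powers (from the text)


def type_scores_after_solo(winner_clones):
    """Net score of each bot-type when one clone of the winning type solos.

    winner_clones is how many of the 7 powers were played by the winning type."""
    loser_clones = 7 - winner_clones
    winner_type = SOLO_WIN + LOSS * (winner_clones - 1)  # the other clones lose
    loser_type = LOSS * loser_clones
    return winner_type, loser_type


def expected_rms(step_size, number_of_steps):
    """Expected RMS deviation of a random walk with equal-sized steps."""
    return step_size * math.sqrt(number_of_steps)


three_clone_win = type_scores_after_solo(3)   # (4, -4)
four_clone_win = type_scores_after_solo(4)    # (3, -3)

# Hypothetical comparison: the same total movement taken in fewer, larger steps.
many_small = expected_rms(1, 100)   # 10.0
few_large = expected_rms(4, 25)     # 20.0
```

The two cases also account for the many values of exactly 3 and 4 in the crosstab, where a pairing met only once and the game ended in a solo.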

Secondly, when several clones of a bot-type compete in a given game, the bot-type partly fights itself, so cancelling part of its ability. In this case, multiple clones are the norm. Better bots are now systematically included in games with better bots (themselves); likewise, weaker bots systematically compete with weaker bots.

Thirdly, there now tends to be much less variety of play in a given game. The competitors now probably tend to play a more "pure", simpler strategy, which may well have a significant effect. (Perhaps like comparing blended and single malt whiskies.)

Fourthly, the smaller kill value probably affects bot-types differently.

Having a simpler mix of strategies is probably the main reason it was necessary to significantly reduce the server's kill value. The problem was that, in preliminary tests with kill=100 (which was successfully used in Tournament #7), certain pairs of bot-types sometimes became almost-stalemated. If literally stalemated, with no change in supply centre scores, then the server would end the game after kill years. But in occasional games, a pair would reach an almost-stable position, but occasionally make a change in their scores, before returning to the almost-stable position, thereby inhibiting the Server from terminating the game. This was always a danger, but it was evident, from casual observation, that this was now a serious problem. Whether some games would ever terminate was unknowable, but at least seemed unlikely, in a few cases, in any feasible time. When, previously, there had been more bot-types in a game, there were probably always some that behaved erratically enough to prevent such stalemates. However, given the abnormally large percentage (24.7%) of draws (compared with 0.9% in Tournament #7), a larger kill value (or alternative termination method) is clearly desirable. Progressively larger values will be tried in future Tournaments.
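The almost-stalemate problem can be illustrated with a simplified model of the kill rule. This is only a sketch of the behaviour described above (the game ends after kill consecutive years with no change in supply centre counts); it is not the Server's actual code, and the function name and the two made-up histories are assumptions.

```python
def year_of_kill(centre_history, kill):
    """Return the index of the year in which the game would be ended,
    or None if the unchanged-years counter never reaches kill.

    centre_history[y] is the tuple of supply-centre counts after year y."""
    unchanged = 0
    for year in range(1, len(centre_history)):
        if centre_history[year] == centre_history[year - 1]:
            unchanged += 1
        else:
            unchanged = 0   # any change in the counts restarts the count
        if unchanged >= kill:
            return year
    return None


# A true stalemate: the counts never change, so the game is ended.
frozen = [(17, 17)] * 12
# An almost-stalemate: one centre changes hands every 5 years.
flicker = ([(17, 17)] * 5 + [(18, 16)] * 5) * 4

frozen_end = year_of_kill(frozen, kill=10)     # ended in year 10
flicker_end = year_of_kill(flicker, kill=10)   # never ended
```

In this model a smaller kill value catches the flickering position as soon as kill drops below the interval between changes, but it also ends more games that might otherwise have been resolved, which is consistent with the large percentage of draws reported.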

Well done to Jason van Hal, once again, for his yet further improved KissMyBot (now 3.01). It has clearly won the Tournament – not only did it come top in most measures, it came top in Mean Score for All Games, and all the Quarters of Games – Score being currently the "official" measure.

It is interesting that, despite most of KissMyBot's measures being excellent, and its Solo % being superb, its Survivor % was, literally, the most mediocre. This might indicate that it tends to be a risk-taker – all or nothing. This was not apparent with earlier versions of KissMyBot. Unless the style of play of the latest version has changed substantially, the effect is more likely due to the fact that, in this Tournament, each bot systematically plays against more of its clones than before. KissMyBot tends to play against much better bots than the others (its clones), and these competitors are especially good at soloing, which will tend to eliminate a higher-than-average number of the opposition (including its clones).

But probably the most interesting observation, albeit not unexpected, is the extent to which BlabBot does so much better than DumbBot, merely by the addition of a simple press heuristic (see BlabBot for an outline). Indeed, BlabBot can plausibly be considered to have come second with this Tournament scheme and versions of bots used. It came top for Survivor % and Leader %, but was evidently much less able than KissMyBot to convert from leader to solo. It is also plausible that, in an extended Tournament, it would be third, behind Project20M, given their changing rankings in the Quarters of the Games, the closeness in the last Quarter, considering the standard deviations, and the fact that BlabBot was third in Final Fitness. But in any case, the effect of its simple press heuristic was impressive.

But it must be emphasised that this was surely only possible for BlabBot with a special tournament scheme, contrived to be especially friendly towards lone press-bots. With the previous schemes, BlabBot would probably appear only marginally better than DumbBot, since typically few, if any, other players would comprehend what it was saying.

Anyway, if even such a simple press heuristic can vastly improve a bot's performance – upgrading a middling bot to near champion – it is clear that use of press is likely to be a fertile source of improvements for bots in general. Similar improvements could probably be achieved by adding a similar press strategy to any of the other bots. However, the heuristic used is extremely fragile: it relies on total trust between bots with a peace agreement. It would presumably fare much worse than DumbBot against any bot that negotiated peace but did not honour it. Against such a liar, BlabBot would leave itself defenceless, never attempt to exploit the supposed friend, nor even attempt to detect its lie.

Like any other press strategy, BlabBot's heuristic would be neutral if too few other bots could understand it, and would be ineffective if too many used the same strategy. Its effect would only be significant in certain types of tournament, such as this one. But, hopefully, an "arms-race" of press sophistication will now ensue (especially after release of BlabBot source). Eventually it should no longer be necessary to tailor tournaments to encourage use of press.

In the Crosstab of Mean Scores of Each Pairing of Bots, although no figures are presented here, beware that there were necessarily relatively few games between specific pairs of bots. This was especially so (sometimes only one game, as hinted at by the round values) between the low scoring bots (due to the Slow-Knockout method), so individually, their values may not all be reliable indicators. But the table does at least give a hint about which bots can directly beat which others, and hence about the compatibility of the various playing styles. Of particular interest is the top three, KissMyBot, BlabBot and Project20M (which should have the most reliable values). It can be seen that KissMyBot beats all but BlabBot; BlabBot beats all but Project20M; and Project20M beats all but KissMyBot. It is cyclic – their scores are intransitive; there is no strict ranking order for them when playing pair-wise, albeit KissMyBot beats Project20M more decisively than KissMyBot is beaten by BlabBot, and so on.
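The cyclic relationship among the top three can be checked mechanically from the crosstab. The sketch below uses only the three relevant mean scores from the table (row bot versus column bot) and relies on the table being antisymmetric; the function names are illustrative.

```python
from itertools import permutations

# Mean scores taken from the crosstab: (row bot, column bot) -> mean score.
mean_score = {
    ("KissMyBot", "Project20M"): 1.944,
    ("KissMyBot", "BlabBot"): -0.933,
    ("BlabBot", "Project20M"): -0.837,
}


def beats(a, b):
    """True if a has a positive mean score against b."""
    if (a, b) in mean_score:
        return mean_score[(a, b)] > 0
    return mean_score[(b, a)] < 0   # the table is antisymmetric


def strict_rankings(bots):
    """All orderings in which every bot beats every bot listed after it."""
    return [
        order
        for order in permutations(bots)
        if all(
            beats(order[i], order[j])
            for i in range(len(order))
            for j in range(i + 1, len(order))
        )
    ]


top_three = ["KissMyBot", "BlabBot", "Project20M"]
rankings = strict_rankings(top_three)   # empty: no consistent ordering exists
```

No ordering survives, because KissMyBot beats Project20M, Project20M beats BlabBot, and BlabBot beats KissMyBot.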

The probability of a bot being chosen to play is proportional to its current Fitness (albeit, in this Tournament, bot plays are grouped together), which would be proportional to the current moving average score plus one. So, as the Tournament progresses, the proportions of the various competitors to a given bot vary, gradually increasing the (overall and moving) mean Fitness (and score) of competitors to any given bot. (That is not to say that each bot will find the difficulty of playing each other bot exactly proportional to the ratio of their Fitnesses.) Therefore, the (overall and mean) Fitness of all bots will tend to decrease over time, until the trend is lost in noise. In steady state, (overall and moving) mean Fitness will become proportional to (overall or moving) mean score plus one. (Eventually, the total number of plays would also become proportional to mean score plus one, but the fact that there is still a wide disparity here does not mean that score is not close to steady state. However, the Quarter Game Summaries indicate that ratings and rankings have reasonably, but not totally, stabilized.)

