Tournaments for AI in the Game of Diplomacy: Tournament #9
Tournament #9 finished on 13 May 2006. The bots and Server settings were as in Tournament #8, except:

- BlabBot was upgraded to 2.0, which used press as intended – the first and only bot capable of any press at the time.
- KissMyBot was upgraded to 3.01, including the ability to handle (albeit not use) the new press syntax.
- HoldBot, RandBot and DumbBot were updated to handle (albeit not use) the new press syntax; their version numbers are unchanged.
- To maximize the effects of press, all games comprised 3 of one bot-type and 4 of another.
- The Server kill value was reduced from 100 to 10, because games between a few combinations of bots otherwise failed to terminate in a feasible time.
BlabBot 2.0 was later found to have serious bugs, albeit they would not have affected the tournament. The bugs were fixed in 2.1. As 2.1 would have played exactly the same as 2.0 in the tournament, BlabBot's version is renamed as 2.1 in the SAGA database, thereby facilitating future DEMO trials and analyses.
See Bots for details of the players.
Once again, a slow-knockout of 2000 games was used, with the probability of a bot playing made proportional to its current Fitness. Fitness is a measure of a bot's ability to achieve high scores in the (now agreed) DAIDE Standard Scoring Scheme; it is a moving average proportional to recent score+1 values, decaying to 90% after each losing game played.
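As a rough illustration of those mechanics, here is a minimal sketch (in Python, with invented names; the actual tournament code is not shown on this page) of a Fitness update under one plausible reading of the rule above, together with selection of a bot-type with probability proportional to its current Fitness:

```python
import random

DECAY = 0.9  # the previous average decays to 90% after each losing game played

def update_fitness(fitness, score, lost):
    """One plausible reading of the Fitness rule: an exponential moving
    average of (score + 1), with the decay applied after losing games."""
    if lost:
        return DECAY * fitness + (1.0 - DECAY) * (score + 1.0)
    return fitness  # how winning games are folded in is not specified above

def pick_bot_type(fitnesses):
    """Choose a bot-type with probability proportional to its current Fitness."""
    names = list(fitnesses)
    return random.choices(names, weights=[fitnesses[n] for n in names], k=1)[0]

# A bot with Fitness 1.0 that loses with score -1 drops to 0.9:
print(update_fitness(1.0, -1.0, lost=True))
print(pick_bot_type({"DumbBot": 0.826, "KissMyBot3.01": 1.762}))
```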
Unlike previously, only two types of bot competed in any given game: 3 of one and 4 of the other. This was done to enhance the effect of being able to use press (used successfully for the first time in these Tournaments, by BlabBot). Without such a change, while press-bots are rare there would be little advantage or incentive to use press: the effect of press would currently be negligible, requiring two or more clones of BlabBot in a given game, and even then only if they were playing powers that can interact significantly.
The Server kill value was reduced because, in preliminary tests, a few combinations of bot otherwise failed to terminate in a feasible time (see Conclusions).
Where poor results for a bot could possibly have been due to it being unable to cope with press (from BlabBot, which caused catastrophic problems with at least four bots in Tournament #8), a sample Server-log was checked to be sure that all bots were successfully submitting valid orders for at least several years. (All seemed well. Less catastrophic effects would have gone undetected, but, ultimately, the measure of a bot's ability depends on its lack of bugs.)
[Tables: Plays, Solo %, Leader % and Survivor % for each bot]
In the above tables, Plays is the total number of plays by the given bot, where each instance of a bot in a game counts as a play; it shows the effect of the slow-knockout. Solo %, Leader % and Survivor % are the percentages of plays by the given bot in which, at the end of the game, it owned more than half the supply centres, owned at least as many supply centres as any other power, or owned at least one supply centre, respectively. Each table is in descending order of its numeric field.
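For clarity, the three percentages follow directly from the supply-centre counts at the end of a game; a minimal sketch (illustrative only, with invented names):

```python
def classify_play(own_centres, all_centres):
    """Classify one play from the final supply-centre counts.

    own_centres -- final count for the bot being scored
    all_centres -- final counts for all seven powers, including this one
    """
    total = sum(all_centres)                   # 34 on the standard map
    solo = own_centres * 2 > total             # more than half the supply centres
    leader = own_centres >= max(all_centres)   # at least as many as any other power
    survivor = own_centres >= 1                # owns at least one supply centre
    return solo, leader, survivor

# 18 of the 34 centres is a solo, and therefore also a leader and a survivor:
print(classify_play(18, [18, 6, 5, 3, 2, 0, 0]))   # (True, True, True)
```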
Fitness Rank | |
---|---|
Bot | Final Fitness |
KissMyBot3.01 | 1.762 |
Project20M | 1.328 |
BlabBot2 | 1.249 |
DiploBot | 1.028 |
Man'chi AngryBot | 0.994 |
Man'chi AttackBot | 0.892 |
DumbBot | 0.826 |
Man'chi ParanoidBot | 0.759 |
HaAI | 0.739 |
Man'chi DefenceBot | 0.689 |
Man'chi RevengeBot | 0.483 |
Man'chi ChargeBot | 0.398 |
HoldBot | 0.305 |
Man'chi RandBot | 0.202 |
RandBot | 0.145 |
The above table shows the final Fitness values of each bot, in descending order of Fitness.
Score Summary for All Games | |||
---|---|---|---|
Bot | Plays | Mean Score | SD of Mean |
KissMyBot3.01 | 1713 | 0.640 | 0.071 |
BlabBot2 | 1635 | 0.466 | 0.061 |
Project20M | 1330 | 0.287 | 0.068 |
Man'chi AngryBot | 1433 | 0.170 | 0.059 |
DiploBot | 964 | -0.041 | 0.074 |
HaAI | 1055 | -0.122 | 0.067 |
Man'chi AttackBot | 920 | -0.153 | 0.056 |
DumbBot | 821 | -0.211 | 0.069 |
Man'chi DefenceBot | 854 | -0.295 | 0.038 |
Man'chi ParanoidBot | 810 | -0.330 | 0.028 |
Man'chi RevengeBot | 729 | -0.420 | 0.055 |
Man'chi ChargeBot | 632 | -0.481 | 0.068 |
HoldBot | 435 | -0.698 | 0.024 |
Man'chi RandBot | 328 | -0.814 | 0.057 |
RandBot | 341 | -0.881 | 0.038 |
[Tables: Score Summary for each 500-Game Quarter]
The above tables summarise, for each bot, the number of plays, the mean (DAIDE-Standard) score (for those plays) and standard deviation of the mean. The first table is for all 2000 games; the following ones are for successive 500-game quarters of the tournament. All are in descending order of mean score.
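The "SD of Mean" column is the standard deviation of the mean (the standard error), not the standard deviation of the individual scores; a minimal sketch of how each summary row could be computed from a bot's per-play scores (illustrative only):

```python
import statistics

def score_summary(scores):
    """Return (Plays, Mean Score, SD of Mean) for one bot's per-play scores."""
    plays = len(scores)
    mean = statistics.fmean(scores)
    sd_of_mean = statistics.stdev(scores) / plays ** 0.5   # sample SD / sqrt(n)
    return plays, mean, sd_of_mean
```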
Crosstab of Mean Scores of Each Pairing of Bots | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id | Bot | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 19 | 20 |
2 | DumbBot | | -1 | -1.5 | -1.186 | -1.092 | 3.313 | 0.778 | 1.194 | 2.778 | 3.36 | 3.5 | 2.709 | -2.59 | -3.063 | -3.18
3 | HaAI | 1 | | -0.684 | -1.638 | 0.383 | 2.222 | 1.965 | 1.262 | 2.833 | 3.375 | 3.2 | 3.8 | -2.28 | -2.667 | -1.109
4 | DiploBot | 1.5 | 0.684 | | -1.423 | 2.167 | 1.286 | 1.918 | 1.917 | 2.489 | 3.25 | 3.8 | 3.143 | -1.562 | -3.205 | -2.863
5 | Man'chi AngryBot | 1.186 | 1.638 | 1.423 | | 0.558 | 2.7 | 1.001 | 1.094 | 1.042 | 3.5 | 3.5 | 3.161 | -0.069 | -1.76 | -1.109
6 | Man'chi AttackBot | 1.092 | -0.383 | -2.167 | -0.558 | | 1.45 | -0.11 | 0 | 0.605 | 2.694 | 3.055 | 1.227 | -0.689 | -3.139 | -1.81
7 | Man'chi ChargeBot | -3.313 | -2.222 | -1.286 | -2.7 | -1.45 | | 0.208 | 0.496 | 0.667 | 2.4 | 2.667 | 3.5 | -3.327 | -3.2 | -3.058
8 | Man'chi DefenceBot | -0.778 | -1.965 | -1.918 | -1.001 | 0.11 | -0.208 | | 0 | 0.275 | 4 | 1.959 | 0 | -1.826 | -2.538 | -1.767
9 | Man'chi ParanoidBot | -1.194 | -1.262 | -1.917 | -1.094 | 0 | -0.496 | 0 | | 0.292 | 0.225 | 3.75 | 0 | -0.594 | -3.123 | -1.779
10 | Man'chi RevengeBot | -2.778 | -2.833 | -2.489 | -1.042 | -0.605 | -0.667 | -0.275 | -0.292 | | 3 | 2.889 | 1.375 | -2.412 | -2.846 | -2.476
11 | Man'chi RandBot | -3.36 | -3.375 | -3.25 | -3.5 | -2.694 | -2.4 | -4 | -0.225 | -3 | | 3 | 3 | -3.36 | -3.462 | -3.285
12 | RandBot | -3.5 | -3.2 | -3.8 | -3.5 | -3.055 | -2.667 | -1.959 | -3.75 | -2.889 | -3 | | 3 | -3.267 | -3.8 | -2.859
13 | HoldBot | -2.709 | -3.8 | -3.143 | -3.161 | -1.227 | -3.5 | 0 | 0 | -1.375 | -3 | -3 | | -2.592 | -3.625 | -2.967
14 | Project20M | 2.59 | 2.28 | 1.562 | 0.069 | 0.689 | 3.327 | 1.826 | 0.594 | 2.412 | 3.36 | 3.267 | 2.592 | | -1.944 | 0.837
19 | KissMyBot3.01 | 3.063 | 2.667 | 3.205 | 1.76 | 3.139 | 3.2 | 2.538 | 3.123 | 2.846 | 3.462 | 3.8 | 3.625 | 1.944 | | -0.933
20 | BlabBot2 | 3.18 | 1.109 | 2.863 | 1.109 | 1.81 | 3.058 | 1.767 | 1.779 | 2.476 | 3.285 | 2.859 | 2.967 | -0.837 | 0.933 |
The above table shows the mean score for each pairing of bots. For compactness, the internal id of each bot is used for the column headings; the correspondence between id and bot name is shown in the first two columns, and the blank diagonal cells correspond to a bot-type paired with itself. A positive score means the row bot beat the column bot.
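The exact definition of a cell is not stated above; the antisymmetry of the table suggests it is the mean, over games containing both bot-types, of the difference between the two types' mean scores in each such game. A sketch under that assumption (the data layout is invented for illustration):

```python
from collections import defaultdict
from statistics import fmean

def crosstab(games):
    """games: one dict per game, mapping each of its two bot-type names to the
    list of scores of that type's powers in the game.
    Returns {(row_bot, col_bot): mean of (row mean score - col mean score)}."""
    diffs = defaultdict(list)
    for game in games:
        (a, a_scores), (b, b_scores) = game.items()
        diff = fmean(a_scores) - fmean(b_scores)
        diffs[(a, b)].append(diff)
        diffs[(b, a)].append(-diff)   # antisymmetric by construction
    return {pair: fmean(values) for pair, values in diffs.items()}
```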
The mean number of years per game was 24.7, but the distribution was very skewed. In the 2000 games, a total of 49414 years were played. 494 games (24.7%) were draws (all due to termination by the Server according to its kill value); 17 games (0.85%) lasted over 100 years, representing 4.5% of the total years played. The longest game lasted 187 years, representing 0.38% of the total years played (7.6 times the mean of 24.7).
The same raw results were logged as in Tournament #7, and the previous conclusions generally also apply here. The same form of presentation is used here to help compare the effect of the new selection rules (only two bot-types per game). In that comparison, there were various changes in nearby rankings, but none surprising, given their standard deviations. There is a much reduced range of Plays across the bots, probably due to the change of scheme. Fewer bots (just one, HoldBot) had zero Leader % and Solo %. This was probably due to all but the weakest bot now having a chance to play against only the weakest bot. (In this respect, HoldBot is weakest, as it can never lead, let alone solo.)
Notwithstanding random effects, it is also more than plausible that the new selection rules systematically changed ratings and rankings. Firstly, a given game generally had a more significant effect on the scores of its players, albeit with fewer adjustments. Typically, the winning bot-type now gained 3 or 4 points, rather than 6 points, and 1 bot-type now lost 3 or 4 points, rather than 6 bot-types each losing one point. So there would be a smaller number of proportionately larger changes, thereby increasing the randomness in a given number of games, as can be seen from the larger standard deviations this time. (The expected RMS deviation is step_size*sqrt(number_of_steps).)
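A quick simulation (illustrative numbers only, not tournament data) shows that step_size*sqrt(number_of_steps) scaling: the same number of games with steps of 3.5 rather than 1 gives roughly 3.5 times the RMS scatter.

```python
import random

def rms_deviation(step_size, n_steps, trials=10000):
    """Empirical RMS of the sum of n_steps random +/- step_size adjustments."""
    total = 0.0
    for _ in range(trials):
        walk = sum(random.choice((-step_size, step_size)) for _ in range(n_steps))
        total += walk * walk
    return (total / trials) ** 0.5

# Roughly 10 and 35 respectively, i.e. step_size * sqrt(100):
print(rms_deviation(1.0, 100), rms_deviation(3.5, 100))
```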
Secondly, when several clones of a bot-type compete in a given game, the bot-type partly fights itself, so cancelling part of its ability. In this case, multiple clones are the norm. Better bots are now systematically included in games with better bots (themselves); likewise, weaker bots systematically compete with weaker bots.
Thirdly, there now tends to be much less variety of play in a given game. The competitors now probably tend to play a more "pure", simpler strategy, which may well have a significant effect. (Perhaps like comparing blended and single malt whiskies.)
Fourthly, the smaller kill value probably affects bot-types differently.
Having a simpler mix of strategies is probably the main reason it was necessary to significantly reduce the Server's kill value. The problem was that, in preliminary tests with kill=100 (which had been used successfully in Tournament #7), certain pairs of bot-types sometimes became almost stalemated. If literally stalemated, with no change in the supply-centre counts, the Server would end the game after kill years. But in occasional games, a pair would reach an almost-stable position, yet occasionally make a change in their counts before returning to the almost-stable position, thereby inhibiting the Server from terminating the game. This was always a danger, but it was evident, from casual observation, that it was now a serious problem. Whether some games would ever terminate was unknowable, but in a few cases it seemed unlikely to happen in any feasible time. When, previously, there had been more bot-types in a game, there were probably always some that behaved erratically enough to prevent such stalemates. However, given the abnormally large percentage of draws (24.7%, compared with 0.9% in Tournament #7), a larger kill value (or an alternative termination method) is clearly desirable. Progressively larger values will be tried in future Tournaments.
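For readers unfamiliar with the kill option, the behaviour described above amounts roughly to the following (a simplified sketch of the rule, not the Server's actual code; max_years is an illustrative cut-off, not a Server setting): the game is ended as a draw once the supply-centre counts have been static for kill consecutive years, so any occasional change resets that count and keeps the game alive.

```python
def run_game(initial_counts, advance_year, kill=10, max_years=1000):
    """Advance years until a solo, or until counts are static for `kill` years."""
    counts = list(initial_counts)
    static_years = 0
    for year in range(1, max_years + 1):
        new_counts = advance_year(counts)            # per-power counts after this year
        if max(new_counts) * 2 > sum(new_counts):    # some power has soloed
            return "solo", year
        static_years = static_years + 1 if new_counts == counts else 0
        if static_years >= kill:
            return "draw", year                      # killed by the Server
        counts = new_counts
    return "unterminated", max_years                 # the failure mode described above
```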
Well done to Jason van Hal, once again, for his yet further improved KissMyBot (now 3.01). It clearly won the Tournament – not only did it come top in most measures, it came top in Mean Score for all games and for every Quarter of games – Score being currently the "official" measure.
It is interesting that, despite most of KissMyBot's measures being excellent, and its Solo % being superb, its Survivor % was, literally, the most mediocre. This might indicate that it tends to be a risk-taker – all or nothing. This was not apparent with earlier versions of KissMyBot. Unless the style of play of the latest version has changed substantially, the effect is more likely due to the fact that, in this Tournament, each bot systematically plays against more of its clones than before. KissMyBot tends to play against much better bots than the others (its clones), and these competitors are especially good at soloing, which will tend to eliminate a higher-than-average number of the opposition (including its clones).
But probably the most interesting observation, albeit not unexpected, is the extent to which BlabBot does so much better than DumbBot, merely by the addition of a simple press heuristic (see BlabBot for an outline). Indeed, BlabBot can plausibly be considered to have come second with this Tournament scheme and versions of bots used. It came top for Survivor % and Leader %, but was evidently much less able than KissMyBot to convert from leader to solo. It is also plausible that, in an extended Tournament, it would be third, behind Project20M, given their changing rankings in the Quarters of the Games, the closeness in the last Quarter, considering the standard deviations, and the fact that BlabBot was third in Final Fitness. But in any case, the effect of its simple press heuristic was impressive.
But it must be emphasised that this was surely only possible for BlabBot with a special tournament scheme, contrived to be especially friendly towards lone press-bots. With the previous schemes, BlabBot would probably appear only marginally better than DumbBot, since typically few, if any, other players would comprehend what it was saying.
Anyway, if even such a simple press heuristic can so vastly improve a bot's performance – upgrading a middling bot to near-champion – it is clear that the use of press is likely to be a fertile source of improvements for bots in general. Similar improvements could probably be achieved by adding a similar press strategy to any of the other bots. However, the heuristic used is extremely fragile: it relies on total trust between bots with a peace agreement. It would presumably fare much worse than DumbBot against any bot that negotiated peace but did not honour it. Against such a liar, BlabBot would leave itself defenceless, never attempting to exploit the supposed friend, nor even attempting to detect its lie.
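As a caricature of why such total trust is fragile (this is not BlabBot's actual code; see BlabBot for its real outline), a trusting filter over an otherwise DumbBot-like target list might look like this, with no check that the "friend" is honouring the peace:

```python
def filter_targets(candidate_targets, owner_of, peace_partners):
    """Drop every target owned by a power with which peace has been agreed.

    candidate_targets -- provinces the underlying strategy would like to attack
    owner_of          -- mapping from province to the power occupying it
    peace_partners    -- powers whose peace agreements we trust unconditionally
    """
    # Total trust: never attack a 'friend', never verify that the friend
    # is keeping the peace, and never try to exploit it.
    return [p for p in candidate_targets if owner_of.get(p) not in peace_partners]
```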
Like any other press strategy, BlabBot's heuristic would be neutral if too few other bots could understand it, and would be ineffective if too many used the same strategy. Its effect would only be significant in certain types of tournament, such as this one. But, hopefully, an "arms-race" of press sophistication will now ensue (especially after release of BlabBot source). Eventually it should no longer be necessary to tailor tournaments to encourage use of press.
In the Crosstab of Mean Scores of Each Pairing of Bots, beware that, although the numbers of games are not presented here, there were necessarily relatively few games between specific pairs of bots. This was especially so between the low-scoring bots (sometimes only one game, as hinted at by the round values), due to the Slow-Knockout method, so, individually, their values may not all be reliable indicators. But the table does at least give a hint about which bots can directly beat which others, and hence about the compatibility of the various playing styles. Of particular interest are the top three, KissMyBot, BlabBot and Project20M (which should have the most reliable values). It can be seen that KissMyBot beats all but BlabBot; BlabBot beats all but Project20M; and Project20M beats all but KissMyBot. It is cyclic – their scores are intransitive; there is no strict ranking order for them when playing pair-wise, albeit KissMyBot beats Project20M more decisively than KissMyBot is beaten by BlabBot, and so on.
The probability of a bot being chosen to play is proportional to its current Fitness (albeit, in this Tournament, bot plays are grouped together), which would be proportional to the current moving average of score plus one. So, as the Tournament progresses, the proportions of the various competitors to a given bot vary, gradually increasing the (overall and moving) mean Fitness (and score) of the competitors to any given bot. (That is not to say that each bot will find the difficulty of playing each other bot exactly proportional to the ratio of their Fitnesses.) Therefore, the (overall and moving) mean Fitness of all bots will tend to decrease over time, until the trend is lost in noise. In steady state, the (overall and moving) mean Fitness will become proportional to the (overall or moving) mean score plus one. (Eventually, the total number of plays would also become proportional to mean score plus one, but the fact that there is still a wide disparity here does not mean that score is not close to steady state. However, the Quarter Game Summaries indicate that ratings and rankings have reasonably, but not totally, stabilized.)