Tournaments for AI in the Game of Diplomacy: Tournament #1
Tournament #1 finished on 26 April 2005. It was directed automatically by the (ad hoc) DeepLoamSea Tournament Director Tools (DTDT). 1000 games of STANDARD variant Diplomacy were controlled by the DAIDE Server, with results being saved to an Access database for later analysis. See Bots for details of the players.
For each game, the first bot to be selected was the one that had played the fewest times so far in the tournament (selected arbitrarily when equal). Further bots for the game were selected uniformly at random from those available, with a given selection having no effect on the probabilities for later selections in the game. If a game seemed to hang (indicated by the turn not advancing for many seconds, depending on stage) it was terminated and rerun. The same selection of bots was used for a rerun, to avoid biasing against selecting error-prone bots. (There would otherwise have been a significant bias. However, the Server would generally assign the bots to different powers next time, so there remained a bias against any specific bot-power assignments that tended to hang.) In this way, each game comprised a uniformly random mixture of bots, including possible clones of a bot, but (because of how the first bot was selected in each game) the choice of games to play tended to minimise the variation in the number of times each bot was selected over the whole tournament. (As the current Server always randomly assigns the specified bots to the available powers, it was not possible actively to minimise variation in the assignment of a given bot to a given power.)
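A minimal sketch of this selection rule is given below (the names are illustrative, not taken from the DTDT code; a rerun after a hang simply reuses the previous selection):

```python
import random

def select_bots_for_game(bots, plays_so_far, powers=7):
    """Choose the bots for one game under the rule described above.

    The first slot goes to a bot that has played the fewest times so far
    (ties broken arbitrarily); the remaining slots are filled uniformly at
    random and independently, so clones of a bot are possible.
    """
    fewest = min(plays_so_far[b] for b in bots)
    first = random.choice([b for b in bots if plays_so_far[b] == fewest])
    selection = [first] + [random.choice(bots) for _ in range(powers - 1)]
    for b in selection:
        plays_so_far[b] += 1
    return selection
```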
Each game was terminated when a bot had gained more than half the supply centres (normal finish), or when there had been no change to supply centre scores for a year (potential stalemate).
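In the STANDARD variant there are 34 supply centres, so the finish test applied at the end of each game year amounts to something like the following (names illustrative only):

```python
TOTAL_CENTRES = 34  # STANDARD variant

def finish_reason(centre_counts, years_unchanged):
    """centre_counts: final supply-centre count per power for the year just played."""
    if max(centre_counts.values()) > TOTAL_CENTRES // 2:
        return "solo"                   # normal finish: more than half the centres
    if years_unchanged >= 1:
        return "potential stalemate"    # no change to centre counts for a year
    return None                         # play on
```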
The following optional DAIDE Server version 0.24 flags were set:
-var=STANDARD
-lvl=130 (press level 130)
-kill=1 (ending game if no change in supply centre assignments for 1 year)
-mtl=2, -rtl=2, -btl=2 (movement, retreat and adjustment time limits of 2 seconds each; with the unlimited default, at least one bot occasionally computed for ages, perhaps forever)
-npr and -npb (no press during retreats or adjustments)
-xlog (no logging; otherwise extremely slow)
NB: -ptl=1 (press time limit before deadline) would have been set, but caused Server failure. However, as no bot here could use press, all press settings were irrelevant in this tournament.
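For example, assuming the Server executable is called AIserver (the executable name is an assumption here; the flags are exactly those listed above), the full invocation would look something like:
AIserver -var=STANDARD -lvl=130 -kill=1 -mtl=2 -rtl=2 -btl=2 -npr -npb -xlog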
The names and versions of the cohort of competing bots were, alphabetically:
DiploBot v1.1
DumbBot 4
HaAI 0.64 Vanilla
Man'chi AngryBot 7
Man'chi AttackBot 7
Man'chi ChargeBot 7
Man'chi DefenceBot 7
Man'chi ParanoidBot 7
Man'chi RandBot 7
Man'chi RevengeBot 7
See the DAIDE site for details of these bots, including outline descriptions of their strategies.
The tournament of 1000 games took about 10 hours to run on an AMD 3400 with 1 GB of RAM under Windows XP Home, with little other concurrent activity. 34% of the time was spent in game set-up, which included the time to start the Server and bots, and the time wasted on games that hung before completion; 4 games had very outlying set-up times and must have taken exceptionally many restarts to succeed. Of the 66% of the time that was actually devoted to playing successful games, the mean time per year was 1.55 seconds. Although not directly measured, from casual observation the set time limits rarely expired (except when a game was totally hung), with turns rarely taking more than a fraction of a second.
An analysis of the performance of each bot is shown in Tables 1a to 1d. Each is sorted in descending order of its last column. Plays is the number of times the bot played, multiple instances in a game counting separately; the average is therefore necessarily 700 in this tournament. Score is the average number of points that a given bot received per play, and indicates the bot's strength relative to the average of the cohort, which is scaled to be 1. In each game, one point for each power (7 for STANDARD) was shared between all the leader bots of the game. A leader is a bot that owned at least as many supply centres as any other when the game ended. A solo is a bot that finished with more than half the available supply centres (and hence must be the sole leader). A survivor is a bot that still owned some supply centres when the game ended. The percentages of Solo, Leader and Survivor relate to the plays by a given bot. 57.6% of games had a solo; the remainder were formally unresolved by the standard rules for a full game; draws were not accepted (or requested), although the Server classed them as DIAS (Draw Including All Survivors).
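As an illustration of the scheme just described, the Score column could be reproduced from the saved per-game results roughly as follows (a sketch only: the data layout and names are assumptions, and each game's points are taken to be split equally between its leaders):

```python
from collections import defaultdict

POWERS = 7  # points shared out per game equal the number of powers in STANDARD

def compute_scores(games):
    """games: one list per game of (bot_name, final_centre_count) pairs,
    one pair per playing instance, so clones of a bot count separately."""
    points = defaultdict(float)
    plays = defaultdict(int)
    for plays_in_game in games:
        best = max(centres for _, centres in plays_in_game)
        leaders = [bot for bot, centres in plays_in_game if centres == best]
        share = POWERS / len(leaders)        # 7 points split between the leaders
        for bot, _ in plays_in_game:
            plays[bot] += 1
        for bot in leaders:
            points[bot] += share
    # 7 points are shared among 7 plays per game, so the mean points per play
    # over the whole cohort is 1; this ratio is therefore the Score of Table 1a.
    return {bot: points[bot] / plays[bot] for bot in plays}
```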
[Tables 1a to 1d: Performance of Each Bot]
An analysis of the performance of each power is shown in Tables 2a to 2d. Each is sorted in descending order of its last column. The values in each table are analogous to those for the bots, above, but relate to the power rather than the bot concerned. Plays is not shown, as it was always necessarily equal to the total number of games (1000).
[Tables 2a to 2d: Performance of Each Power]
Some further miscellaneous statistics are shown in Table 3.
Table 3: Miscellaneous Statistics

| | Minimum | Maximum | Mean |
|---|---|---|---|
| Set-up Seconds per Game | 3.5 | 566.2 | 12.1 |
| Playing Seconds per Game | 0.1 | 163.5 | 23.6 |
| Total Seconds per Game | 7.9 | 670.0 | 35.7 |
| Years per Game | 3 | 57 | 15.27 |
| Supply Centres of Leaders | 5 | 23 | 15.18 |
| Leaders per Game | 1 | 6 | 1.08 |
| Survivors per Game | 2 | 7 | 5.16 |
For the definition of leader used here, Table 1a is the best measure of the ranking of the bots' relative strengths (at least for random selections from the available cohort of bots), the Score being proportional to the observed probability of each bot being a leader, with a mean of one. The Score covers a range of nearly 100:1 from strongest to weakest. Ignoring Man'chi RandBot, which has the least possible skill (short of actively calculating bad moves!), the score ratio is still nearly 10:1. (NB: Confusingly, Man'chi RandBot should play in the same way as ConsBot, not RandBot; the former pair avoid inconsistent moves, which are strictly illegal.) Man'chi RandBot comes bottom of all the tables, as it surely should. There is, perhaps, a strong suggestion that aggression pays off, considering the ranking by Score and the authors' outline descriptions of the bots' strategies, hinted at in some of their names.
The Plays in Table 1a are clearly very similar for each bot. The standard deviation is 6.6, compared with an expected value of 26.5 if no preference had been given to choosing the bot with the fewest plays, thereby validating the technique. Indeed, I believe (but cannot prove) that with the method used, as the number of games increases, the standard deviation of plays should rapidly approach a fixed limit, rather than increasing as the square root of the number of games. Note that having a small variation in Plays (and hence in the testing of each bot) is not essential, but helps to achieve more uniformly reliable results.
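That effect can be checked roughly by simulation; the sketch below compares the two selection policies under this tournament's parameters (names illustrative; hangs and reruns are ignored):

```python
import random
import statistics

def plays_sd(n_bots=10, per_game=7, n_games=1000, least_played_first=True):
    """Standard deviation of the per-bot play counts under either selection rule."""
    plays = [0] * n_bots
    for _ in range(n_games):
        if least_played_first:
            fewest = min(plays)
            first = random.choice([b for b in range(n_bots) if plays[b] == fewest])
            chosen = [first] + [random.randrange(n_bots) for _ in range(per_game - 1)]
        else:
            chosen = [random.randrange(n_bots) for _ in range(per_game)]
        for b in chosen:
            plays[b] += 1
    return statistics.pstdev(plays)

# least_played_first=True typically gives a value of the same order as the 6.6
# observed; least_played_first=False gives a value in the region of the 26.5
# expected for purely random selection.
```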
Table 1b shows the percentage of plays in which each bot was amongst the leaders of the game. The ranking is the same as that for Score (1a), which was most likely, especially as there was usually only one leader (mean 1.08), but it was not inevitable.
Each percentage of solos, shown in Table 1c, is necessarily no higher than that for the corresponding leaders. However, the ratio between given good and bad bots was generally greater for solos than for leaders. This may indicate a difference in "killer instinct"; that is, a bot may be good enough to be a leader frequently, but not have the flair to become a solo. Nevertheless, an ability to become a solo will tend to increase the score significantly: having more than half the supply centres guarantees a win and that no other bot (or power) will share the prize, and it ends the game while the bot (or power) is ahead, whereas slightly lower high scores often evaporate in a year or two. The Man'chi bots, except AngryBot, rarely had what it takes to be a solo and, as might be expected, Man'chi RandBot never did. The same 4 bots appear at the top of Tables 1a, 1b and 1c, with significant steps down to the lower values of Score, Leader % and Solo %, respectively.
The ordering for merely being a survivor, in Table 1d, shows some interesting differences. There may be a hint that aggression is less useful here, perhaps tending to be a risky all-or-nothing strategy. The percentage of plays in which a bot survived is necessarily no lower than that for it being a leader. As might be expected, though not inevitably, the spread of values is smaller: even Man'chi RandBot's value is only a bit less than half that of the highest. Perhaps, with any luck, a bot that is not picked on will simply survive. Although the Server declares a draw when it terminates a game without there being a solo, it would clearly be unfair, or at least give poor discrimination, to allocate points equally to all survivors.
Tables 2a to 2d show a similar analysis, but by powers rather than by bots. The scores have a range of nearly 5:1, so the luck of the draw appears to be very significant (at least in a game without press). The results confirm the commonly held belief about the difficulty of playing GER and AUS (presumed to be because they are surrounded by enemies). However, ITA is also generally considered difficult, probably the worst, to play, yet that was not the case here; indeed, it was the best power for becoming a solo. Maybe ITA is only difficult in a (proper) game with press in use.
Other scoring schemes could be computed from the saved results data; for example, all powers that survive until the end of the game could share the game points, or perhaps only when there is no solo. The method used here makes more complete use of the data and so should probably give the best discrimination of ability, but it may not be sensitive to the "killer instinct" needed to become a solo (and hence win a full game under standard rules). Results could also be analysed by each bot-power combination (perhaps DumbBot is exceptionally good when playing ENG), or by various bot combinations (perhaps DumbBot plays exceptionally well with DiploBot), which might be especially significant if press were used.
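A survivors-share scheme, say, could be computed from the same per-game data along these lines (again only a sketch, using the same assumed data layout as the earlier scoring sketch):

```python
from collections import defaultdict

def survivor_scores(games, powers=7):
    """Alternative scoring: each game's points are shared equally between all
    plays that still owned at least one supply centre when the game ended."""
    points = defaultdict(float)
    plays = defaultdict(int)
    for plays_in_game in games:
        survivors = [bot for bot, centres in plays_in_game if centres > 0]
        for bot, _ in plays_in_game:
            plays[bot] += 1
        for bot in survivors:
            points[bot] += powers / len(survivors)
    return {bot: points[bot] / plays[bot] for bot in plays}
```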
Fewer games would have finished without a solo if the Server had been set to terminate only after a larger number of repeated supply centre scores. However, even small increases can sometimes lead to much longer games, and they increase the theoretical danger of a non-terminating game (that is, a stable position except for, say, two powers swapping a supply centre each year), which would have prevented the tournament from progressing to further games. Perhaps a better compromise would be to increase the required number of repeated scores, but also make the tournament director terminate any game that runs for more than a certain number of game years and/or a certain real time. Nevertheless, the scores seemed to oscillate a lot during most games, so it is not obvious that longer games would be a better measure of anything; there appears to be a lot of chance involved. Probably a good bot primarily just needs to try to survive as long as possible, and further to try to keep its score high, to maximise its chance of finally scoring points (whatever scoring scheme is used). It may be that the bots used are rather erratic (RandBot certainly is, by definition); future bots might be more consistent, but if so, the better ones should simply tend to win more consistently under any tournament regime.
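One form that compromise could take in the tournament director is sketched below (the limits and names are purely illustrative):

```python
import time

MAX_GAME_YEARS = 100      # illustrative cap on game length in game years
MAX_REAL_SECONDS = 600    # illustrative cap on wall-clock time per game

def should_terminate(start_time, current_year, first_year=1901):
    """Backstop to a larger -kill setting: end a game externally if it runs on
    for too many game years or too much real time."""
    too_many_years = (current_year - first_year) >= MAX_GAME_YEARS
    too_long = (time.time() - start_time) >= MAX_REAL_SECONDS
    return too_many_years or too_long
```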
It would be useful to test other game variants to check the universality of ability. However, the Man'chi bots can currently be configured to play only one other variant, so it would not be possible to explore this dimension extensively without drastically reducing the variety of bots in the tournament. Furthermore, the analyses of the powers would be more complicated, and the length of the tournament would have to be extended in proportion to the number of variants to obtain results of the same accuracy.
It would be useful to repeat the tournament, or to analyse partitions of the results, to determine how reproducible the results are, how much effect various alternative scoring methods, say, would have, and the significance of quantization error (due to there being no gradation between winning and losing a given game).
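A simple check along these lines would be to split the saved games into two random halves and compare the scores computed from each, for whichever scoring function is of interest (sketch only):

```python
import random

def split_half_check(games, score_fn):
    """Compare scores computed on two random halves of the games, as a rough
    indication of how reproducible the rankings are."""
    shuffled = list(games)
    random.shuffle(shuffled)
    half = len(shuffled) // 2
    return score_fn(shuffled[:half]), score_fn(shuffled[half:])
```

For instance, passing the compute_scores sketch from earlier as score_fn would give two independent estimates of Table 1a, whose differences indicate the size of the sampling error.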
Tournaments would be faster if the Server and certain bots ran more reliably. (The DiploBot was probably the main culprit; its notes rightly claim that it can fail to start.) Adding resilience against such problems probably took the most effort while developing the tournament director. A further complication was the ad hoc code needed to hide bot windows that annoyingly obscure the Server window; this was not essential, but highly desirable. (Ideally, bots should always have an option not to display any windows.) Both these factors complicated the incorporation of each new bot, requiring bespoke code in many cases. So, for the present, it has not proved practical to package this tournament director so that arbitrary new bots can be added parametrically, but this could be done if new bots do not add further complications. (DTD defines bots parametrically, but does not seem to cope properly with all these problems.)