Is Warhammer Balanced?

This is an archival version of a post on another blog that makes sense to have here in this context as well – if for no other reason than as a memorial to the first time I threw statistical software at a miniatures wargame. Keep in mind this post is fairly old, so it no longer reflects the current meta, nor my thoughts on analysis, explanation and visualization, which have been improved by another half-decade of graduate school.

So this blog is rapidly becoming a public repository for thoughts both nerdy and statistical. Not entirely sure how I feel about that, but it would be a shame to break precedent. Today we’ll be talking about miniatures wargames. For those of you who have escaped this time-consuming, expensive, and somewhat oddball hobby, the gist is this: players, using small painted figures (essentially toy soldiers) representing factions either real or imagined, fight out battles using a set of rules. Think complicated chess and you’re on the right track. Or Google “Warhammer”.

The question today is: Is the miniatures wargame “Warhammer”, or its sci-fi sibling “Warhammer 40,000”, balanced? That is, can a player using one faction reasonably expect to beat another player using a different faction because of skill or luck, rather than because one faction is inherently more powerful? Answer after the jump.

How to go about answering this? Statistics!

Methods: The data for this analysis was pulled from the publicly available results of a recent tournament, the Throne of Skulls.

Each tournament had three “heats”, which were pooled together to get both the largest sample size and largest number of players for each respective game. I then added a variable representing the order in which the supplement for each faction (an “Army book” or “Codex”) was released – this will come up later. Then, in SAS 9.2 and JMP 7, both by the SAS Institute, I ran a one-way ANOVA to compare the mean “Gaming Total”, or score for the tournament, across the various factions, with post-hoc pairwise comparisons done using Tukey’s HSD.
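For readers without SAS or JMP, the core of a one-way ANOVA can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis. The faction names and scores are made up, and `one_way_anova` is a hypothetical helper. It computes only the overall F statistic; the Tukey's HSD step needs the studentized range distribution and is left out.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a dict of faction -> list of scores."""
    all_scores = [s for scores in groups.values() for s in scores]
    grand_mean = mean(all_scores)
    # Between-group sum of squares: how far each faction mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Within-group sum of squares: spread of players around their own faction mean.
    ss_within = sum((s - mean(g)) ** 2 for g in groups.values() for s in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical "Gaming Total" scores, NOT the Throne of Skulls data.
scores = {
    "Faction A": [100, 110, 120],
    "Faction B": [80, 90, 100],
    "Faction C": [60, 70, 80],
}
f_stat, df_b, df_w = one_way_anova(scores)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means the faction means differ by more than the player-to-player spread within each faction would suggest. It does not say which factions differ, which is why a pairwise procedure follows the ANOVA.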

To answer a secondary question, whether newer army books are more powerful (“Codex Creep”), I performed a logistic regression to see if the odds of placing in the top 95th percentile of players differed between players using newer and older army books. Competitors missing data on what faction they played were excluded from the analysis, as were two Space Marine players who had negative scores, presumably due to being appalling sportsmen.
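The logistic regression can be sketched the same way. This is a rough illustration under stated assumptions, not the SAS model from the analysis. `fit_logistic` is a hypothetical helper that fits a single-predictor model by Newton-Raphson. The release-order values and the top-placement flags are invented.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Fit log-odds(y = 1) = b0 + b1 * x by Newton-Raphson; return (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # entries of the 2x2 information matrix
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: release order of each player's army book (1 = oldest)
# and whether that player placed in the top bracket (1) or not (0).
release_order = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4]
placed_top    = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(release_order, placed_top)
odds_ratio = math.exp(b1)  # odds multiplier per one-step-newer book
print(f"odds ratio per release step: {odds_ratio:.2f}")
```

Exponentiating the slope gives the odds ratio per one-step increase in release order, which is the quantity reported in the results. The sketch omits standard errors and will not converge if the outcome is perfectly separated by release order.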



Players of Warhammer Fantasy Battles had some clear favorite tournament armies, most notably Daemons of Chaos and Vampire Counts, followed up by Dark Elves and then a smattering of everything else (Figure 1). This likely reflects the player perceptions that these are particularly “powerful” armies.

So what then of the actual results? Daemons of Chaos did have the highest mean score (108.7 of a maximum of 180), while Beasts of Chaos had the lowest (60.7). Figure 2 shows some nice boxplots of the performance of the armies.

Note the considerable variation in the performance of all the armies. In pairwise post-testing, the “high performing” armies of Vampire Counts, Daemons of Chaos and Dark Elves were only significantly better than a handful of armies, notably the Empire, High Elves, Dwarfs, Orcs & Goblins and Beasts of Chaos. Most of the rest are in the middle ground, and we cannot rule out the current results being entirely due to chance.

As for the results of the logistic regression: The odds of a player placing in the top 95th percentile of players were 1.35 (95% CI: 1.09, 1.67) times those of a player using the next oldest army book, indicating an increasing likelihood of doing well using books that have been more recently released.
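To get a feel for what an odds ratio of 1.35 per release step means, here is a small worked example. It rests on two assumptions not stated in the post: the per-step odds ratio holds across several steps, as the model implies, and a player with the older book starts with a 5% chance of a top placing. The 5% baseline reads "top 95th percentile" as the top 5% of players and is purely illustrative.

```python
def odds_multiplier(per_release_or, releases_newer):
    """Odds multiplier implied by a per-step odds ratio over several release steps."""
    return per_release_or ** releases_newer

def odds_to_probability(odds):
    """Convert odds (p / (1 - p)) back to a probability."""
    return odds / (1.0 + odds)

fantasy_or = 1.35                               # reported odds ratio per release step
three_newer = odds_multiplier(fantasy_or, 3)    # book three releases newer

baseline_probability = 0.05                     # ASSUMED baseline, for illustration only
baseline_odds = baseline_probability / (1.0 - baseline_probability)
newer_probability = odds_to_probability(baseline_odds * three_newer)

print(f"odds multiplier over three releases: {three_newer:.2f}")
print(f"illustrative top-placing chance: {newer_probability:.1%}")
```

Under those assumptions, a book three releases newer multiplies the odds by about 2.46, taking the illustrative 5% chance to roughly 11.5%. The multiplier applies to odds, not to probabilities directly.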


Warhammer 40,000 players had some clear favorites as well, with Chaos Space Marines, Space Marines and Eldar occupying the top three spots (Figure 3).

From the results of the ANOVA, Orks, the 4th most popular army, had the highest mean score (102.5, again out of a possible 180), while the Black Templars had the lowest (67.6). Figure 4 has the boxplots of the various army performances.

Observe there’s somewhat less variation than in the Fantasy data – the only statistically significant difference is between Orks and the Space Marines, a surprisingly popular yet low performing army (mean score = 76.0). The logistic regression results similarly ran contrary to the Fantasy findings. The odds of placing in the 95th percentile for a player are 1.07 (95% CI: 0.95, 1.21) times those of a player using the next oldest codex – a relationship that, again, may very well be due to chance alone.
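One way to compare the two regression results is to back out approximate test statistics from the reported confidence intervals. The sketch assumes the intervals are standard Wald intervals, symmetric on the log-odds scale. The post does not confirm this, and `wald_from_ci` is a hypothetical helper. Because the inputs are rounded to two decimals, the outputs are rough.

```python
import math

def wald_from_ci(odds_ratio, ci_low, ci_high, z_crit=1.96):
    """Approximate Wald z statistic and two-sided p-value from a reported 95% CI."""
    log_or = math.log(odds_ratio)
    # On the log scale a Wald interval is estimate +/- z_crit * SE,
    # so its width is 2 * z_crit * SE.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = log_or / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided standard normal tail
    return z, p_value

fantasy_z, fantasy_p = wald_from_ci(1.35, 1.09, 1.67)   # Warhammer Fantasy
forty_k_z, forty_k_p = wald_from_ci(1.07, 0.95, 1.21)   # Warhammer 40,000

print(f"Fantasy: z = {fantasy_z:.2f}, p = {fantasy_p:.3f}")
print(f"40K:     z = {forty_k_z:.2f}, p = {forty_k_p:.3f}")
```

The Fantasy interval excludes 1 and gives a p-value below 0.05. The 40K interval straddles 1 and does not, which matches the reading that the 40K relationship could be due to chance.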


So what’s this all mean? Well, first, a caveat. As with all studies, this one has limitations. Of most concern to me is the fact that in the Fantasy data, the most popular armies are also the best performing. It is possible that the best players gravitate towards “better” armies, and what we are actually seeing is better players, not better armies, placing higher. This might also explain why the Space Marines have a lower average score than the Blood Angels, Dark Angels and Space Wolves – all Space Marines derivatives with older, and arguably less mechanistically powerful, army books. In short, it is likely our study has some residual confounding, although I hoped that by using active tournament players we could at least partially account for player skill. A quantitative measure of “skill” that is independent of tournament standings eludes me. As an aside, this problem is extremely common in Epidemiology, and is known as “confounding”. Hence the name of this blog.

But, in short, the answer is no, the game is not balanced. Warhammer Fantasy especially suffers from several overperforming army lists, as well as statistically significant Codex creep. Warhammer 40,000 seems to suffer less from these issues, although the overwhelming popularity of the Space Marines (and their Chaos cohorts) among new players may be masking some effects. Nevertheless, for the moment, it appears that in Warhammer 40K tournaments, the winner may be comfortable in the conclusion that his victory is due to skill and the dice, rather than what book he bought.


  1. Awesome post. Essay 40k FTW! I’m a scientist as well so I really appreciated the layout of this post and so on. I think one of the problems with your analysis is that the samples are not independent: since they were all in the same tournament together, they all influence each other in terms of score.
    Have you looked into Torrent of Fire’s data on Warhammer 40k? The guy who runs it has this enormous data set but no background in statistics. All he can really show is percentages, without any error bars or significance.


    1. Tim – The data is actually from Torrent of Fire. Within a single tournament I think the non-independence is fine, because, well, I’m mostly talking about the LVO. But over multiple tournaments, which I hope to do over time, it’s definitely something that will need to be taken into account. And that will definitely be somewhat more complex.

