Statistical Threats to Metawatch

January is often a time when graduate students everywhere begin frantically cramming for various exams with words like “Qualifying”, “Comprehensive” and “Defense” in their names, which means it’s also the season of detailed statistical questions like “What are the assumptions of a t-test?”, “How do you test for proportional hazards and why do you care?” and “Who is Claude Shannon?” (this last one is reserved for people doing various metagenomics dissertations built on information theory without knowing they’re built on information theory).

So let’s talk about some of the statistical threats to how we look at the wargaming hobby.

Most readers of this blog are likely familiar with Metawatch, the Games Workshop approach to analyzing win rates in competitive 40K, which is used to monitor the health of the game and – presumably – to inform the development of upcoming Balance Dataslates and other moves to bring the game into balance. A recent White Dwarf even had a look at some of the awesome analytical tools the AoS team has access to. It’s a genuinely laudable effort, but one with a few glaring vulnerabilities. Let’s talk about some!

Confounding

Confounding, in a general sense, is when a relationship between two variables is induced by both of those variables being associated with a third variable that isn’t being considered. A classic example is the fact that smoking increases one’s risk of being murdered. Now there are a bunch of amusing “Just So” stories one can tell explaining why that might be (smoking breaks in poorly lit alleys, for example), but they tend not to be the kind of things that would be detected at a population level. The answer is actually a depressing society-level problem: poor people are more likely both to smoke and to be murdered – which creates an association between smoking and being murdered.

Epidemiologists use something called a Directed Acyclic Graph (DAG) to visualize this (usually for much more complex examples than this one):
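If you want to draw one yourself, here’s a minimal sketch in Python – using networkx and matplotlib, which are my tool choices for illustration, not anything the epidemiology literature mandates – of the smoking/murder DAG above:

```python
# A minimal sketch of the smoking/murder DAG, assuming networkx and
# matplotlib are installed (any graph-drawing library would do).
import networkx as nx
import matplotlib.pyplot as plt

dag = nx.DiGraph()
# Poverty is the confounder: it points at both smoking and being
# murdered, inducing an association between them despite there being
# no direct causal arrow from one to the other.
dag.add_edges_from([
    ("Poverty", "Smoking"),
    ("Poverty", "Being Murdered"),
])

# Fix positions so the confounder sits above its two descendants.
pos = {"Poverty": (0.5, 1.0), "Smoking": (0.0, 0.0), "Being Murdered": (1.0, 0.0)}
nx.draw_networkx(dag, pos, node_color="lightgray", node_size=4000,
                 arrowsize=20, font_size=8)
plt.axis("off")
plt.show()
```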

What’s this got to do with Warhammer?

Player Skill.

Let’s imagine there is some relationship between a given army choice and winning. But there are also relationships between player skill and army choice, and between player skill and winning. So we get a DAG that looks something like this, with player skill pointing at both army choice and winning:

Note here that I’ve deliberately played with the line thickness – it doesn’t take much of an association between army selection and game success for good players to flock to a particular army – but good players are also good at spotting opportunities in new armies before other people do. Here, because player skill is associated with both army choice and winning, it will amplify the measured effect of army selection on winning – essentially, misattributing the impact of good players to the army itself.

If tournaments were the only way to play 40K, this wouldn’t actually be a problem, because you’d effectively only be looking at good players (who are the only people who can win a tournament), which would break the association between skill and winning (or at least weaken it considerably).
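To make both of those points concrete, here’s a toy simulation in Python – every number in it is invented purely for illustration, not derived from any real Metawatch data. Skill drives both army choice and winning, the army’s true effect is tiny, and the naive win-rate gap badly overstates it, while restricting to top players shrinks the confounding:

```python
# A toy simulation of player skill confounding an army's measured
# win rate. All numbers here are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

skill = rng.normal(0, 1, n)  # latent player skill
# Good players are more likely to pick the "hot" army...
picks_hot_army = rng.random(n) < 1 / (1 + np.exp(-1.5 * skill))
# ...but winning depends heavily on skill and only slightly on the army.
true_army_effect = 0.02
p_win = np.clip(0.5 + 0.15 * skill + true_army_effect * picks_hot_army, 0, 1)
wins = rng.random(n) < p_win

def win_gap(mask):
    """Hot-army win rate minus everyone-else win rate, within mask."""
    return (wins[mask & picks_hot_army].mean()
            - wins[mask & ~picks_hot_army].mean())

everyone = np.ones(n, dtype=bool)
top_players = skill > np.quantile(skill, 0.9)  # a "tournament-winner" slice

print(f"True army effect:           {true_army_effect:.3f}")
print(f"Naive gap, all players:     {win_gap(everyone):.3f}")    # wildly inflated
print(f"Gap among top players only: {win_gap(top_players):.3f}") # much closer to truth
```

The restriction works because it throws away most of the variation in skill – which is exactly why it stops working the moment you generalize beyond that restricted group.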

But the results of Metawatch, while derived from tournaments, presumably aren’t simply restricted to them, as the Balance Dataslates are much more widely adopted than that. Which means this problem is back – if most of the performance of an army is actually attributable to player skill, then for average or below-average players (by definition, most players…) a nerf to that army will have an outsized effect.

Which brings us to…sampling bias.

Sampling Bias

Epidemiologists think about sampling bias a lot. Essentially, nearly all studies are not studies of the entire population, because that would be ludicrously expensive. Instead, we take a sample of the population and apply that result to the whole population – generalizing the result.

In a well-conducted, unbiased sample of the population, this is pretty easy.

In practice? It’s not.

For example, if you take a sample of people based on employment, you have to deal with what’s called the “healthy worker bias” – that is, people who are employed are, on average, healthier than people who aren’t (because some number of the people who are unemployed are too sick to work). People you can reach on phone surveys are, on average, older than the general population. Your study conducted in New York is probably an okay fit for Pennsylvania, middling for Virginia, and just broken for North Dakota – or is it? The list goes on. A tremendous amount of what epidemiologists, biostatisticians, and clinical trial researchers do is trying to fix this problem (clinical trials get rid of confounding in exchange for inducing some sampling bias).

Metawatch is using an extremely biased sample – competitive players who participate in events that are recorded/observed.

That means they miss anyone who doesn’t play competitively, as well as competitive players who don’t make it to large events for any number of reasons. Instead, we assume that the competitive scene is a representative sample of everyone – and it’s…not.
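As a toy sketch of what that does – again with invented numbers – imagine a faction whose win probability rewards skill more steeply than average. It looks great in a self-selected, high-skill event sample, and mediocre across the whole player population:

```python
# A toy sketch of sampling bias: estimating a faction's win rate from
# recorded events only. All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(7)
n = 500_000

skill = rng.normal(0, 1, n)
# Suppose the faction punishes mistakes: its win probability climbs
# steeply with player skill.
p_win = np.clip(0.45 + 0.20 * skill, 0.05, 0.95)
wins = rng.random(n) < p_win

# Only a skilled, self-selected slice shows up in recorded events.
attends_events = rng.random(n) < 1 / (1 + np.exp(-(2 * skill - 2)))

print(f"Win rate in the whole population: {wins.mean():.1%}")                 # mediocre
print(f"Win rate among event attendees:   {wins[attends_events].mean():.1%}") # looks strong
```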

How big of a deal is this?

We don’t know, because you can’t measure the thing you’re not measuring (while Donald Rumsfeld has much to answer for, “Unknown Unknowns” and “Known Unknowns” are profoundly useful concepts).

This one is, in my mind, more intractable. Confounding by player skill is something you can at least begin to approach using only the data Metawatch has (indeed, I did some very simple approaches to this a few years back). Dealing with this means getting data on the population you are trying to generalize to, which is by definition much harder. The players who enter the Metawatch data set are passively collected – they’re coming to event organizers, and providing their data by virtue of playing. The people who aren’t have to be found, have to be surveyed, etc.

Measurement Error

Measurement error is actually one of the smaller threats to what Metawatch does – in my opinion – but one of the bigger threats when we look at how the community reacts to how tournaments are reported.

Measurement error is exactly what it sounds like. When you were weighed on a scale, you were wearing hiking boots and holding a 300-page hardcover book (true story). Our instrument for collecting some variable is poorly calibrated. Like your buddy’s new Cheatmaster 5000 Combat Gauge that he 3D printed. Or, apparently, Games Workshop.

If this is random, it’s fine – annoying, but fine. If it’s not random, it’s a problem.

Where does this come into a discussion of balance? When you take “An Ork Army Played by Jimmy Bestintheworld, composed of the following units” and translate that down to “Orks”. So then we can have Twitter and Reddit conversations about how “Orks are broken” that apply equally to Jimmy’s weird skew build and to some guy who has built a bunch of battlewagons out of vintage toy cars.
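A toy illustration of that coarsening, with invented numbers: pool two very different “Ork” lists under one label, and the headline win rate describes neither of them.

```python
# A toy sketch of measurement error by coarsening: two very different
# "Ork" lists collapsed into one faction label. Numbers are invented.
import numpy as np

rng = np.random.default_rng(3)

# In the tournament data, the tuned skew build dominates...
n_skew, p_skew = 6_000, 0.62
# ...while the toy-car battlewagon build barely shows up.
n_casual, p_casual = 2_000, 0.44

skew_wins = rng.random(n_skew) < p_skew
casual_wins = rng.random(n_casual) < p_casual
pooled = np.concatenate([skew_wins, casual_wins])

print(f"Skew build win rate:      {skew_wins.mean():.1%}")
print(f"Casual build win rate:    {casual_wins.mean():.1%}")
print(f"Headline 'Orks' win rate: {pooled.mean():.1%}")  # describes neither list
```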

This one, again, is not Metawatch’s fault so much as something we as a community need to be mindful of. The biased sample of tournament-attending armies is not your friend’s army.

Feedback Loops

This last one is less a methods issue and more a case of…well…why balance is hard. For most conventional statistics, there’s an assumption that you’re not trying to hit a moving target. There is an effect of Drug X on Cancer Y, and it doesn’t matter if Cancer Y read a paper about Drug X a week ago.

The problem with tournament statistics is that tournaments respond to the results of tournaments. This is the whole reason “The Meta” exists as a concept. And the meta evolves toward an equilibrium if left to its own devices (see a simple example of how this works here). The new World Eaters book is really good, and the Bloodslayer Eight-Fold Slaughter Pack™ is wildly underpriced, and all of a sudden there are a number of commission-painted World Eaters armies skewed heavily toward Bloodslayer Eight-Fold Slaughter Packs at the next tournament. They win, and suddenly the local competitive folks have some primer-grey ones with suspicious print lines on them, and so it goes.

The tricky bit is that the meta won’t necessarily head to a good equilibrium – just an equilibrium. “Stores now only need to carry Bloodslayer Eight-Fold Slaughter Packs” (I need to stop typing that) is an equilibrium. Everyone playing with the exact same army is technically a balanced game – there’s now a 50/50 win rate dependent only on skill.
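Here’s a minimal replicator-dynamics sketch of a meta grinding its way to exactly that kind of equilibrium – the payoff numbers are made up, and “Slaughter Packs” here is just my stand-in for whatever happens to be slightly favored against everything. Players imitate whatever is winning, and the meta collapses onto a technically balanced mirror match:

```python
# A toy replicator-dynamics model of a meta converging to an
# equilibrium. Payoff numbers are invented for illustration.
import numpy as np

armies = ["Slaughter Packs", "Army B", "Army C"]
# win_prob[i, j] = probability that army i beats army j.
win_prob = np.array([
    [0.50, 0.58, 0.55],
    [0.42, 0.50, 0.52],
    [0.45, 0.48, 0.50],
])

shares = np.array([0.10, 0.45, 0.45])  # initial meta shares
for _ in range(60):
    expected_win = win_prob @ shares  # each army's win rate against the field
    shares = shares * expected_win    # players imitate whatever is winning
    shares = shares / shares.sum()    # renormalize to meta shares

for name, share in zip(armies, shares):
    print(f"{name:16s} {share:.1%}")
# Ends with nearly the whole meta on Slaughter Packs – a stable, boring
# equilibrium with a 50% mirror-match win rate.
```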

It’s just boring AF and a financial disaster.

This is the hard part of balance: applying enough pressure to nudge the game toward a good equilibrium – one with diverse armies and so on – while fighting against the tendency of any system like this to head toward whatever equilibrium is easiest to reach. And this is the hard part of Metawatch as a job – you can observe past states of the game, but building predictive models of how things will react in the future is a fairly complicated process, and one that’s ever so slightly adversarial (it is not in the interest of any given competitive player that the game be balanced, even if as a whole we would benefit from it). That’s messy, and well beyond what descriptive statistics like the ones Metawatch discusses (at least publicly) are capable of handling.

So…What Then?

Should they just give up?

Well…no. And I say this despite the dearth of Warhammer statistical analysis on this blog in recent years, even though that analysis was the foundation of…whatever terribly modest impact it’s had.

But it is important that we recognize the limits of what our current level of Warhammer-focused statistics can accomplish. Metawatch is an extremely laudable effort, done by the company that should be doing it, with access to tools and information that, to be frank, take some degree of funded staff to make really worthwhile. But it cannot tell us the whole story.

What are my fixes?

First – and this is something I talk about a lot – don’t assume the competitive scene is the only, or even the dominant, way to play 40K. Games Workshop doesn’t. The second is to use knowledge you have that Games Workshop doesn’t: your local scene. It doesn’t matter if Metawatch says an army has a 43% win rate in tournaments if your local player of that army has an 80% win rate – in a casual setting or a narrative campaign, it’s still okay to ask them to tone it down. In my own local group, we have implemented a fairly heavy nerf to Necrons, not because Games Workshop has told us to, but because the local Necron players are very good at what they do. Nor should you assume some stranger is necessarily fielding a tournament-optimized list, and hassle them over whatever they’re playing before you know what they’re playing.

And most of all? Recognize that this is hard, and that Games Workshop doesn’t have the tools – either in the fidelity of the data they collect or in how they can influence the game – to fine-tune things in the way they might need to for a truly balanced competitive scene. When all you have is a hammer, everything looks like free wargear for Space Marines.

 

Enjoy what you read? Enjoyed that it was ad free? Both of those things are courtesy of our generous Patreon supporters. If you’d like more quantitatively driven thoughts on 40K and miniatures wargaming, and a hand in deciding what we cover, please consider joining them.
