This post:
"For a typical rating formula, we're going to make two assumptions:
1) Every player does have an underlying level of skill, and on average they play each game with that skill. However, because Diplomacy is not a deterministic game like chess or go, the results for a given skill are pretty disparate. In other words, in chess an 1800 will beat a 1400 far more reliably than a GR 180 will beat a GR 140 in Diplomacy.
2) That underlying level of skill does change over time. However -- and this assumption lies at the core of what we're going to do -- we assume that this change is relatively slow in the sense that it takes a number of games played to become substantially better or worse. If it didn't, then you would want the rating system to be based upon just the last few games, after all, because the others would be irrelevant.
OK, so given both of those things, how should a rating system be designed? First off, it should have two numbers, not one. The current system simply has a rating. However, we need to keep track of two numbers: a rating and an uncertainty. The rating should be, quite simply, our best attempt at describing the skill with which the player has played all of their games so far. So, if a player has played just one game in his life, but soloed against the top 6 players in the world, his rating would be the highest in the world, because that's the level of play he's shown over his entire career. Note that treating the rating this way also helps with another aspect of GR; currently it discourages you from playing newbies with good records, because they're probably stronger than GR gives them credit for. This isn't something we want to be actively discouraging, right?
The uncertainty tells you how likely we are to be wrong about our measurement of their skill. For a player who has played just one game, we are very, very uncertain. Using the current GR formula, our player who soloed against the top 6 in the world is unlikely to, in fact, turn out to be a GR 50 player, or even a GR 100 player. But, it wouldn't surprise you too much to see them, in the long run, as a GR 200 player or as a GR 600 player, right? So there's an enormous uncertainty in that rating. On the other hand, a player who has played 1000 games has a very low uncertainty: we know almost exactly how strong he is.
When you tabulate a new game, both of these numbers matter. You use the current ratings as a measure of the skill you are up against, but you also use the uncertainties. The more certain you are about the skill of the players you're up against, the more weight the game has, because it's better information. The more certainty in your own rating, the less weight the game has, because you already have a lot of information, so it's not adding as much that's new. Finally, note that uncertainty should be time-based, too. Your uncertainty should always be slowly growing over time, but reduced every game you play. This means that older games progressively fade into the background, which we wanted (see #2 above).
Finally, one last problem: the reason that GR was done this way, presumably, was that you didn't want a player who played one game and soloed against a strong table to end up atop the list. One option is just to only report players who have a low enough uncertainty, meaning they've played enough games recently. But, I understand wanting every player to appear on the list. Hence another idea: rather than reporting our rating, we report a number that is, say, 2 standard deviations *below* our measured rating, using the uncertainties we are also tracking. In other words, we are reporting not the level of skill they have shown, but the minimum level of true skill we are pretty certain that they have. And this would solve the problem of how to include players in the list who have played a small number of games, without having to distort the entire system because of it.
Finally, orathaic, you're right that gunboat has higher variance. How would we account for that in this system? Simple: your point is well taken that gunboat provides less information from one game than full press, so you treat it as less information. Meaning, it's information, but information which is not going to reduce the uncertainty as much as better information would. So, all things being equal, it will take more gunboat games to learn how strong a player is than full press games, which is the behavior you're going for."
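To make the quoted idea a bit more concrete, here is a rough sketch of how I read it. This is not the GR formula and not necessarily what the quoted poster has in mind: I'm assuming a simple Gaussian (Kalman-filter-style) update, and every name and default here (Player, drift, update, reported, game_noise, drift_per_day) is made up for illustration. In particular, "performance" stands for the skill level a single game result suggests, and how you'd compute that from a seven-player Diplomacy result is left open.

```python
import math
from dataclasses import dataclass


@dataclass
class Player:
    rating: float  # best estimate of the skill shown so far
    sigma: float   # uncertainty (standard deviation) of that estimate


def drift(p: Player, days: float, drift_per_day: float = 0.5) -> Player:
    """Uncertainty grows slowly with time, so old games fade out.

    drift_per_day is an assumed tuning constant, not a known value.
    """
    new_sigma = math.sqrt(p.sigma ** 2 + (drift_per_day ** 2) * days)
    return Player(p.rating, new_sigma)


def update(p: Player, performance: float, opp_sigma: float,
           game_noise: float) -> Player:
    """Blend one game's performance estimate into the rating.

    opp_sigma:  how unsure we are about the opposition's strength.
    game_noise: how noisy this kind of game is (assumed larger for
                gunboat than for full press).
    """
    # Variance of the information this one game gives us.
    obs_var = game_noise ** 2 + opp_sigma ** 2
    # Weight of the game: large if we are unsure about the player,
    # small if the game itself is poor information.
    gain = p.sigma ** 2 / (p.sigma ** 2 + obs_var)
    new_rating = p.rating + gain * (performance - p.rating)
    new_sigma = math.sqrt((1.0 - gain) * p.sigma ** 2)
    return Player(new_rating, new_sigma)


def reported(p: Player, k: float = 2.0) -> float:
    """Conservative number for the list: k standard deviations below."""
    return p.rating - k * p.sigma


# Usage: a newcomer and a veteran both put in a 200-level performance.
newbie = update(Player(100.0, 80.0), 200.0, opp_sigma=10.0, game_noise=50.0)
veteran = update(Player(100.0, 10.0), 200.0, opp_sigma=10.0, game_noise=50.0)
print(newbie, veteran, reported(newbie))
```

With these assumptions the newcomer's rating moves a lot and the veteran's barely moves, a noisier game type (bigger game_noise) shrinks the uncertainty less per game, and a one-game wonder gets a high rating but a much lower reported number until the uncertainty comes down.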
Makes me more interested to speak to CS. The reason I haven't done this is twofold:
1. I started off trying to keep it as simple as possible, so it was more likely to be adopted.
2. I have some more ideas which I would want to do if I were going to rewrite GR.