"I don't see the convergence quality too critical. A single diplomacy game's result is by far less representative for the players 'qualities' than e.g., Chess, or especially Go, where the results of single games are _very_ predictable if there is a relevant difference in the players' scores. "
So, now you're getting to the core of the problem, which is that the current GR formula is poorly chosen regardless of what weighting you pick. Let me be a little bit more abstract for a moment, apologies in advance!
For a typical rating formula, we're going to make two assumptions:
1) Every player does have an underlying level of skill, and on average they play each game with that skill. However, because Diplomacy is not a deterministic game like chess or go, the results for a given level of skill are pretty disparate. In other words, in chess an 1800 will beat a 1400 far more reliably than, in Diplomacy, a GR 180 will beat a GR 140 (see the sketch right after this list).
2) That underlying level of skill does change over time. However -- and this assumption lies at the core of what we're going to do -- we assume that this change is relatively slow, in the sense that it takes a number of games played to become substantially better or worse. If it changed quickly, then you would want the rating system to be based on just the last few games, after all, because the older ones would be irrelevant.
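To make assumption #1 concrete, here's a toy simulation in Python (every number in it is invented for illustration). Model each game's performance as true skill plus per-game noise; the only difference between "chess" and "Diplomacy" below is how large that noise is:

```python
import random

def win_rate(skill_a, skill_b, noise, n=100_000):
    # Each game's performance = true skill + per-game noise;
    # count how often player A outperforms player B.
    wins = sum(
        random.gauss(skill_a, noise) > random.gauss(skill_b, noise)
        for _ in range(n)
    )
    return wins / n

# Same 40-point skill gap, two hypothetical noise levels:
print(win_rate(180, 140, noise=15))   # ~0.97: chess-like, very predictable
print(win_rate(180, 140, noise=60))   # ~0.68: Diplomacy-like, much noisier
```

Same skill gap, wildly different predictability: that's the whole point of assumption #1.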
OK, so given both of those things, how should a rating system be designed? First off, it should have two numbers, not one. The current system keeps just a rating. Instead, we need to keep track of two numbers: a rating and an uncertainty. The rating should be, quite simply, our best attempt at describing the skill with which the player has played all of their games so far. So, if a player has played just one game in his life, but soloed against the top 6 players in the world, his rating would be the highest in the world, because that's the level of play he's shown over his entire career. Note that treating the rating this way also helps with another aspect of GR: currently it discourages you from playing newbies with good records, because they're probably stronger than GR gives them credit for. That isn't something we want to be actively discouraging, right?
The uncertainty tells you how far off our measurement of their skill is likely to be. For a player who has played just one game, we are very, very uncertain. In GR terms, our player who soloed against the top 6 in the world is unlikely, in fact, to turn out to be a GR 50 player, or even a GR 100 player. But it wouldn't surprise you too much to see him settle, in the long run, at GR 200 or at GR 600, right? So there's an enormous uncertainty in that rating. On the other hand, a player who has played 1000 games has a very low uncertainty: we know almost exactly how strong he is.
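In code, a player's entire state is just those two numbers. A minimal sketch, with figures made up to match the examples above:

```python
from dataclasses import dataclass

@dataclass
class PlayerRating:
    mu: float     # best estimate of the skill shown so far
    sigma: float  # uncertainty: how far off mu could plausibly be

# The one-game wonder gets a sky-high estimate with enormous
# uncertainty; the 1000-game veteran is pinned down precisely.
one_game_wonder = PlayerRating(mu=400.0, sigma=150.0)
veteran         = PlayerRating(mu=210.0, sigma=10.0)
```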
When you tabulate a new game, both of these numbers matter. You use the current ratings as a measure of the skill you are up against, but you also use the uncertainties. The more certain we are about the skill of the players you're up against, the more weight the game carries, because it's better information. The more certain we already are about your own rating, the less weight the game carries, because you already have a lot of information, so the game adds less that's new. Finally, note that uncertainty should be time-based, too: your uncertainty should always be slowly growing over time, but reduced every game you play. This means that older games progressively fade into the background, which is what we wanted (see #2 above).
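Here's a rough sketch of that bookkeeping, done as a one-dimensional Bayesian (Kalman-style) update rather than the real Glicko/TrueSkill math. Assume `perf` is a performance number extracted from the game result and the opponents' ratings, and `perf_noise` says how reliable that number is (smaller when the opponents' ratings are well established); both are placeholders, not a worked-out Diplomacy scoring rule:

```python
import math

def decay(sigma, time_idle, c=5.0):
    # Uncertainty slowly grows while you don't play, so old
    # results gradually fade (assumption #2 above).
    return math.sqrt(sigma**2 + (c**2) * time_idle)

def update(mu, sigma, perf, perf_noise):
    # Blend the prior belief N(mu, sigma^2) with the new evidence
    # N(perf, perf_noise^2), weighting each by its precision.
    prior_p = 1.0 / sigma**2        # precision of what we already knew
    obs_p   = 1.0 / perf_noise**2   # precision of the new game
    new_mu    = (mu * prior_p + perf * obs_p) / (prior_p + obs_p)
    new_sigma = math.sqrt(1.0 / (prior_p + obs_p))
    return new_mu, new_sigma
```

Both behaviors fall out automatically: a small `sigma` (we already know you well) means the game barely moves `mu`, while a small `perf_noise` (well-measured opponents) means it moves `mu` a lot, and every update shrinks `sigma` while `decay` grows it back between games.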
Now, one last problem: the reason that GR was done this way, presumably, was that you didn't want a player who played one game and soloed against a strong table to end up atop the list. One option is to report only players whose uncertainty is low enough, meaning they've played enough games recently. But I understand wanting every player to appear on the list. Hence another idea: rather than reporting our measured rating, we report a number that is, say, 2 standard deviations *below* it, using the uncertainty we are also tracking. In other words, we are reporting not the level of skill they have shown, but the minimum level of true skill we are pretty certain they have. And this would solve the problem of how to include players who have played only a small number of games, without having to distort the entire system because of it.
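Continuing the sketch, the reported number is then one line (this is the same trick TrueSkill uses for its leaderboards, just with my invented figures):

```python
def displayed_rating(mu, sigma, k=2.0):
    # Report a skill level we're quite confident the player
    # actually exceeds: k standard deviations below the estimate.
    return mu - k * sigma

print(displayed_rating(400.0, 150.0))  # one-game wonder: 100.0, ranks modestly
print(displayed_rating(210.0, 10.0))   # veteran: 190.0, barely below his estimate
```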
Finally, orathaic, you're right that gunboat has higher variance. How would we account for that in this system? Simple: your point that gunboat provides less information per game than full press is well taken, so you treat it as exactly that: less information. It still counts, but it doesn't reduce the uncertainty as much as better information would. So, all other things being equal, it will take more gunboat games than full press games to learn how strong a player is, which is exactly the behavior you're going for.
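In the sketch above, that's nothing more than handing `update()` a larger `perf_noise` for gunboat games (again, the noise values are made up):

```python
mu, sigma = 200.0, 60.0
# The same measured performance moves the rating less, and
# shrinks the uncertainty less, when it comes from gunboat.
print(update(mu, sigma, perf=260.0, perf_noise=40.0))   # full press: ~(241.5, 33.3)
print(update(mu, sigma, perf=260.0, perf_noise=100.0))  # gunboat:    ~(215.9, 51.4)
```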