So, we've discussed integrating ratings with the site before. This would be awesome, because it would mean we don't have to wait for the end of the month (ish) for ratings, and could do other fun things like have graphs of how much your rating changed on a per game basis, and automatic categories. Maybe graphs of your rating vs chosen rival players so you can compete? There are many possibilities.
It would also mean that we could put your rating number where points are on the profile, and potentially move towards replacing points (although, yes, points do a number of things like preventing newbies from signing up for 100 games, which we'd have to make sure are handled by other features).
The first iteration of ratings integration won't remove points. But we would like to replace GR.
I'm making this post because that's obviously a pretty major change, and without the support of the community, I'm not prepared to do it (especially after the SWS debate).
So! The summary of this post is going to be "Would you support integrated ratings if they were a different system to GR?"
Broadly, I'm proposing the following changes:
* We'd no longer produce a GR list every month
* Ratings would be integrated into the site, and shown on the user's profile/hall of fame/username
* There'd be a graph on your profile page showing your rating change over some period.
* There would be categories - probably just gunboat and FP to start with
* It is likely that we'd only include Classic games to start with
I'm going to go into some gory details of why we think this would be a good change. You can skip this bit if you're not a ratings nerd. If you are - well, strap in!
----
Let's start with some quick definitions. In the context of rating players in games, we have:
Score: The outcome from a game. This tells you how good a particular game result is under a particular scoring system. On webdip, we have three scoring systems (counting those used on past games) - DSS, SWS, and SoS.
Rating: Rating tells you how good a particular game score is for a particular user. Did a newbie get eliminated vs the best on the site? Well, that's not a bad result. Did the best on the site get eliminated vs a bunch of newbies? That's a pretty terrible result.
Our scoring systems are very easy to model (ignoring points for a moment): In DSS, you score 1/(number of winners), where "winners" includes draw participants. You score 0 otherwise. SWS scores 1/(draw size) in a draw, or (number of centres)/34 in a solo. You score 0 otherwise. Modelling the score this way is what GR currently does.
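The two draw-based scoring rules above are simple enough to sketch directly. This is just my own restatement of the description, not the site's actual code:

```python
def dss_score(is_winner, num_winners):
    """DSS: winners (including draw participants) split 1 equally; everyone else scores 0."""
    return 1 / num_winners if is_winner else 0

def sws_score(outcome, draw_size=None, centres=None):
    """SWS: 1/(draw size) in a draw, (centres)/34 in a solo, 0 otherwise."""
    if outcome == "draw":
        return 1 / draw_size
    if outcome == "solo":
        return centres / 34
    return 0
```

So a three-way DSS draw scores each survivor `dss_score(True, 3)`, i.e. 1/3, while everyone eliminated scores 0.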
It would be possible to include the point pot as part of the scoring system (higher-pot games being worth more), but I haven't looked into it. Personally, I don't play with the point pot in mind, and I'm assuming most players in the top 200-300 or so don't either.
So, what's wrong with GR?
GR is (loosely) based on the Elo rating system. Elo is primarily used in the chess world, and is a 0-sum way of redistributing rating points between two players to infer a ranking. The basic idea is that if two players A (ranked highly) and B (ranked low) play a game, then we expect A to beat B.
If A beats B, then we figure that our ranking is correct, and A only gets a small boost to their rating, while B gets a small decrease. However, if the unexpected happens and B beats A, we figure both are ranked incorrectly, and A gets a large drop in rating, whereas B gets a correspondingly large increase.
The formula for each player after a game is:
R(new) = R(old) + K * (A-E)
R = rating
A = actual outcome
E = expected outcome, typically modelled as a logistic curve
K = some constant to determine the maximum rating change
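The formula above can be sketched in a few lines. The logistic curve with a 400-point scale and K=32 are standard chess conventions, used here purely for illustration:

```python
def elo_expected(r_a, r_b):
    """Expected score for player A against player B (logistic curve, chess convention)."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_old, actual, expected, k=32):
    """R(new) = R(old) + K * (A - E)."""
    return r_old + k * (actual - expected)

# A 1400-rated player is expected to beat a 1000-rated player ~91% of
# the time, so the favourite gains little from a win, while an upset
# moves both ratings a lot:
# elo_update(1400, 1, elo_expected(1400, 1000))  -> ~1402.9 (small gain)
# elo_update(1000, 1, elo_expected(1000, 1400))  -> ~1029.1 (big gain)
```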
This is a good system; it means that if you get beaten by someone who is (rated) much better than you, your rating doesn't really suffer. It also means that if you beat someone who is (rated) much better than you, then you get a big boost in rating. Elo has some drawbacks (ratings inflation, incentive for highly ranked players not to play), but I'm not actually going to go into them here.
This sounds great, what's wrong with GR?
GR doesn't follow some of the principles of Elo. Here's how GR works in DSS:
K = sum(R(everyone in this game))/17.5
E=R(old)/sum(R(everyone in this game))
This K factor is incredibly unusual, since it means that the rating change possible after a game depends on who is playing in it. It also has some strange consequences, because the sum-of-ratings term in the K factor cancels the denominator of the expected-outcome calculation. Specifically:
1) Losses always cost the same in GR.
GR can be thought of as "betting" about 6% of your rating (1/17.5, or about 5.7%). Everyone puts that stake into a pot, and the pot is divided amongst the winners. This means that you lose about 6% of your rating in a loss, no matter whether we expected you to lose or expected you to win. This seems wrong.
If the top player on the site gets beaten by a bunch of low ranked players, they should take a big rating hit. GR treats that loss as equivalent to being beaten by other top players.
2) A solo against equally rated players is "worth" more if all the players are top rated.
So, if 7 players rated at 100 play a game, and one of them solos, they get a rating boost of 34.2.
If 7 players rated at 200 play a game, and one of them solos, THEY get a rating boost of 68.5.
It seems wrong for the boost to be larger when you're all rated higher (but equal).
In Elo, you're supposed to get bigger rating boosts for beating players who are better than you - not for beating players who are equally rated. It's always impressive to solo over 7 players who are as good as you are - and I don't think it gets more impressive as those players get better.
Both of these issues are violations of the behaviour of Elo. You might argue that it doesn't matter - but the problem is measurable.
You can measure the prediction error between the expected outcome and the actual outcome, and see how "good" a system is at predicting game outcomes. We can do this with RMSE - the Root Mean Square Error (smaller is better). If you just look at DSS games, then GR has an RMSE of 0.68. This is pretty high. You can drop that a bit by using a constant K-factor (which makes GR a lot more Elo-like); that produced an RMSE of around 0.6. By playing around with a system similar to vDip's rating system, I was able to get the RMSE down to 0.44 - a vast improvement.
(for the pedantic: yes, I know measuring within-sample isn't the best way to evaluate a rating, and that measuring this way is measuring the "residuals" rather than "prediction error". But the result is still meaningful).
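The measurement itself is simple: pair up each player's expected score with their actual score across all games, and take the root of the mean squared error. A minimal sketch:

```python
import math

def rmse(expected, actual):
    """Root Mean Square Error between expected and actual scores (smaller is better)."""
    return math.sqrt(sum((e - a) ** 2 for e, a in zip(expected, actual))
                     / len(expected))

# e.g. a system that expected 1/7 for everyone in a 7-player DSS game
# that ended in a solo:
#   rmse([1/7] * 7, [1, 0, 0, 0, 0, 0, 0])  -> ~0.35
```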
However, even though systems like the vDip system are clearly an improvement, Elo-type rating systems aren't a good fit for Diplomacy. Elo isn't really designed to work with more than two players - which is why vDip's rating treats one Diplomacy game as a set of two-player games between all the players, a simplification with problems of its own, including new incentives like "kill the highest-rated player first". Elo also doesn't work well with a variety of outcomes - it prefers Win/Draw/Loss.
So, a new rating system for Diplomacy is an open research project. However! There's already a rating system that exists, is well respected, works for any number of players, is scoring system agnostic, and works for a variety of game outcomes.
The TrueSkill ( http://research.microsoft.com/en-us/projects/trueskill/ ) rating system would potentially work very well for our needs, and I believe would be measurably better than GR. It also has the advantage that it is designed to react well to changes in players' skill - something that happens all the time on webdip, as players improve very quickly over their first few games.
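To give a flavour of how TrueSkill differs from Elo: each player is a Gaussian (a skill estimate mu plus an uncertainty sigma), and every game both moves mu and shrinks sigma. Below is a toy two-player, no-draw version of the mean/variance update using the standard defaults (mu=25, sigma=25/3, beta=25/6); the real system handles any number of players, teams, and draws via a factor graph, so treat this purely as an illustration:

```python
import math

def pdf(x):
    """Standard normal probability density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def cdf(x):
    """Standard normal cumulative distribution."""
    return (1 + math.erf(x / math.sqrt(2))) / 2

def trueskill_2p(winner, loser, beta=25/6):
    """Toy two-player, no-draw TrueSkill update on (mu, sigma) pairs."""
    (mu_w, sig_w), (mu_l, sig_l) = winner, loser
    c = math.sqrt(2 * beta**2 + sig_w**2 + sig_l**2)
    t = (mu_w - mu_l) / c
    v = pdf(t) / cdf(t)   # mean-update factor (large for an upset)
    w = v * (v + t)       # variance-update factor (always shrinks sigma)
    new_w = (mu_w + sig_w**2 / c * v,
             sig_w * math.sqrt(1 - sig_w**2 / c**2 * w))
    new_l = (mu_l - sig_l**2 / c * v,
             sig_l * math.sqrt(1 - sig_l**2 / c**2 * w))
    return new_w, new_l
```

Note how an uncertain newbie's rating moves quickly and then settles as sigma shrinks - exactly the "reacts well to changing skill" property mentioned above. (For the multi-player case we'd want a full implementation, not this sketch; there's a well-known third-party `trueskill` package on PyPI, for instance.)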
Using an established rating system would *also* neatly sidestep religious wars over the precise implementation details of GR 2.0 (or whatever).
So - in summary - would you, as the community, support integrated ratings if we chose a different system than GR? Do you have any strong objections to TrueSkill? It is a near perfect fit for our needs.
The intention here is to replace the GR system with a system designed to produce more accurate ratings (which would give more meaning to the ratings), and to integrate the ratings (which would take ratings out of the cupboard of players who've been around for a while, and into the light where even newbies know what their rating is).
Thoughts and comments welcome, either here or to the mod email address,
[email protected].
The primary question is "Would you support site-integrated ratings if we moved away from GR?"