Rating Q&A
Q: How are team games rated?
QLStats doesn't care about teams. This may sound wrong, and to a certain degree it is. But there are reasons for this:
In general, public Quake Live servers do not enforce balancing. Making someone lose points for playing in a bad team is unfair.
The social aspect of rating team win/loss is that nobody wants to join a team mid-game when it is likely to lose.
The mathematical aspect is that win/loss of unbalanced teams adds volatility to the rating data and makes it less accurate for predicting results and matchmaking.
Aside from that, Glicko (and Elo) can only deal with 1-vs-1 outcomes, so splitting up the bounty between winners is not handled by the system. (There are approaches like an imaginary aggregated/averaged opponent, but that also has its flaws).
Therefore, qlstats rates players individually, regardless of teams.
Q: How are players rated after a match?
The first step is to filter out matches and players which don't meet certain requirements (round limit, match participation time, ...).
Then a score is calculated for each player's performance in the match. The exact formula depends on game type.
The score is adjusted for the player's match participation, either based on time or rounds where applicable. This is what you see in the "Perf" column.
Then the player list is sorted in descending order by score. Each player loses to all players above him and wins against all players below him, with a small margin to allow draws, which depends on game type.
Then, for each pair of players, their ratings and the outcome (win/loss/draw according to scoreboard position) are passed to the Glicko system.
Glicko does not care about the calculated scores; only win/loss/draw matters. Winning by 1 point or by 1000 is irrelevant. Your rating is adjusted by winning multiple times, not by winning once with a big margin. On Toxicity 20:0 wins are common, on Lost World not so much. The score is only used to put you in order with the other players in the same match.
Each multiplayer match is treated like n*(n-1)/2 duels. All the gains/losses are accumulated and then the ratings are updated.
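To illustrate the pairing step, here is a minimal JavaScript sketch; the draw margin and the simple participation scaling are placeholder assumptions, the real logic lives in gamerating.js:

```js
// Minimal sketch of the pairing step, NOT the actual gamerating.js code.
// drawMargin and the simple participation scaling are placeholders.
function pairwiseOutcomes(players, drawMargin) {
  // players: [{ name, rawScore, participation /* 0..1 */ }]
  // Adjust each raw score for match participation (the "Perf" column).
  const ranked = players
    .map(p => ({ ...p, perf: p.rawScore * p.participation }))
    .sort((a, b) => b.perf - a.perf);

  // Each match becomes n*(n-1)/2 virtual duels.
  const duels = [];
  for (let i = 0; i < ranked.length; i++) {
    for (let j = i + 1; j < ranked.length; j++) {
      const diff = ranked[i].perf - ranked[j].perf;
      // 1 = win for ranked[i], 0.5 = draw, 0 = loss for ranked[i].
      const outcome = Math.abs(diff) <= drawMargin ? 0.5 : diff > 0 ? 1 : 0;
      duels.push({ a: ranked[i].name, b: ranked[j].name, outcome });
    }
  }
  return duels; // each duel is then fed into the Glicko update
}
```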
Q: What does the rating value mean?
Glicko ratings consist of two values: the estimated rating "r" and a rating deviation "RD", which expresses the confidence/uncertainty in that estimate.
In general, RD gets lower with every match you play (except after very unlikely outcomes) and increases over time with inactivity.
So Glicko gives you a rating as an interval r±RD for a player's skill rather than a single number. It's like an estimate with a standard deviation.
To keep things simple, qlstats.net shows r-RD as the rating value, which puts the player at the lower side of the confidence interval.
The Glicko rating system distributes players and their ratings in a way that forms a normal distribution (Gaussian bell curve) in the range 0-3000. That makes it possible to say something like "if player A is rated 200 points higher than player B, he has a 68% chance of winning".
(The numbers here are examples; the exact values would have to be derived from the data first, but that's beyond my math skills.)
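For reference, the win expectancy that Glicko itself uses can be written down exactly; this is Glickman's published Glicko-1 math, shown as a sketch rather than the qlstats source:

```js
// Standard Glicko-1 win expectancy (Glickman's formulas), for reference.
const q = Math.log(10) / 400; // ≈ 0.0057565

// g(RD) discounts the rating difference by the opponent's uncertainty.
function g(rd) {
  return 1 / Math.sqrt(1 + (3 * q * q * rd * rd) / (Math.PI * Math.PI));
}

// Expected score of a player rated r against an opponent (rj, rdj).
function expectedScore(r, rj, rdj) {
  return 1 / (1 + Math.pow(10, -g(rdj) * (r - rj) / 400));
}

// Example: a 200-point gap against an opponent with RD 50:
// expectedScore(1700, 1500, 50) ≈ 0.76
```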
Q: My team won 10:3 but I lost points. This system is broken!
Not necessarily. If the teams were unbalanced to begin with, winning is easy. If you were overrated or other players underrated, it is normal that your rating goes down in a single match. Statistics works on averages and large data sets, not single data points.
Q: Why do I only gain 3 points but lose 50?
Same as above. Statistics works on averages and large data sets, not single events (match results).
If you are higher rated and expected to win 80% of the time, you will often gain a small number of points and occasionally lose a big amount, so on average you stay at the skill level the system estimated for you.
Normally ratings should not be applied after a single match, but over a rating period. Then you would not observe these jumps and would just stay at your estimated skill. But the jumps are the price to pay when you want to see immediate changes after each game instead of averaging over a week.
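As a rough illustration of why the gains and losses balance out, here is a simple Elo-style calculation; the K factor is a made-up example value, not the Glicko math qlstats actually uses:

```js
// Elo-style update: delta = K * (actualOutcome - expectedOutcome).
// K = 32 is a made-up example value; qlstats uses Glicko instead.
const K = 32;
const expected = 0.8; // favored to win 80% of the time

const gainOnWin = K * (1 - expected);    // +6.4 points
const lossOnDefeat = K * (0 - expected); // -25.6 points

// Long-run average change: 0.8 * 6.4 + 0.2 * (-25.6) = 0,
// so you hover around your estimated skill: small gains most
// of the time, an occasional big drop.
```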
Q: Why Glicko and not Elo, Glicko-2 or TrueSkill?
TrueSkill is considered "the best/only rating system for multiplayer games", but it is patented, and none of the several JavaScript implementations I tested produced correct results for even a single match. Only a Python implementation worked, but I don't speak Python.
All other rating systems were designed for 1-on-1 matches (chess) and require some workarounds to deal with multi-player games, which have an impact on data quality.
Elo is mathematically simple: it uses a single rating value per player, and the total number of points lost/gained in a match is always 0.
If an unrated Pro beats you in his first match, you'll lose a big chunk of points that you may need quite some time to win back.
Glicko improves Elo by adding the RD value for uncertainty. If there is low uncertainty in your rating but high uncertainty in your opponent's rating, your rating will change by a smaller amount than his.
Glicko-2 adds a measure of volatility on top of that, plus more magic parameters, and when I put it to the test with QL match data, the results showed some anomalies.
So Glicko-1 was the only viable solution for me with a working implementation and plausible results.
Q: Why are there no global rankings?
If you let 1000 bad players play only on one server and 1000 top players play on another server, and they never mix, then both groups will form a normal distribution in the range 0-3000 independently. So you have a 2200-rated Noob on one server and a 1300-rated Pro on the other.
For matchmaking purposes the numbers make perfect sense. They are good for the players on the Noob server and also for the players on the Pro server.
But you can't compare them. Only when they all play against each other and the ratings adjust to one another can you compare them.
For the same reason you can't compare AU, EU and US ratings, or the ratings of a closed-community European Vampiric PQL 8v8 CA server with those of the rest of the European CA players.
Q: Which matches are rated, how is the score calculated, ...
Match filters: https://github.com/PredatH0r/XonStat/blob/master/feeder/modules/gamerating.js#L108
Match/player filters: https://github.com/PredatH0r/XonStat/blob/master/feeder/modules/gamerating.js#L374
Score formulas: https://github.com/PredatH0r/XonStat/blob/master/feeder/modules/gamerating.js#L503
However, it seems like you've really thought about how a new stats page should be implemented, and you've improved the system a lot compared to qlranks.
I wonder if you've studied statistics before? If not, you have certainly spent a lot of time working with it!
Collecting stats is a necessary evil for matchmaking, and when I heard that other people had also started tracking stats, I pulled QLStats out of the drawer, hoping that XonStat's Elo rating system would be good enough. I'm still not a fan of rankings (ordering players by rating), but they were highly demanded, so I gave in.
I studied information systems and had statistics during my 2nd term at university - and it was the worst of all my grades :) I did a little research when I started implementing Glicko, but I still only have a very basic understanding of the math behind it. What I do know is that manipulating the rating numbers in any way would break the normal distribution and the probability model required for matchmaking.
I can tune the way I calculate scores and decide win/loss/draw between players, but that shouldn't be arbitrary either. Old and new formulas need to be compared by calculating the mean square error of expected vs. observed match results, and the one with the lower error would be the better choice, even if it's unpopular.
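A sketch of what that comparison could look like, with a hypothetical list of expected-vs-observed outcomes per rated pair:

```js
// Sketch: compare scoring formulas by the mean squared error between
// the predicted win probability and the observed outcome (1/0.5/0).
// `games` is a hypothetical array of { expected, observed } pairs.
function meanSquaredError(games) {
  const sum = games.reduce(
    (acc, g) => acc + Math.pow(g.observed - g.expected, 2), 0);
  return sum / games.length;
}
// The formula whose ratings produce the lower MSE predicts results
// better and would be the one to keep, popular or not.
```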
/cha0z_
However, I think this system will get many people upset (count me among them).
Let me list some points with respect to CA, where the score (as I understand it) is calculated as score = dmg/100 + 0.25*frags and compared across all players. This has some consequences:
- players who play individually for damage and frags are rewarded; no reward is given to teamplay. Teamplay in CA might actually result in a serious worsening of elo. This may happen when, for a large proportion of rounds, you get fragged at the beginning because a teammate failed to participate in the collective fight.
- comparing all players regardless of team means that if teams were unbalanced to begin with, you have a high probability of worsening your elo if you are on the lower-ranked team. In fact, it is much more difficult to do damage when the enemy is much stronger. Think of two teams made up this way: [2000, 2000, 1900, 1800] vs [2000, 1900, 1700, 1400]. We could assume the elos are "true", so the two guys with 1900 really do have the same skill. However, the 1900 guy on the lower-ranked team is likely to do LESS damage and fewer frags than the 1900 guy on the higher-ranked team. This means that his elo will go down. This is without even considering players' behavior (i.e. camping or teamplay). In this case we would need the elo system to weigh in the fact that teams were unbalanced, and hence avoid inflicting a strong punishment on the players of the lower-ranked team.
Think of it another way: in this scenario the current elo predicts the higher-ranked team to win, obviously, but the 2000 and 1900 players on the lower-ranked team are expected to achieve the same scores as their counterparts on the higher-ranked team. I think this is an unreasonable expectation.
All in all I think the worst thing about the current elo is that it considers individual performance in team games.
If a team-based elo calculation is not possible, I would go for a second-best option: a within-team elo system, meaning that my scores get compared to those of players on my team and not to those of the other team. I get punished if I do worse than lower-ranked players on my team, but not necessarily if I do worse than lower-ranked players on the other team.
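Something like this hypothetical sketch, where filtering the pairs by team is the only change to the current all-players comparison:

```js
// Hypothetical within-team variant: only build duels between players
// who were on the same team; everything else stays as it is now.
function withinTeamPairs(players) {
  // players: [{ name, team, perf }]
  const pairs = [];
  for (let i = 0; i < players.length; i++) {
    for (let j = i + 1; j < players.length; j++) {
      if (players[i].team === players[j].team) {
        pairs.push([players[i], players[j]]);
      }
    }
  }
  return pairs; // rate only these pairs instead of all n*(n-1)/2
}
```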
Finally, and I really do hope you read this, I would be willing to study the problem a little further to test some of the assumptions I've made. I am a PhD student in statistics, so these kinds of problems are inherently interesting to me. You can find me on Steam under the same name :-)
NLKM
Comparing only within the team would be possible. The question is whether that would really have much impact on the overall ratings. Sometimes you're on the stronger team, sometimes on the weaker one, and things will average out. The real problem is that servers should enforce balancing.
If you want I can give you access to all the rating data I have.
Another point: oftentimes, matches that start balanced end up not so balanced because a player just quits. Example: teams are balanced, one of them has a lower-skill player, and this guy quits after a few rounds. The substitute is high-skill, his team wins, his score will not count towards elo, but the losing team's scores will. This happened to me today: I did pretty well overall, but lost a sh*tload of elo.
By the way, the system right now works well when everyone plays thinking of the team, and if skill ranges are not too far off. It actually gives accurate predictions as to who will score how much.
Still, even if teams are balanced (which I take as a given), there are two borderline scenarios that I can think of.
(1) a player hides and plays alone; (2) there is an outlier (too low or too high skill) that messes things up.
Point (1) is also a matter of incentives: hiding would be much less profitable if losing a match meant no elo points could be gained.
I see point (2) as a matter of how scores are summarized when balancing teams, i.e. taking the team average. Now this is an interesting statistical problem (i.e. among all the distributions of team elos with the same mean, which one is most likely to win matches?).
Other ideas, based on the available data:
- consider time alive instead of number of rounds: the damage you do is spread over however many seconds it took you to do it. Hence hiding doesn't pay off, because you've wasted time and the damage you do counts less. I don't know if this would be possible (see the sketch after this list).
- count the number of "last-standings" a player receives.
But these two are negative-feedback options which might have unintended consequences as to which incentives they bring about :[
- a mix of the two: reverse-rank players based on the order in which they are fragged every round, sum those ranks at the end of the match, and apply a penalizing factor if the 'waiting score' goes over a threshold, or dynamically based on the overall ranking
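To make the time-alive idea from the first bullet concrete, a hypothetical sketch; the field names are made up, and whether the stats feed exposes time alive at all is exactly the open question:

```js
// Hypothetical time-alive weighting: spread damage over seconds alive,
// so surviving without fighting dilutes the score. Field names
// (damageDealt, secondsAlive) are made up for illustration.
function damagePerSecondAlive(player) {
  if (player.secondsAlive <= 0) return 0;
  return player.damageDealt / player.secondsAlive;
}
```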
Now this is all based on CA, and I'm not sure how CTF works, though I'm guessing that different team-behaviors should be rewarded similarly, even if they result in very different scores.
I would be really curious to take a look at all the data. Let's talk on Steam.
NLKM
OLD GLICKO - 1770 ± 46 GLICKO CHANGE - -2 / -2
You were shuffled based on that rating and after the match your rating was changed by -2 and your RD also by -2 (which means less uncertainty and smaller rating changes in the future).
Based on the calculated "Perf(ormance)" value: if you get beaten by, or are within the draw margin of, someone who had a worse old rating than you, you lose points. If you beat (or draw with) someone who had a higher rating, you gain points. All gains/losses are accumulated into the final Glicko change.
red skull incarnate 0:19:59 16 22 7 7546 4926 92 81 2091 ± 89 -12 / -9
22 frags / 7 deaths, second-highest damage, and you get a -12 glicko change, despite the team winning and no issues during the game. Another player with an elo matching mine got a positive glicko change with about 500 more dmg than me.
Again, the rating does not care at all about teams. All players, regardless of team, are compared to each other. If anyone who had a lower rating than you delivers a better game performance (Perf value), you will lose points to that player. At the same time, when you outperform a higher-rated player, you will gain points. The actual amount depends on the rating difference and on the confidence in the ratings (RD value).
There is a small margin for draws, so if you and a lower-rated player are within +/- 2 performance points of each other, it counts as a draw and you will still lose points, but fewer than for an outright loss.
So, all players on the server are compared, the gains and losses accumulated, and that is your final update for a match.
In the Ratings Q&A thread you can find more information about the ratings. Currently qlstats doesn't care about teams at all and the reasons for that are also explained there.
Qlstats uses the Glicko system, which has very clear update rules. When you tamper with the inputs, the outputs or the rules, the whole system becomes inconsistent and, despite the best intentions, things will only get worse.
The data QL provides about a match includes "damage dealt" with uncapped values (hitting a 1hp player with a rail counts as 80 damage) but "damage received" with capped values (if you have 1hp and get hit by a rail, you receive 1 damage). Of course it would be better if the game provided both in a consistent way, but I can't change that.
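To illustrate the inconsistency, a tiny sketch of how the two stats diverge on a single rail hit:

```js
// A rail shot (80 damage) hits a player with only 1 hp left.
const railDamage = 80;
const targetHealth = 1;

const damageDealt = railDamage;                            // uncapped: 80
const damageReceived = Math.min(railDamage, targetHealth); // capped: 1

// The attacker's "damage dealt" grows by 80, the victim's
// "damage received" by only 1, so the two columns don't add up
// across a match.
```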
Best regards
Sure_Shot