lichess.org

Ratings Are Broken

As a 1550 player, I can confirm that playing against 1800-2000 rated players is often easier than playing against lower-rated players.
However, I propose a different solution: extending the period of K=40 and introducing K=60 for players under 15. This would allow new players to avoid being underrated by the time they reach K=20.
@Parzival_2 said in #41:
> As a 1550 player, I can confirm that playing against 1800-2000 rated players is often easier than playing against lower-rated players.
> However, I propose a different solution: extending the period of K=40 and introducing K=60 for players under 15. This would allow new players to avoid being underrated by the time they reach K=20.
The K-factor cuts both ways: the bigger it is, the bigger the swings. We should avoid a situation where somebody gains 400 points one month and loses 300 the next. Yes, your proposal could counter the overall deflation, but we don't want to create new problems in the process.
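To make the swing size concrete, here is a minimal sketch of the standard Elo update rule (expected score plus a K-scaled adjustment), showing that the per-game rating change is directly proportional to K:

```python
def elo_expected(r_a, r_b):
    """Expected score for player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(rating, opponent, score, k):
    """New rating after one game; `score` is 1, 0.5, or 0."""
    return rating + k * (score - elo_expected(rating, opponent))

# A 1500 player beating a 1500 opponent gains exactly K/2 points,
# so doubling or tripling K doubles or triples every swing:
for k in (20, 40, 60):
    print(k, elo_update(1500, 1500, 1, k) - 1500)  # 10.0, 20.0, 30.0
```

This is why a large K helps underrated improvers climb quickly but also makes every established player's rating noisier.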
The blind-statistician credo might still exist. A statistician who is also a modeler might be a difficult thing to find. But if a blind statistician can notice an effect, I think the modeler should take it into account.
why has it taken a statistician this long to show what we all knew when i was a junior 40 years ago? the elo system is deflationary for 2 main reasons. firstly, as was correctly stated in the report, underrated PLAYERS (and i don't only include juniors here; there is an australian IM named aleks wohl who only started playing when he was 16 or 17) come into the pool and stay around playing for years and years before retiring at a higher rating than they started with. secondly, players who are not inclined to the game come along, play a handful of tournaments, lose a few hundred points with bad results, and stop playing. the imbalance, due to the nature of the calculations in the elo system, has to come from somewhere. the evidence is in every single rating list published since ratings with elo began: how many more 2100s are there in those lists than people with a negative rating?!

glicko is a derivative of the elo system that tries to approach this problem by adding a time factor, such that not playing for a long while leads to a bigger change in rating when you return. this is decidedly NOT the final answer, as some good players stop playing for a while and then gain more points than they should, ending up with higher ratings than they deserve, but it does address deflation by not having a constant pool of 1000 points per registered player.

in australia, a lot of people are underrated. i myself grew up in a period that was fortunate (for chess in australia at the time): the soviet union collapsed and we had a large influx of great (read: over 2000 rated, including some titled) players from former soviet lands. this was great for my chess because, even though i never played many tournaments, i could watch and play with these people, and in so many cases their passion for the game was catching. my otb rating stayed at 1100-something for many years though, until i went to europe. i don't know what my rating was there, but i again met a lot of players, playing regularly on the maxeuweplein in amsterdam, and when i returned to australia, i played a couple of tournaments and gained 300 more points, decidedly no longer a junior.

to this day, my otb rating sits at 1400-something with a double question mark, and my AGE is ... well, i reached the half century a little while back. my actual strength isn't known otb. the ratings i have here on lichess are probably close, but there also seems to be a decently sized gap between ratings here and otb. indeed, again when i was a junior, 1100 ACF seemed to be close to 1500 USCF, with the gap closing as the rating got closer to FM level. i don't know whether that still holds today, but at the time i was fortunate enough to be able to play on a giant chessboard set up in melbourne city proper against many players, with chessplaying tourists enjoying the social aspect of playing out in the middle of the street and shooting the breeze. good times, and playing strong players, but not playing tournaments, though i definitely did know the rules, and in fact even had a rating handbook for some time, so i knew how to calculate elo as well as the complete fide rules of chess as they were published in 1989.

i can't say i know the absolute solution to the deflationary nature that comes with people generally leaving with a higher rating than they started with, but glicko, being a non-constant rating pool, approaches an answer with its time-factor involvement. any system with a constant per-player pool is by nature deflationary, as players leave with a higher rating than they started with. at some point this will have to average out, but it would be quite funny to me, knowing i was an 1100-rated player in aus as a junior, to think of a day when the gm norm is 1400, my current otb rating, because deflation has gone so far that people playing for fm norms are just coming into the pool and being given 1000 points!
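The zero-sum bookkeeping behind this argument can be sketched directly: games only move points between players, never create them, so if entrants arrive with a fixed stake and leavers depart above it, the average of those remaining must fall. A toy illustration (the 1000-point entry value is just the figure used in the post above; the specific numbers are invented for the example):

```python
# Toy closed-pool illustration of Elo deflation. Ratings only move
# between players, so the total is conserved by games.
pool = [1000, 1000, 1000, 1000]   # four entrants at the notional entry rating

# Games redistribute points until one player has improved a lot.
# Same total as before: 4000 points.
pool = [1600, 900, 800, 700]

# The improved player retires, taking out 1600 points from a pool that
# only ever received 1000 points on their behalf:
pool.remove(1600)
print(sum(pool) / len(pool))      # average of those remaining: 800.0
```

The 600 "missing" points are exactly the improver's gain over their entry stake, which is the deflation mechanism the post describes.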
I am also a full-time data scientist. You cannot learn anything from an Elo number alone.
"Your findings need to present a solution, not a problem."
The point is to VISUALIZE these findings for action.
Especially when you are in the common-sense portion of CI/CD.
And with Lichess, we all know what factors are at play. There is no changing the framework or nature of the site at this stage.

THEREFORE: instead of an overhaul to ratings, I suggest:
The PROFILE stats page needs a BIG overhaul.
Show the important stats with visuals and icons under the profile.
Create a site-wide profile page for teams and groups as well, for efficacy.
Example //
[%Lost advantage] Do they hold an advantage or go desperado? Categorized by personality icon. %Sac.
[%Futility and contempt] Yes, you can legitimately tell the actual percent.
[%Wins] over higher rated, color coded [+200, +100, +50].
[String count of losses] in a row, to higher and lower rated, per session, with red next to the ones that bias the curve.
[%Accuracy in 3 sections] opening, middlegame, endgame, color coded. (Do they blunder a middlegame position and still win?)
[%Blunders in 3 sections] When do they blunder most? Opening, middlegame, endgame. Per rating category.
[%Wins after a blunder] over higher rated.
[%Change in opening play] based on losses. (Do they keep playing the same system regardless, then suddenly win? Do they go on tilt and string 5 losses in a row?)
[%Key-square play accuracy] over wins. (Many winboard attacking strategies will blindly use the same squares regardless of a loss and ditch pawns, then still win... smh.)
[%Overvalue per piece] Percent of wins against higher rated despite holding on to a bad bishop, for example.
[%Blockade or pawn-structure mindfulness] Known Carlsbad and other pawn structures hold 'weight'. You can also count the times these structures are obtained per player. Wins despite poor structure. Or possibly a London player who consistently loses his structure in the middlegame, or does not know how to break through.
[%Wins in lost positions] color coded under rating.
[%Endgame power-play] Percent of EGTB accuracy with 4-5-6 pieces left that results in a win. Whether both players play poorly under time pressure, or one person is a flagging specialist.

I can think of 20 more, but the point is to VISUALIZE this stuff on the profile page. They already do it for Tactics and Puzzles.
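As a sketch of how one of these stats could be computed, here is the "string count of losses in a row" idea applied to a per-session result list. The input format and function name are hypothetical illustrations, not a Lichess API; a real version would read the games export.

```python
def longest_loss_streak(results):
    """Longest run of consecutive losses in one session.

    `results` is a sequence of 'W', 'D', 'L' outcomes in the order
    played (hypothetical input format for illustration).
    """
    longest = current = 0
    for outcome in results:
        current = current + 1 if outcome == "L" else 0
        longest = max(longest, current)
    return longest

print(longest_loss_streak("WLLLWDLL"))  # -> 3
```

The same pass could be split by opponent rating band to get the "to higher and lower rated" variant proposed above.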

My stats on Lichess, computed using Python, Wolfram tools, and SCID, have shown me that this is ridiculous: Elo, and how you get it, doesn't matter, "if all other stats remain the same". +200 in Glicko, or +200 FIDE, is still +200 site-wide. What will you do with this rating? What is its significance?
"Measuring centipawns is a better argument."

My honest rating OTB is 200 points higher; I'm crushing 2400 players in the park and at the coffee shop.
But apparently on Lichess this is considered absurd, because everyone on Lichess is far higher rated than OTB. I don't think so... smh. Especially when my tactics rating is right on point.

Maybe they could employ a "Qualifier": have players do a 30-second tactics buffer, randomly each week, before live matches. That's your rating for the day.
Have it launch in the middle of a session when clicking 'New Opponent', just like a captcha, every so often. I GUARANTEE A RATING DROP.
@robscat said in #44:
> why has it taken a statistician this long to show what we all knew when i was a junior 40 years ago? [...]

Because my previous post was epic: just like a captcha.
Maybe they could employ a "Qualifier": have players do a "30-second tactics buffer", randomly each week, before live matches.
Factor this into your overall rating.
Have it launch in the middle of a session when clicking 'New Opponent', just like a captcha, every so often.
This way, if you string together a few losses but do well on the tactics buffer, you won't lose out.
I just needed the first line to see whether the blog is true or not.
Even as per psychology, relatively weaker players tend to win more than relatively stronger players.
@robscat said in #44:
> glicko is a derivative of the elo system that tries to approach this problem by adding a time factor, such that not playing for a long while leads to a bigger change in rating when you return. this is decidedly NOT the final answer, as some good players stop playing for a while and then gain more points than they should, ending up with higher ratings than they deserve, but it does address deflation by not having a constant pool of 1000 points per registered player.

It does more than that. It does not impose the constraint of knowing what the population distribution of ratings in the pool is.

I think it can only estimate some emergent moments (average, deviation) of the population distribution, given some parameter choices about the individual players' rating-uncertainty distributions and how the game events interact over time (and, as you mention, some model of that time effect on such uncertainty, or is it certainty, or just belief; fix my words if needed). That might let implementations choose initial conditions with some control there, but it does not place function-space limitations, of that number of parameters, on the emerging population distribution. Within that approach there could also be other individual models and time-evolution scenarios, I would think.

My point is that the lack of a restricted model family for the population distribution might leave room for population-dynamics modeling at the population-distribution complexity level (as in how many parameters might be needed to describe it within some family, which we don't have a clue about without letting the informative data changes show us, if we wanted such a model restriction). Using an a priori limited space of such functions, as Elo does, would force more distortions, by my intuitive reasoning above: don't press too hard on the pressure cooker's lid; allow the dynamics to be where they need to be, not propagated from that population constraint down to individual rating estimates. Something like that (if my words make sense, and if I did not distort what I read some time ago, i.e. if memory digested serves).
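For concreteness, the time mechanism being discussed is Glicko's rating deviation (RD), which grows while a player is inactive: before each rating period, RD is replaced by min(sqrt(RD^2 + c^2 * t), 350), where t is the number of inactive periods and c is a tuning constant chosen by the implementer. A minimal sketch of that step, using the illustrative c from Glickman's write-up (chosen so an established player returns to maximum uncertainty in roughly 100 periods); exact constants vary by implementation:

```python
import math

def inflate_rd(rd, periods_inactive, c=34.6, rd_max=350.0):
    """Glicko pre-period step: uncertainty grows during inactivity.

    c is an implementation tuning constant; 34.6 is only an
    illustrative value, not what any particular site uses.
    """
    return min(math.sqrt(rd ** 2 + c ** 2 * periods_inactive), rd_max)

# A well-established player (RD = 50) who stops playing becomes
# progressively more uncertain, so their next results move them more:
for t in (0, 10, 100):
    print(t, round(inflate_rd(50, t), 1))
```

Because the update weight depends on RD rather than on a fixed K, the "points pool" per player is no longer constant, which is the property under discussion.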
@dboing said in #48:
> ...not propagated from that population constraint down to individual rating estimates...

this is actually a point that hasn't been made yet in this discussion. dr. elo never assumed that elo ratings would ever exactly show someone's strength; quite the opposite. while it is a somewhat useful indicator of strength, it is designed to always be an approximation. that is to say, if someone is having a bad day/week/month/year compared to someone else, it will reflect in their rating dropping; likewise, if they are having a good day vs. that other someone, their rating will increase. those increases and decreases are relatively even for players they are expected to score around 50% with, and skewed where they aren't expected to score near 50%. (this is all common knowledge, but seems to have been mostly overlooked in this discussion, so i'm just reminding people of it, not assuming it's not known.)

there is another factor at play when considering elo ratings, and it always holds true, whether on the internet or otb, in a small group or a large one: a smaller pool will always have a smaller variance in ratings, simply because the number of points needed to establish a wider base isn't there. that is to say, where someone in a pool of 1000 players with average rating x might have a rating of x+200, that same person in a pool of 10000 players might have a rating of x+400. this is not a reflection of the inaccuracy of the system, but rather of the finer grading possible with a larger total player pool. nor does it say that either number is right or wrong; merely that the more players there are within a pool of ratings calculated with the elo system, the finer the grading will be, and by force, this means the range of ratings will be wider.

also note that, if the elo system is used in its purest form, a player with no score can never gain a rating. ever. there need to be scores by both sides before a rating is even possible, though in practice with k=15 (most national rating lists based on elo), the widest realistic variation between two players is 774, due to the requirement of a nonzero score on both sides and the unlikelihood of anyone ever having more than 200 rated games against a single opponent that has never played someone else. to make the point a different way: if someone never plays against a 1400, but has always played against gms of 2400 or more, they will always have a rating of 1626 or more if they have a nonzero score, regardless of the actual relative strength of that 1626 player and the one at 1400.

as if things weren't complex enough with players themselves varying in performance from day to day or week to week, the swiss pairing system often pits the top half of the field against the bottom half in the first round, and in most fields that halfway mark is somewhere around 1800. thus, in a weekend tournament of only 5 or 7 rounds, being at the bottom of the top half often means an easier tournament draw than being at the top of the bottom half, so it's not so uncommon to have 'weak' 1800 players and 'strong' 1600 players, just because they play largely different fields even within the same tournament. this takes nothing away from the 1800s, since they had to fight their way through the same situation to get there, but since elo is only an estimation at best, and since the draw is naturally harder at the top of the bottom half than at the bottom of the top half, it's often impossible to say who is actually the better player anyway.
@dboing:

> It does more than that. It does not impose the constraint of knowing what the population distribution of ratings in the pool is.

Sorry if I misunderstand you, but doesn't Glicko assume ratings are normally distributed?