Race strength estimation, revisited

Started by Duplode, December 27, 2020, 10:52:35 AM


Cas

Yes!  I think it'd be nice to have ratings like these in our profiles in the Wiki. In R4K, I've introduced a scoring system just at the start of this season. It's super basic. It'd be nice to have something like Elo ratings inside the profiles there too. On the other hand, Elo ratings are very dynamic. I would like to see if it's possible to categorise the variation of a pipsqueak's rating: knowing that everybody's rating follows more or less the same shape as it evolves, only sometimes wider or narrower, sometimes taller or flatter, we could describe an "envelope" curve and thus define a constant rating, that is, a "projected" pipsqueak rating. Running this analysis for two very different sets of rules, such as those of ZakStunts and R4K, and comparing the two would surely give interesting information.

About RH/NoRH, I think that taking the best position regardless of style is representative of RH, because if one surpasses one's own RH lap with a NoRH one, it means one could have done it RH-style too. So I don't think there will be any distortion, and the calculated information should be accurate for RH. On the other hand, if we want to analyse NoRH, we should use only NoRH laps. The problem there is that there are far fewer of them.

The current, very simple scoring system in R4K does precisely what Duplode did: take the best position for each pipsqueak, no matter the style. 2 points for participating and getting at least one verified replay, plus 1 to 3 extra points for reaching a podium position in a race. I could add an extra point for posting a NoRH lap, but I think it's not good to encourage NoRH, because that also encourages cheating, as there is currently no good way to check for NoRH. Most of you will think "But we're all good guys and good friends of one another"... yeah, until one is not and we can't tell who, ha, ha. This could be compensated with a "trustworthiness handicap index" XD
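
In code, the rule is tiny. A minimal sketch (the 3/2/1 podium split is an assumed reading of the "1 to 3 extra points"):

Code:
def r4k_race_points(position, has_verified_replay):
    """Points for one pipsqueak in one R4K race, per the rule above:
    2 points for taking part with at least one verified replay, plus
    extra points for the podium (3/2/1 for 1st/2nd/3rd is an assumed
    split of the "1 to 3 extra points")."""
    if not has_verified_replay:
        return 0
    podium_bonus = {1: 3, 2: 2, 3: 1}
    return 2 + podium_bonus.get(position, 0)
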
Earth is my country. Science is my religion.

Duplode

#16
Quote from: Overdrijf on January 03, 2021, 12:08:22 AM
Just knowing I have a Stunts elo now makes me nervous about losing points of it.  ;)

I wouldn't read too much into them. As Cas notes, the ratings are very dynamic:

[chart: pipsqueak Elo rating histories, fluctuating markedly from race to race]
In fact, they probably fluctuate a lot more than the usual chess (or, say, football) Elo ratings. (Keep in mind that, in my implementation, taking part in a race might involve a pipsqueak in up to twelve matches, depending on where they land on the scoreboard.) I feel the ratings probably should fluctuate less wildly; however, if I attenuate the changes too much, it takes forever for new pipsqueaks to reach a meaningful rating. That is one of the difficulties of trying to shoehorn Elo into a free-for-all format. In any case, individual ratings should at this point be taken primarily as a measure of recent form; attaching further meaning to them would require extra rounds of thought, and potentially some tuning.
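
For illustration, here is a minimal sketch of what reducing a race to pairwise matches can look like. This is not my actual engine; the all-pairs pairing and the K-factor are placeholder choices (as noted above, the real thing limits how many matches a race generates):

Code:
def elo_expected(r_a, r_b):
    """Standard Elo logistic curve: probability that A beats B in a 1v1."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def update_after_race(ratings, scoreboard, k=24.0):
    """Treat a finished race as a bundle of 1v1 matches: each pipsqueak
    'beats' everyone below them on the scoreboard. With n participants,
    everybody is involved in n - 1 matches at once."""
    new = dict(ratings)
    for i, winner in enumerate(scoreboard):
        for loser in scoreboard[i + 1:]:
            delta = k * (1.0 - elo_expected(ratings[winner], ratings[loser]))
            new[winner] += delta
            new[loser] -= delta
    return new

# e.g. update_after_race({"A": 1500, "B": 1500, "C": 1600}, ["A", "C", "B"])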

Quote from: Cas on January 03, 2021, 12:39:55 AM
The current, very simple scoring system in R4K does precisely what Duplode did: take the best position for each pipsqueak, no matter the style. 2 points for participating and getting at least one verified replay, plus 1 to 3 extra points for reaching a podium position in a race.

I have only now seen the season scores on the R4K site. That's pretty nice!  :) And I agree on not adding NoRH extra points in a mixed scoreboard race.

dreadnaut

Have you been here already, Duplode? https://arxiv.org/pdf/2008.06787.pdf

Duplode

#19
Quote from: dreadnaut on January 03, 2021, 06:18:57 PM
Have you been here already, Duplode? https://arxiv.org/pdf/2008.06787.pdf

No, I hadn't; thanks for pointing me to it! This paper was posted on arXiv in August, while most of my research was done between April and May. I see at least three nice things in it:

  • An evaluation of various metrics for judging rating systems, which could be a huge boon when it comes to tuning the system parameters.
  • A concise presentation of TrueSkill, which I might want to try adapting at some point in the future.
  • The way the results suggest all systems are wrong, only in different ways. Besides being psychologically reassuring, that squares with my observation from May that replacing my Elo engine with a Glicko implementation didn't actually improve things.
One interesting detail I noticed is that the paper's adaptation of Elo to free-for-all, described in section III.A, involves calculating winning probabilities by summing pairwise victory probabilities against all pipsqueaks and dividing by the number of pairwise combinations. That looks like a pretty rough approximation! (If I'm reading it right, it's as if each pairwise match were equally likely to decide the winner, which might not be too far from the truth if everyone's ratings are close to each other.) At any rate, unless I'm missing something, it doesn't seem much more sophisticated or principled than whatever it is I have been doing here. It's a tricky problem, really.
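
In code, my reading of that approximation would be something like this (a sketch; elo_expected is the usual logistic curve):

Code:
def elo_expected(r_a, r_b):
    """Standard Elo logistic curve: probability that A beats B in a 1v1."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def ffa_win_probability(player, ratings):
    """The paper's free-for-all adaptation, as I read section III.A:
    sum the pairwise victory probabilities of `player` against every
    opponent and divide by the number of pairwise combinations,
    n * (n - 1) / 2. Since p(a beats b) + p(b beats a) = 1, the
    players' win probabilities then add up to exactly 1."""
    n = len(ratings)
    pairs = n * (n - 1) / 2
    return sum(elo_expected(ratings[player], ratings[opponent])
               for opponent in ratings if opponent != player) / pairs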

Daniel3D

Quote from: Duplode on January 03, 2021, 09:02:55 PM
(If I'm reading it right, it's as if each pairwise match were equally likely to decide the winner, which might not be too far from the truth if everyone's ratings are close to each other.)

I noticed that too when skimming through the article.
My thought was that they assumed single races. Free-for-all: anyone can participate, so anyone can win. The skill of the player is unknown and therefore cannot be a factor in the formula.

But I'm not sure.
Edison once said,
"I have not failed 10,000 times,
I've successfully found 10,000 ways that will not work."
---------
Currently running over 20 separate instances of Stunts
---------
Check out the STUNTS resources on my Mega (globe icon)

Duplode

Quote from: Daniel3D on January 03, 2021, 09:55:46 PM
My thought was that they assumed single races. Free-for-all: anyone can participate, so anyone can win. The skill of the player is unknown and therefore cannot be a factor in the formula.

Yup, and there lies the gap that's hard to bridge. One-to-one comparisons between pipsqueaks are nice and easy to make calculations about, but they don't reflect very well what goes on in a free-for-all match. (If you compete against five pipsqueaks in a race, your performance doesn't have five degrees of freedom, but only one, namely your laptime.)

(The most principled-looking approach I have seen yet to handling free-for-all matches in their own terms, without trying to reduce them to 1v1 matches, is the one in this Glickman paper. The math there, however, is a little intimidating, and I couldn't bring myself to sit down and digest it properly just yet. In the meantime, I keep throwing approximations at it to see what sticks.)

Cas

Just a thought. If you consider a race to be defined by its final scoreboard, then indeed, your final lap time is your only degree of freedom. But I'm thinking that if a race is the period during which you can post replays, look at the scoreboard, download other people's replays, compare strategies, post again, decide when to post your replay taking into consideration how other pipsqueaks will react, etc., then there is definitely at least one more degree of freedom here. One might argue this is irrelevant because in chess, you can compute a player's rating based only on the final result of the match, even though during the game there can be a reaction at every move; but the truth is that a player does not always try his/her best and does not always choose the same strategy. If you have the ability to make your opponent change strategy during the game, then we could see that opponent as a "set of opponents" instead, for each of which we have a degree of freedom.

This may sound pretty crazy, but I'll extrapolate this in a purely philosophical way towards physics. Would it be accurate to call these extra degrees of freedom "internal"?  And if so, can we, in the same way that we ignore internal forces within a floating object, or the way charges inside a Faraday cage remain isolated from the environment, just leave these freedoms out?
Earth is my country. Science is my religion.

Duplode

Quote from: Cas on January 03, 2021, 11:31:09 PM
This may sound pretty crazy, but I'll extrapolate this in a purely philosophical way towards physics. Would it be accurate to call these extra degrees of freedom "internal"?  And if so, can we, in the same way that we ignore internal forces within a floating object, or the way charges inside a Faraday cage remain isolated from the environment, just leave these freedoms out?

Yup, that makes sense. A bare Elo rating abstracts away all those internal degrees of freedom. To put it another way, if a strong pipsqueak only has time for five-minute listfillers for a whole season, the other pipsqueaks will eventually come to expect the listfillers, regardless of what could be possible in more favourable conditions. Perhaps there is a way to split ratings into a long-term baseline and a more volatile part by somehow accounting for "internal" factors such as pipsqueak activity; that territory has yet to be charted.

Duplode

#24
I have implemented the NDCG metric from the paper dreadnaut had suggested. (NDCG is one way to measure how close the results you might have predicted from the ratings before the race were to what actually happened, with a value of 1 meaning perfect agreement, and accuracy at higher positions on the scoreboard being given extra weight.)
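
For reference, here is a sketch of the calculation (the linear relevance assignment is one common choice, not necessarily the paper's exact one):

Code:
import math

def ndcg(predicted_order, actual_order):
    """NDCG of a predicted scoreboard against the actual one. Relevance
    decreases linearly with actual position (the winner matters most),
    and the log2 discount makes mistakes near the top of the predicted
    order cost more than mistakes near the bottom. Returns 1.0 when the
    predicted order matches the actual one exactly."""
    n = len(actual_order)
    relevance = {p: n - i for i, p in enumerate(actual_order)}
    dcg = sum(relevance[p] / math.log2(rank + 2)
              for rank, p in enumerate(predicted_order))
    ideal_dcg = sum(relevance[p] / math.log2(rank + 2)
                    for rank, p in enumerate(actual_order))
    return dcg / ideal_dcg

# e.g. ndcg(["A", "B", "C"], ["B", "A", "C"]) ≈ 0.92, a swap at the top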

As one might expect, the NDCG for ZakStunts races fluctuates wildly:

[chart: NDCG per ZakStunts race, unsmoothed]
An exponential smoothing of the time series, though, shows that for the last several years of ZakStunts the fluctuations have been around 0.55, which is in the same ballpark as the results in the paper:

[chart: exponentially smoothed NDCG, hovering around 0.55 in recent years]
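
(The smoothing itself is nothing fancy; a plain exponential moving average along these lines, with the smoothing factor as a tunable knob:)

Code:
def exponential_smoothing(series, alpha=0.1):
    """Plain exponential moving average: each smoothed value blends the
    newest observation with the previous smoothed value, so the weight
    of older races decays geometrically."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed
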
One thing I would like to achieve with NDCG is using it to fine-tune the parameters of my Elo engine. While parameter adjustments lead to localised changes in NDCG here and there, the overall effect on the timeline as a whole is very small. I have also attempted to use predicted positions obtained from simulations instead of just using the order of the ratings (I suspect that may ultimately give more informative NDCG results), but as far as the influence of Elo parameters is concerned, that made no difference. To make progress on this front, I may also have to turn to the questions Cas has been raising here and try to think of a metric for how reasonable the evolution of individual pipsqueak ratings is.

Cas

Besides individual pipsqueak evolution, looking at the first graph in your last post, with its fluctuations, makes me think about track characteristics. When a measurement of a certain function of time f(t) against t looks this wild in a graph, my first thought is "maybe t is not my variable". In other words, these fluctuations suggest that races have qualitative (or maybe also quantitative) aspects that change from race to race, but that do not "evolve" with time; instead, they depend on something else. It may be not just interesting but perhaps even useful to analyse the track characteristics for each race, putting into separate baskets the ones that gave high and low values respectively, both for strength itself and for the precision of the strength measurement. I think the odds are high that we'd find a pattern. It may be as simple as complex tracks vs simple tracks, or long tracks vs short tracks.

It's 3:00am right now. Otherwise, instead of suggesting this, I would first go and check the tracks myself and then directly post my first impression :P
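
For when someone does get to it, a first pass could be as simple as this sketch (the NDCG threshold and the track descriptors are made up for illustration):

Code:
def compare_track_baskets(races, threshold=0.55):
    """races: list of (ndcg, length, n_track_elements) tuples, with the
    two descriptors standing in for whatever track characteristics we
    end up measuring. Splits races into high- and low-NDCG baskets and
    returns the mean descriptors of each, to eyeball for a pattern."""
    high = [r for r in races if r[0] >= threshold]
    low = [r for r in races if r[0] < threshold]
    def mean_descriptors(basket):
        n = len(basket)
        return tuple(sum(r[i] for r in basket) / n for i in (1, 2))
    return {"high": mean_descriptors(high), "low": mean_descriptors(low)}
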
Earth is my country. Science is my religion.

Duplode

Quote from: Cas on January 16, 2021, 07:15:27 AM
When a measurement of a certain function of time f(t) against t looks this wild in a graph, my first thought is "maybe t is not my variable".

Though I haven't gotten to the qualitative part yet, your suggestion made me have a closer look at a few selected points in the graph. That ultimately led to a course correction, so thanks for getting me back on track  :) More specifically, I realised that:


  • Due to a programming mistake, my NDCG code only accounted for newbies from their second race onward, messing up the calculations for races in which there were debuts. Fixing that removed a lot of artifacts from the results (I have replaced the charts in my previous post; if you look at them now you'll see the fluctuations got noticeably tamer).
  • After the fix, comparative tests with varying Elo parameters quickly showed that the NDCG as specified in the paper is not as robust as we'd like it to be. The main problem is that predicting race results simply by putting ratings in order means we don't distinguish a pipsqueak coming out ahead against a rating gap of, say, 1 point from doing the same against a 100-point gap. That being so, if ratings are close, negligible fluctuations can have a disproportionate effect on the NDCG.

Those observations have rekindled my interest in using simulations to obtain the predicted results for the NDCG calculations, as a way of dealing with the robustness issue (for instance, the average positions of two pipsqueaks over a batch of simulations will be much closer if they have a 1-point gap than in the case of a 100-point gap). Here are some charts, without and with smoothing:

[chart: NDCG from simulation-based predictions, per race]

[chart: the same series, exponentially smoothed]
Quite a bit better. I don't think this alternative strategy is flawless, either (for one, it probably underrates predictions for races in which ratings are on the whole close to each other), but I'm far more confident about the results now.
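
Roughly, the simulation strategy looks like this (a sketch; the Gaussian performance noise and its spread stand in for whatever model the engine actually uses):

Code:
import random
from collections import defaultdict

def predicted_positions(ratings, n_simulations=10000):
    """Predict scoreboard positions by simulation rather than by simply
    sorting the ratings: in each simulated race every pipsqueak performs
    at their rating plus random noise, and we average their finishing
    positions. Pipsqueaks separated by 1 rating point end up with nearly
    equal average positions; a 100-point gap separates them clearly."""
    totals = defaultdict(float)
    players = list(ratings)
    for _ in range(n_simulations):
        performance = {p: ratings[p] + random.gauss(0.0, 200.0)
                       for p in players}
        board = sorted(players, key=performance.get, reverse=True)
        for position, p in enumerate(board, start=1):
            totals[p] += position
    return {p: totals[p] / n_simulations for p in players}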

Cas

Oh!  Yes, I definitely see a difference now, and even without the smoothing, I can see a t-based variation there!  The noise is not so wild now, so most of it (or all of it) could be accounted for by chance alone. That is, track qualities surely still have an impact, but it's not as significant as the first graphs led me to think. I can't help but see a very marked change around C75. The whole "attitude" of the curve changes from there on.... The "Paleo-ZakStunts" and the "Neo-ZakStunts"   ???
Earth is my country. Science is my religion.

Duplode

Quote from: Cas on January 17, 2021, 07:45:10 AM
I can't help but see a very marked change around C75. The whole "attitude" of the curve changes from there on.... The "Paleo-ZakStunts" and the "Neo-ZakStunts"   ???

I can think of a few factors that may cause that, the main one perhaps being the large fields typical of 2002-2006 races. Position prediction errors are more likely with more pipsqueaks involved, and my alternative strategy for computing the NDCG probably exacerbates that. It might be possible to offset that to some extent by further increasing the weights given to higher positions in the NDCG.