
Title: Race strength estimation, revisited
Post by: Duplode on December 27, 2020, 10:52:35 AM

For a long while, one of my favourite Stunts investigation topics has been the evaluation of race strength, be it merely for the enjoyment of historians and pundits, or to inform some kind of spiritual successor to Mark L. Rivers' SWR Ranking (https://digilander.libero.it/stunts.SDR/SWR/SWR.htm). Some of you might even remember my 2012 thread on the matter (http://forum.stunts.hu/index.php?topic=2735.0). Now, after a long time with this project on the back burner, I have made enough progress to feel like posting about it again. So, without further ado, here is a plot of race strengths covering the 236 ZakStunts races so far:

(https://i.imgur.com/st3JfQl.png)

(Attached below is the Excel file this chart belongs to, so you can have a closer look at the data.)

While this chart might look rather like the ones I showed you years ago (http://forum.stunts.hu/index.php?topic=2735.msg67944#msg67944), there is one major difference: this time, a clearer procedure to obtain the data has led to values that are meaningful on their own. For instance, consider the massive spike you see just left of the middle of the chart. That is ZCT100 (http://zak.stunts.hu/tracks/ZCT100), whose strength is around 70. According to the model underpinning the calculations, that number means a pipsqueak of Elo rating 1500 (which generally amounts to lower midfield) would, if they joined ZCT100, have a 1 in 70 chance of reaching a top five result on the scoreboard.

The numbers here aren't definitive yet, as I still want to check whether there is any useful tuning of parameters to be done, as well as to figure out how to estimate some of the involved uncertainties. In any case, I believe they look fairly reasonable. Within each season, the ranking of races is generally very sensible. Comparing different eras of ZakStunts is, as one might expect, trickier. In particular, I feel the model might be overrating the 2010 races a little bit. Also, it is hard to tell whether the model underrates races from the first few seasons (2001-2004) as it moves towards a steadier state. Still, the chart does seem to capture the evolutionary arcs of ZakStunts: a steady increase in the level of the competition over the initial years, culminating in the 2005-2006 high plateau, followed by a sharp drop in 2007, and so forth.

I will now outline how this new strength estimation works. Some of what follows might be of interest beyond mere technical curiosity, for parts of the procedure can be useful for other investigations in Stunts analytics.

(By the way, you can check the source code of my program (https://github.com/duplode/elo-zs) on GitHub, if you are so inclined.)

When I set about resuming this investigation early this year, I decided to, instead of rolling yet another quirky algorithm from scratch, start from well-understood building blocks, so that, if nothing else, I would get something intelligible at the end. Balancing that principle with the known limitations of my chosen methods (and there are quite a few of them), I eventually ended up with the following pipeline of computations:

From the ZakStunts results, compute Elo ratings (https://en.wikipedia.org/wiki/Elo_rating_system) at every race.
Obtain, from the Elo ratings, victory probabilities against a hypothetical 1800-rated pipsqueak, and use those probabilities to parameterise a rough performance model, which amounts to the probability distribution of lap times (relative to an ideal lap) for a pipsqueak.
Add a fictitious 1500-rated pipsqueak to the list of race entrants, and either:
- Use the performance model to implement a race result simulator, which spits out possible outcomes when given a list of pipsqueaks and their ratings, and run the simulation enough times to be able to give a reasonable estimate of the likelihood of a top 5 finish by the fictitious pipsqueak; or
- Numerically integrate the appropriately weighted probability density for the fictitious pipsqueak to obtain, as far as the model allows, an exact result for said likelihood.

(Implicit in the above is that my code includes both an Elo rating calculator and a race result simulator, which can be put to use in other contexts with minimal effort.)

Let's look at each step a little closer. When it comes to ratings of competitors, Elo ratings are a pretty much universal starting point. They are mathematically simple and very well understood, which was a big plus given the plan I had at the outset. For our current purposes, though, the Elo system has one major disadvantage: it is designed for one-versus-one matches, and not for races. While it is certainly possible to approach a race as if it were the collection of all N*(N-1)/2 head-to-head matchups among the involved pipsqueaks, doing so disregards how the actual head-to-head comparisons are correlated with each other, as they all depend on the N pipsqueak laptimes. (To put it in another way: if you beat, say, FinRok in a race, that means you have achieved a laptime good enough to beat FinRok, and so such a laptime will likely be good enough to defeat most other pipsqueaks.) All that correlation means there will be a lot of redundant information in the matchups, the practical consequence being that a single listfiller or otherwise atypical result can cause wild swings in a pipsqueak's rating. Trying to solve the problem by discarding most of the matchups (say, by only comparing a pipsqueak with their neighbours on the scoreboard) doesn't work well either: since we only have ~12 races a year to get data out of, that approach will make the ratings evolve too slowly to be of any use. Eventually, I settled for a compromise of only using matchups up to six positions away on the scoreboard (in either direction), which at least curtails some of the worst distortions in races with 20+ entrants. Besides that, my use of the Elo system is pretty standard. While new pipsqueaks are handled specially over their initial five races for the sake of fairer comparisons and faster steadying of ratings, that is not outside the norm (for instance, chess tournaments generally take similar measures).
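To make the windowed-matchup idea concrete, here is a minimal Python sketch, not the code from the linked repository. It uses the textbook Elo expected-score formula and treats a scoreboard as a list of names ordered from fastest to slowest. The K-factor of 16, the default rating of 1500 for unknown names, and the function names are all assumptions made for illustration; the special handling of a newcomer's first five races is left out.

```python
from typing import Dict, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score of A in a matchup against B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def matchup_pairs(n: int, window: int = 6) -> List[Tuple[int, int]]:
    """Index pairs (i, j), i ahead of j, at most `window` positions apart."""
    return [(i, j) for i in range(n) for j in range(i + 1, min(i + window + 1, n))]


def update_ratings(
    scoreboard: List[str],
    ratings: Dict[str, float],
    k: float = 16.0,          # assumed K-factor, not taken from the post
    window: int = 6,
    default: float = 1500.0,  # assumed starting rating for unknown names
) -> Dict[str, float]:
    """Return new ratings after one race; scoreboard is fastest to slowest."""
    current = {name: ratings.get(name, default) for name in scoreboard}
    deltas = {name: 0.0 for name in scoreboard}
    for i, j in matchup_pairs(len(scoreboard), window):
        winner, loser = scoreboard[i], scoreboard[j]
        # The winner scored 1 where expected_score was predicted.
        gain = k * (1.0 - expected_score(current[winner], current[loser]))
        deltas[winner] += gain
        deltas[loser] -= gain
    updated = dict(ratings)
    for name in scoreboard:
        updated[name] = current[name] + deltas[name]
    return updated
```

With a window of six, a pipsqueak in the middle of a 20-entrant scoreboard is involved in twelve matchups (six ahead, six behind), while the race winner is involved in only six.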

Elo ratings are not enough to simulate race results, precisely because of the distinction between a collection of matchups and a single race discussed above. A simulation requires a model of the pipsqueak performance, so that individual simulated results for each pipsqueak can be put together in a scoreboard. One workaround to bridge this gap relies on victory probabilities. It is possible, given Elo ratings for a pair of pipsqueaks, to calculate how likely one is to defeat the other in a matchup. Similarly, if you have the laptime probability distributions for a pair of pipsqueaks, you can calculate how likely it is for one of them to be faster than the other. A few seat-of-the-pants assumptions later, we have a way to conjure a laptime probability distribution that corresponds to an Elo rating. As for the distributions, the ones I am using look like this:

(https://i.imgur.com/mbhvmpo.png)

This is a really primitive model, perhaps the simplest thing that could possibly work. It is simple enough that there are victory probability formulas that can be calculated with pen and paper. There is just one pipsqueak-dependent parameter. As said parameter increases, the distribution is compressed towards zero (the ideal laptime), which implies laptimes that are typically faster and obtained more consistently (in the plot above, the parameter is 1 for the blue curve and 2 for the red one). While I haven't seriously attempted to validate the model empirically, the features it does have match some of the intuition about laptimes. (On the matter of empirical validation, one might conceivably drive five laps on Default (http://forum.stunts.hu/index.php?topic=18.msg50237#msg50237) every day for a month and see how the resulting laptimes are spread. That would be a very interesting experiment, though for our immediate purposes the differences between RH and NoRH might become a confounding factor.)
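The text above does not name the distribution family, so the following Python sketch uses a stand-in: an exponential distribution over the time lost relative to the ideal lap, with a single rate parameter per pipsqueak. This is an assumption chosen for illustration, and the curves in the plot may well belong to a different family. The appeal of the stand-in is that it has a pen-and-paper victory formula, rate_a / (rate_a + rate_b), and that setting the rate to 10^(rating/400) reproduces the textbook Elo expected score exactly. The function names and the 1500 reference rating are hypothetical.

```python
import random


def rate_from_rating(rating: float, reference: float = 1500.0) -> float:
    """Map an Elo rating to an exponential rate (assumed stand-in model).

    A higher rate squeezes the time-loss distribution towards zero.
    """
    return 10.0 ** ((rating - reference) / 400.0)


def win_probability(rate_a: float, rate_b: float) -> float:
    """P(A loses less time than B) for two independent exponentials."""
    return rate_a / (rate_a + rate_b)


def elo_expected(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score, for comparison."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def sample_time_loss(rate: float, rng: random.Random) -> float:
    """Draw one simulated time loss relative to the ideal lap."""
    return rng.expovariate(rate)
```

For any pair of ratings, `win_probability(rate_from_rating(a), rate_from_rating(b))` agrees with `elo_expected(a, b)` up to floating-point error, which is the sense in which a laptime distribution can be made to "correspond" to an Elo rating.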

Having the laptime distributions for all entrants in a race makes it possible to figure out a formula that can be used to, in principle, numerically compute victory and top-n probabilities against its set of pipsqueaks. In practice, it turns out that victory probabilities aren't a good race strength metric, as the results tend to be largely determined by a small handful of pipsqueaks with very high ratings. To my eyes, the top-5 probabilities are at the sweet spot for strength estimations. I originally believed calculating the probabilities by numerical integration would be too computationally expensive (as the number of integrals to be numerically calculated grows combinatorially as the n in top-n grows), so I used the alternative strategy of simulating the races and afterwards checking how often top-5 results happen. The chart at the top of the post was generated after 100,000 runs per race, that is, twenty-three million and six hundred thousand runs to cover all ZakStunts races, which took fifteen and a half minutes to perform on my laptop. Later, I figured out that, with sufficiently careful coding, the numerical method, which has the advantage of giving essentially exact results, is feasible for top-5 probabilities; accordingly, an alternative Excel file with those results is also attached. (The simulations remain useful for wider top-n ranges, or for quickly obtaining coarse results with 1,000 to 10,000 runs per race while tuning the analysis parameters.)
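The simulation strategy can be sketched in a few lines of Python. As before, this is an illustration under assumptions rather than the actual program: it uses an exponential time-loss distribution as a stand-in for the performance model, with the rate set to 10^((rating - 1500)/400), and all function names are hypothetical. The strength is taken to be the reciprocal of the top-n probability, matching the reading of "strength around 70" as "a 1 in 70 chance".

```python
import random
from typing import List


def rate_from_rating(rating: float, reference: float = 1500.0) -> float:
    """Assumed stand-in model: exponential rate derived from an Elo rating."""
    return 10.0 ** ((rating - reference) / 400.0)


def top_n_probability(
    field_ratings: List[float],
    probe_rating: float = 1500.0,
    top_n: int = 5,
    runs: int = 10_000,
    seed: int = 0,
) -> float:
    """Estimate how often a fictitious probe entrant finishes in the top n."""
    rng = random.Random(seed)
    probe_rate = rate_from_rating(probe_rating)
    rival_rates = [rate_from_rating(r) for r in field_ratings]
    hits = 0
    for _ in range(runs):
        probe_loss = rng.expovariate(probe_rate)
        # Count the rivals who lose less time than the probe in this run.
        faster = sum(1 for rate in rival_rates if rng.expovariate(rate) < probe_loss)
        if faster < top_n:
            hits += 1
    return hits / runs


def race_strength(field_ratings: List[float], **kwargs) -> float:
    """Strength as the reciprocal of the probe's top-n probability."""
    p = top_n_probability(field_ratings, **kwargs)
    return 1.0 / p if p > 0 else float("inf")
```

A quick sanity check of such a simulator is the top-1 case against equally rated rivals: with three rivals rated 1500, the probe should win about one run in four. Note that when the true probability is small (say 1 in 70), a few thousand runs give only a coarse estimate, which is consistent with the larger run counts used for the chart.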

To turn the discussion back to sporting matters, the troubles with wild rating swings and outlier results alluded to above brought me back to the question of listfillers, already raised by Bonzai Joe all those years ago (http://forum.stunts.hu/index.php?topic=2735.msg49468#msg49468). Left unchecked, a particularly weak listfiller in a busy race can wreak havoc upon the Elo rating of its unfortunate author. That ultimately compelled me to look for objective criteria according to which at least some of the obvious listfillers can be excluded. For the current purposes, I ultimately settled on the following three rules:

Results above 300% of the winning time and more than two standard deviations away from the average of laptimes are to be excluded. (The Bonzai Joe rule.)
GAR and NoRH replays are only counted if the fastest lap on the parallel scoreboard they belong to is, or would be, above the bottom quarter (rounding towards the top) of the scoreboard. (The Marco rule.)
For our current purposes, a car is deemed "competitive" if it can be found above the bottom quarter (rounding towards the top) of the scoreboard, or if it was used to defeat a pipsqueak using a competitive car whose lap was not excluded according to the previous two rules. Only laps driven with competitive cars count. (The Alan Rotoi rule.)
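The first of these rules is simple enough to sketch in Python. The version below is a reading of the rule as stated, with some assumptions: laptimes are in seconds, the average and standard deviation are taken over every lap on the scoreboard (including the candidate for exclusion), and the population standard deviation is used. The function name is hypothetical. The other two rules need car and parallel-scoreboard data and are not sketched here.

```python
from statistics import mean, pstdev
from typing import List


def apply_bonzai_joe_rule(
    laptimes: List[float], ratio: float = 3.0, sigmas: float = 2.0
) -> List[float]:
    """Drop laps that are both above ratio * winning time and more than
    `sigmas` standard deviations away from the average laptime."""
    winning_time = min(laptimes)
    average = mean(laptimes)
    spread = pstdev(laptimes)  # assumption: population standard deviation
    return [
        t
        for t in laptimes
        if not (t > ratio * winning_time and abs(t - average) > sigmas * spread)
    ]
```

Both conditions have to hold for a lap to be dropped. On a scoreboard with a single huge outlier, the outlier fails both tests and goes away; on a scoreboard with a broad but continuous spread of laptimes, laps beyond 300% of the winning time can survive because they stay within two standard deviations of the average.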

These rules were applied to the full list of ZakStunts race entries that I'm using (it was one of those quarantine days back in May). Disqualified race results were also removed; there were a few curious findings in that respect I should write about one of these days. (By the way, ghosts are not included in the calculations, regardless of what their race entries look like.)

(A footnote: the bar for applying the first rule above looks, at first, incredibly low. I considered using a lower percentage, like 250%, and would rather not have the frankly bizarre standard deviation additional condition. It turns out, however, that ZCT029, a difficult dual-way full Vette PG track from 2003, had an extraordinarily broad spectrum of laptimes, including several pipsqueaks with non-listfiller laps beyond the 300% cutoff which would have been excluded without the standard deviation test. Faced with such a peculiar scoreboard, I opted to err on the side of circumspection.)

It remains a tall order to find objective criteria to discard listfillers that won't exclude too many proper competitive laps as a collateral effect. Ultimately, if we were to establish new pipsqueak rankings, I suspect different use cases would call for different kinds of ratings. An Elo-like ranking is appropriate for race strength estimations, simulations and predictions, when what is needed is a picture of how well someone is racing at a specific moment in time. For comparing performances within the last several months, though, a ranking of weighted (for instance, by recency or race strengths) race scores within a time window, in the style of SWR, might prove more appropriate. With this kind of ranking, it becomes reasonable to have, for instance, ZakStunts-style worst results discards, which could definitely help dealing with listfillers.

Anyway, by now I probably should stop rambling, at least for a little while :D Questions, comments, criticism, suggestions about the metric and ideas on cool stuff to do with those algorithms are all welcome!

Title: Re: Race strength estimation, revisited
Post by: dreadnaut on December 27, 2020, 01:25:16 PM

My main takeaway is that I won the races with the lowest scores :P

Title: Re: Race strength estimation, revisited
Post by: Duplode on December 27, 2020, 04:10:37 PM

Quote from: dreadnaut on December 27, 2020, 01:25:16 PM
My main takeaway is that I won the races with the lowest scores :P

I think you were looking one race ahead in that crowded chart. Let's zoom into 2017 and 2019:

(https://i.imgur.com/XxYSNZ4.png)

(https://i.imgur.com/1AndU3s.png)

According to the model, both Z194 and Z210 were the fifth strongest races in their respective seasons, with Z194 being the strongest race in the second half of 2017.

Title: Re: Race strength estimation, revisited
Post by: Cas on December 27, 2020, 07:50:50 PM

Wow! This blew my mind! I had been thinking about something like this... Correction, definitely not like this, not this elaborate! Something about the same matter, I had been thinking in the past. I didn't get to the part of running experiments and doing any actual math, although I did have in my mind (usually while in the shower or while walking) a mathematical fight on how to determine two things. The analysis here is so good that what I'm going to say probably won't add much, but in case it might help in any way, here it is:

I wanted to give tracks and pipsqueaks numbers. In the case of the track, there should be a theoretical best possible lap, even in free style. The possibilities are so huge that we can't find this value, but we can know it has to exist. Now, for a human, theory suggests that while possible, this lap is unobtainable, because there are so many lower laps with higher probability. But if you go down, at some point, you'll find the next best thing: the lap the best pipsqueak is most likely to achieve after the duration of the race if he is allowed to try his best (and he does). Again, super hard to obtain... but this one could be estimated if we knew the proportion of how good each pipsqueak is relative to the "best pipsqueak" and how well they perform on the track when given the same time and when we know they are trying their best.

This leads to the other thing I wanted to calculate: how good a pipsqueak is, in numbers. This is complex because pipsqueak quality is dynamic and because tracks don't just vary in length, but some stunts may be easier for some pipsqueaks and others for others. But I think what has the greatest effect is the evolution of a pipsqueak. A pipsqueak may start participating and not get such good results, until he understands what the thing is about, becomes interested and that's when he starts trying his best. This will not necessarily happen at the beginning. A pipsqueak could spend years just posting to fill the scoreboard and one day decide to take it seriously. When he tries his best, he'll start improving quickly and will get to a maximum. After some time, even his first try on a race will be much better than before this improvement so even when he doesn't have much time, his posting will always be reasonable. From this point on, his laps will fluctuate depending on how much time or interest he had during that race, almost always staying above that lap time and below the maximum he obtained in the previous period. Of course, seeing that your lap time has been superseded by somebody else usually pushes you to try a better lap, unless the difference is too much and you don't think you can do it. So, in my opinion, the most stable point to qualify how good a pipsqueak is is not that of his best results, but that lower "mesa" after his peak. It seems to me that, to compare pipsqueaks, we should compare their first replays on a track, performed relaxed. When they try their best, it can vary much more.

It would be interesting to make a test in R4K... a completely "blind race", in which the results of replays are never posted and you only see the whole scoreboard at the end. Like it is all quiet days. Then compare that to a similar track on a regular race. Another test could be to announce a race with time for everybody to get ready, but only give them a couple of days to post replays, but the results there would be similar to those of Le Stunts races, perhaps, or at least, moderately proportional.

Alright, this is brainstorming as usual. Hope anything of what I wrote will be of any use. There's no math there, so it's not as robust, but when one speaks and speaks, something has to be useful.

Title: Re: Race strength estimation, revisited
Post by: Duplode on December 28, 2020, 05:53:30 AM

Quote from: Cas on December 27, 2020, 07:50:50 PM
I wanted to give tracks and pipsqueaks numbers. In the case of the track, there should be a theoretical best possible lap, even in free style. The possibilities are so huge that we can't find this value, but we can know it has to exist. Now, for a human, theory suggests that while possible, this lap is unobtainable, because there are so many lower laps with higher probability. But if you go down, at some point, you'll find the next best thing: the lap the best pipsqueak is most likely to achieve after the duration of the race if he is allowed to try his best (and he does). Again, super hard to obtain... but this one could be estimated if we knew the proportion of how good each pipsqueak is relative to the "best pipsqueak" and how well they perform on the track when given the same time and when we know they are trying their best.

Ah, the mythical perfect lap... I had actually been thinking about this matter back in May, and now that I understand the problem a little better I will have another look at it. My plan would be roughly like this: suppose we have a race scoreboard with laptimes, plus the performance model for the pipsqueaks. If we trust the models (and that's a big if!) and can guess where the laptime of each pipsqueak falls in their probability distribution curve (maybe by speculating about how well they have raced that month by their standards), we can plot the laptimes against the positions along the curve and estimate the ideal laptime by doing a linear regression. Later I will find a ZakStunts race I know enough about to attempt that.
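One possible reading of that plan, sketched in Python under heavy assumptions: take an exponential time-loss distribution as a stand-in for the performance model (with rate 10^((rating - 1500)/400)), guess for each pipsqueak the quantile of their own distribution at which this month's lap falls, convert that quantile into a model time loss, and fit a straight line of actual laptime against model time loss. The intercept is then the estimate of the ideal laptime and the slope is the conversion factor from model units to seconds. Every name below is hypothetical, and the guessed quantiles are the speculative part.

```python
import math
from typing import List, Tuple


def rate_from_rating(rating: float, reference: float = 1500.0) -> float:
    """Assumed stand-in model: exponential rate derived from an Elo rating."""
    return 10.0 ** ((rating - reference) / 400.0)


def time_loss_at_quantile(rate: float, q: float) -> float:
    """Inverse CDF of the exponential distribution, for 0 <= q < 1."""
    return -math.log(1.0 - q) / rate


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def estimate_ideal_laptime(
    laptimes: List[float], ratings: List[float], quantiles: List[float]
) -> Tuple[float, float]:
    """Return (estimated ideal laptime, seconds per unit of model time loss)."""
    losses = [
        time_loss_at_quantile(rate_from_rating(r), q)
        for r, q in zip(ratings, quantiles)
    ]
    return fit_line(losses, laptimes)
```

The estimate is only as good as the guessed quantiles and the model itself, so this is better seen as a way to explore the idea than as a measurement.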

On the rankings, you are certainly right in that there are lots and lots of confounding factors, and some of them, like RH optimisation effort, are extremely difficult to quantify (unless the pipsqueak keeps a diary!). If you look closely at what happens on a track, other difficult questions appear (for instance: how meaningful is it to compare laptimes obtained using different racing lines?). Any attempt at an all-time ranking based upon scoreboard data will run against this sort of limitation.

If I were to propose an all-time ranking based on Elo ratings, taking each pipsqueak's highest rating looks appealing at first, because it is easy to figure out, and also because it reflects more than a single moment in time (while Elo ratings are primarily a measure of current performance levels, some information about past results does go into the number). Still, your thoughts about the "mesa" and steady performances are interesting. I wonder how those could be measured (moving averages? uncertainty of ratings?). Perhaps patterns along which ratings evolve are in themselves worthy of investigation. (This discussion, by the way, reminds me a bit of the fabulous F1Metrics model (https://f1metrics.wordpress.com/2019/11/22/the-f1metrics-top-100/), which kind of takes a hybrid approach, by using average scores over the best three consecutive years to build an all-time ranking.)

Title: Re: Race strength estimation, revisited
Post by: Cas on December 28, 2020, 06:42:06 AM

We should make a study. We have abundant information from previous ZakStunts races and R4K is now starting to accumulate a reasonable amount as well. OWOOT vs FreeStyle is a very interesting comparison to make in this regard!

I am not very familiar with measurements and statistics outside Stunts and that really is a bad thing. I should be more educated on this so that I can add better value to the topic. Yet, I do have some instinct that saves me sometimes. My feeling is that we should start by analysing the fluctuations in the performance of a pipsqueak through time and do the same with other pipsqueaks and try to figure out structures like the ones I described that really are my guesses, my feelings. It'd be good to see how true or not they are and if there are others instead. With that knowledge, we could begin to analyse a race in a different way: knowing that each pipsqueak participating in it is not just a pipsqueak, but a pipsqueak at a certain point of its evolution, with that point added as a parameter.

Another thing that probably would be useful comes from what you just mentioned... changing the line. It'd be interesting to generate "scoreboards" for partial races. That is, a scoreboard at the first half minute, at the first minute, at the third half minute, and so on. Or, instead of making it space-based, make it time-based. A scoreboard after the first 5 days of the race going on, another after 10 days, etc. There will be missing pipsqueaks in some of them, but we can complete the thing. We could also make the same experiment but counting time not from the beginning of the race, but from each pipsqueak's first posted lap. This assumes a pipsqueak will first post probably on the same day of first sitting to try the race.

This track/pipsqueak problem sounds to me like a two-unknown, two-equation system in which unknowns are hyper-complex vectors and the equations are non-linear.

Title: Re: Race strength estimation, revisited
Post by: Daniel3D on December 28, 2020, 08:52:28 AM

I've been reading this with interest. And it is beautiful, I think I understand most of it. But it made me wonder. In Cas's Bliss there is a track analysis that can generate estimated lap times from a number of raçers.

Can that be adapted to generate a difficulty modifier for the race strength estimation?

Title: Re: Race strength estimation, revisited
Post by: Cas on December 28, 2020, 09:02:04 AM

Well, Duplode is much more educated than I am about statistics and data analysis and I must admit I understood a good portion of his work on this topic, but not all, so it all depends. As I understand it, "race strength" is a concept that has to do with the race, not the track, so for Bliss to be able to do anything like this, it would have to access scoreboard information for races. The Track Analysis menu option actually works with tracks, so it's a different monster.

In theory, I could add a completely new menu dedicated to different kinds of analysis about races and pipsqueaks or this could be a sub-menu of the Tournament menu, but doing this would add so much code to the project that it probably makes more sense to develop it separately. Besides, at this point, my insight is very vague on the topic. I would need a very strong and reliable knowledge and understanding of these statistics to be able to do something useful with them. But who knows? Maybe if we first get to elaborate more on this and discover something new, eventually Bliss, or another tool, could fulfill this purpose :)

Title: Re: Race strength estimation, revisited
Post by: Duplode on December 29, 2020, 02:38:41 AM

Quote from: Daniel3D on December 28, 2020, 08:52:28 AM
I've been reading this with interest. And it is beautiful, I think I understand most of it. But it made me wonder. In Cas's Bliss there is a track analysis that can generate estimated lap times from a number of raçers.

Can that be adapted to generate a difficulty modifier for the race strength estimation?

Thank you :) On the track analysis, it is pretty much as Cas says: as I'm currently doing it, the race strength analysis abstracts away all information about tracks, cars, racing lines and so forth. There might be ways to incorporate at least a little of that knowledge -- as an off-the-cuff example, a plausible experiment would be twisting the Elo parameters so that, say, powergear or IMSA races have extra influence over the ratings. Still, in general it can be a challenge to define such variables in a way that isn't too subjective and then to quantify them in an effective way. Even something as intuitive as the notion of track difficulty can be tricky to translate to a race analysis context. For instance, ZCT086 (http://zak.stunts.hu/tracks/ZCT086) is one of the simplest tracks ever raced in ZakStunts, and yet this very simplicity made figuring out places where tenths could be gained a challenge of its own, resulting in a pretty hard race with some very strong replays.

Quote from: Cas on December 28, 2020, 06:42:06 AM
My feeling is that we should start by analysing the fluctuations in the performance of a pipsqueak through time and do the same with other pipsqueaks and try to figure out structures like the ones I described that really are my guesses, my feelings. It'd be good to see how true or not they are and if there are others instead. With that knowledge, we could begin to analyse a race in a different way: knowing that each pipsqueak participating in it is not just a pipsqueak, but a pipsqueak at a certain point of its evolution, with that point added as a parameter.

I, for one, will try to keep an eye on that sort of thing when I look at the data. (While I didn't say much about the ratings in the opening post here, the program can do quite a few things with the data already, including extracting the rating history of a selected pipsqueak.)

(On the topic of uncertainty, it is worth mentioning that there are more sophisticated systems that try to keep track of it, so that, for instance, pipsqueaks that took part in a lot of recent races are given more stable ratings, on the grounds of uncertainty being presumably reduced by activity. I briefly tried Glicko, which is one of those systems, back in May, but either I didn't figure out how to tune its configuration parameters correctly or the quirks of the ZakStunts dataset -- the free-for-all problem I mentioned earlier, the relatively low number of events per year, the fluctuations in pipsqueak activity -- make it hard to effectively calculate uncertainties for that purpose.)

Title: Re: Race strength estimation, revisited
Post by: Cas on December 29, 2020, 03:55:13 AM

It does sound super complicated and I figure Free-style makes it even harder, because on one hand, "track length" is a very ambiguous concept and cannot be utilised to estimate optimal laps out of real laps in this style. And on the other hand, the "virtual track length" as understood at a point of the race may suddenly change when a pipsqueak figures out a trick and posts a replay that's made public or even worse, if it is not public, one or maybe a few pipsqueaks will know the trick and others won't, so there will be two different theoretical optimal laps. That is, there will be two races running in one scoreboard. The more I think of this, the harder it seems to analyse.

Needless to say, if it's useful to you, I could pack all the data from R4K and send it to you. That is, the scoreboards for all races, for example, in their native text format so that you can process them quickly.

Title: Re: Race strength estimation, revisited
Post by: Duplode on December 29, 2020, 04:16:29 AM

Quote from: Cas on December 29, 2020, 03:55:13 AM
Needless to say, if it's useful to you, I could pack all the data from R4K and send it to you. That is, the scoreboards for all races, for example, in their native text format so that you can process them quickly.

Yup, please send them; it would be nice to feed them to the program and see what happens :) The races with mixed cars and rules shouldn't cause problems, as with the full scoreboard data we can figure out programmatically which results to include.

Title: Re: Race strength estimation, revisited
Post by: Cas on December 29, 2020, 05:48:43 AM

Alright. Here it is. It also contains the replays. What it does not include is the few most recent races because those are only in the server so far (I still haven't backed them up in my computer) and at this moment, KyLiE is upgrading the server, so the site will be offline briefly, but this is a lot of info already. I can add the other two races later.

Look at the thisrace.sb files. These contain the scoreboard. The format is probably not exactly what you were expecting, but when you see it, you'll understand it immediately. Very easy to parse. It includes links to replays that are not participating in the race, like superseded and rejected ones, so you may want to do some filtering when processing the data.

Title: Re: Race strength estimation, revisited
Post by: Duplode on December 30, 2020, 04:51:00 AM

Thanks for the files, Cas; once I get to perform the calculations with the R4K data I will post the results here.

On another note, I realised that, contrary to what I thought at first, it is feasible in terms of how long the computations take to, for top-5 strength at least, integrate the probabilities and obtain exact results (as far as the numerical integration allows). I have added a second Excel file to the initial post with those exact results. (I didn't replace the chart on the post because the results are qualitatively similar, so that you won't see much difference unless you zoom in.)

Title: Re: Race strength estimation, revisited
Post by: Duplode on January 02, 2021, 11:54:22 PM

As promised, here is a strength chart for R4K (the spreadsheet it came from is attached):

(https://i.imgur.com/7B2E0Ph.png)

A couple of notes:

These strengths are based on top 3 probabilities, rather than top 5 as in the ZakStunts charts, to better fit the typical field size.
R4K has combined RH+NoRH scoreboards. Accordingly, I have used the best lap of each pipsqueak regardless of modality to obtain the race classifications. Were we publishing the individual Elo ratings, it would arguably be fairer to separate RH and NoRH. In the context of a strength metric for a contest with combined scoreboards, though, I'd say it is not unreasonable to keep them together.
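The "best lap regardless of modality" step amounts to a small reduction over the scoreboard entries. Here is a minimal Python sketch; the tuple layout (name, modality, laptime in seconds) and the function name are assumptions for illustration and do not reflect the actual thisrace.sb format.

```python
from typing import Dict, Iterable, List, Tuple

Entry = Tuple[str, str, float]  # (pipsqueak, "RH" or "NoRH", laptime in seconds)


def combined_classification(entries: Iterable[Entry]) -> List[Tuple[str, float]]:
    """Keep each pipsqueak's fastest lap, ignoring modality, sorted by time."""
    best: Dict[str, float] = {}
    for name, _modality, laptime in entries:
        if name not in best or laptime < best[name]:
            best[name] = laptime
    return sorted(best.items(), key=lambda item: item[1])
```

The resulting ordered list is the kind of input a rating update or a strength calculation would then consume. Superseded or rejected replays would need to be filtered out before this step.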

Title: Re: Race strength estimation, revisited
Post by: Overdrijf on January 03, 2021, 12:08:22 AM

Just knowing I have a Stunts elo now makes me nervous about losing points of it. ;)

Title: Re: Race strength estimation, revisited
Post by: Cas on January 03, 2021, 12:39:55 AM

Yes! I think it'd be nice to have ratings like these in our profiles in the Wiki. In R4K, I've introduced a scoring system just at the start of this season. It's super basic. It'd be nice to have something like Elo ratings inside the profiles there too. On the other hand, Elo ratings are very dynamic. I would like to see if it's possible to categorise the variation of a pipsqueak's rating like, knowing that everybody follows more or less the same shape in the evolution of their rating, only sometimes wider or more narrow and sometimes higher or more flat, we could describe a curve "envelope" and thus, define a constant rating, that is, a "projected" pipsqueak rating. Now with this analysis for two very different sets of rules, as are those of ZakStunts and R4K, the comparison between both will surely give interesting information.

About RH/NoRH, I think that taking the best position regardless of the style is representative of RH, because if one surpasses their own RH lap with a NoRH one, it means one could've done it RH-ly. So I don't think there will be any distortion and the calculated information should be accurate for RH. On the other hand, if we want to analyse NoRH, we should use only NoRH laps. The problem there is that they are a lot fewer.

The current, very simple scoring system in R4K does precisely as Duplode did: take the best position for each pipsqueak, no matter the style. 2 points for participating and getting at least one verified replay. 1 to 3 extra points for reaching the podium positions at a race. I could add an extra point for posting a NoRH lap, but I think it's not good to encourage NoRH because that also encourages cheating, as there is currently no good way to check for NoRH. Most of you will think "But we're all good guys and good friends of one another"... yeah, until one is not and we can't tell who, ha, ha. This could be compensated with a "trustworthiness handicap index" XD

Title: Re: Race strength estimation, revisited
Post by: Duplode on January 03, 2021, 04:47:48 AM

Quote from: Overdrijf on January 03, 2021, 12:08:22 AM
Just knowing I have a Stunts elo now makes me nervous about losing points of it. ;)

I wouldn't read too much into them. As Cas notes, the ratings are very dynamic:

(https://i.imgur.com/phCmHNy.png)

In fact, they probably fluctuate a lot more than the usual chess (or, say, football (https://eloratings.net/latest)) Elo ratings. (Keep in mind that, in my implementation, taking part in a race might include a pipsqueak in up to twelve matches, depending on where they land on the scoreboard.) I feel the ratings probably should fluctuate less wildly; however, if I attenuate the changes too much it takes forever for new pipsqueaks to reach a meaningful rating. That is one of the difficulties of trying to shoehorn Elo into a free-for-all format. In any case, it seems any individual ratings should at this point be taken primarily as a measure of recent form, and attaching further meaning to them would require extra rounds of thought, and potentially tuning.
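The trade-off between stability and how quickly newcomers settle can be illustrated with a minimal sketch. It uses plain textbook Elo for 1v1 matches, not the engine behind the charts in this thread. The K values, the 1500 starting rating, the 1900 "true" level and the `matches_to_converge` helper are all made up for the illustration.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Standard Elo expected score of A in a single match against B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update(rating_a, rating_b, score_a, k):
    """New rating of A after one match; score_a is 1 (win), 0.5 (draw) or 0 (loss)."""
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))


def matches_to_converge(k, start=1500.0, true_level=1900.0,
                        tolerance=50.0, limit=100_000):
    """Count matches until a newcomer gets within `tolerance` of `true_level`.

    Idealised scenario (an assumption): every match is against an opponent
    rated `true_level`, and the newcomer always scores 0.5, which is the
    long-run average result between equally strong competitors.
    """
    rating, matches = start, 0
    while true_level - rating > tolerance and matches < limit:
        rating = update(rating, true_level, 0.5, k)
        matches += 1
    return matches


# A smaller K gives calmer ratings but a slower climb for the newcomer.
slow = matches_to_converge(k=8)
fast = matches_to_converge(k=32)
```

In this toy setting `slow` comes out several times larger than `fast`, which is the "takes forever for new pipsqueaks" effect in miniature. A real free-for-all engine feeds several such matches per race into the update, so the actual numbers would differ.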

Quote from: Cas on January 03, 2021, 12:39:55 AM
The current, very simple scoring system in R4K does precisely as Duplode did: take the best position for each pipsqueak, no matter the style. 2 points for participating and getting at least one verified replay. 1 to 3 extra points for reaching the podium positions at a race.

I have only now seen the season scores on the R4K site. That's pretty nice! :) And I agree on not adding NoRH extra points in a mixed scoreboard race.

Title: Re: Race strength estimation, revisited
Post by: dreadnaut on January 03, 2021, 06:18:57 PM

Have you been here already, Duplode? https://arxiv.org/pdf/2008.06787.pdf

Title: Re: Race strength estimation, revisited
Post by: Daniel3D on January 03, 2021, 06:53:48 PM

Quote from: dreadnaut on January 03, 2021, 06:18:57 PM
Have you been here already, Duplode? https://arxiv.org/pdf/2008.06787.pdf

Can't use it myself but quite interesting for sure..

Title: Re: Race strength estimation, revisited
Post by: Duplode on January 03, 2021, 09:02:55 PM

Quote from: dreadnaut on January 03, 2021, 06:18:57 PM
Have you been here already, Duplode? https://arxiv.org/pdf/2008.06787.pdf

No, I hadn't; thanks for pointing me to it! This paper was posted on arXiv in August, while most of my research was done between April and May. I see at least three nice things in it:

An evaluation of various metrics for evaluating the rating systems, which could be a huge boon when it comes to tuning the system parameters.
A concise presentation of TrueSkill, which I might want to try adapting at some point in the future.
The way the results suggest all systems are wrong, only in different ways. Besides being psychologically reassuring, that squares with my observation from May that replacing my Elo engine with a Glicko implementation (http://hackage.haskell.org/package/rating-systems-0.1/docs/Ratings-Glicko.html) didn't actually improve things.

One interesting detail I noticed is that the paper's adaptation of Elo to free-for-all, described in section III.A, involves calculating winning probabilities by summing pairwise victory probabilities against all pipsqueaks and dividing by the number of pairwise combinations. That looks like a pretty rough approximation! (If I'm reading it right, it's as if each pairwise match were equally likely to decide the winner, which might not be too far from the truth if everyone's ratings are close to each other.) At any rate, unless I'm missing something it doesn't seem that much more sophisticated or principled than whatever it is I have been doing here. It's a tricky problem, really.
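As a rough sketch of that adaptation as described above (a reading of section III.A, not the paper's reference code), the win probability of each pipsqueak is the sum of their pairwise Elo win probabilities divided by the number of pairs, N(N-1)/2. The function names and the 400-point scale are assumptions made for the illustration.

```python
def pairwise_win_probability(rating_a, rating_b, scale=400.0):
    """Usual Elo probability that A beats B in a 1v1 match."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def ffa_win_probabilities(ratings):
    """Free-for-all win probabilities: summed pairwise wins over number of pairs."""
    n = len(ratings)
    pairs = n * (n - 1) / 2
    return [
        sum(pairwise_win_probability(r_i, r_j)
            for j, r_j in enumerate(ratings) if j != i) / pairs
        for i, r_i in enumerate(ratings)
    ]


# One overwhelming favourite against three midfielders.
probabilities = ffa_win_probabilities([2500, 1500, 1500, 1500])
```

Since each pair contributes exactly 1 to the total, the probabilities add up to 1. However, no single pipsqueak can ever be assigned more than (N-1) / (N(N-1)/2) = 2/N, so the 2500-rated favourite in the four-pipsqueak example above stays just below 0.5 no matter how large the rating gap is. That is one concrete way in which the approximation is rough.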

Title: Re: Race strength estimation, revisited
Post by: Daniel3D on January 03, 2021, 09:55:46 PM

Quote from: Duplode on January 03, 2021, 09:02:55 PM
(If I'm reading it right, it's as if each pairwise match were equally likely to decide the winner, which might not be too far from the truth if everyone's ratings are close to each other.)

I noticed that too when skimming through the article.
My thought was that they assumed single races. Free for all, anyone can participate so anyone can win. The skill of the player is unknown and therefore cannot be a factor in the formula.

But I'm not sure.

Title: Re: Race strength estimation, revisited
Post by: Duplode on January 03, 2021, 10:24:58 PM

Quote from: Daniel3D on January 03, 2021, 09:55:46 PM
My thought was that they assumed single races. Free for all, anyone can participate so anyone can win. The skill of the player is unknown and therefore cannot be a factor in the formula.

Yup, and there lies the gap that's hard to bridge. One-to-one comparisons between pipsqueaks are nice and easy to make calculations about, but they don't reflect very well what goes on in a free-for-all match. (If you compete against five pipsqueaks in a race, your performance doesn't have five degrees of freedom, but only one, namely your laptime.)

(The most principled-looking approach I have seen yet to handling free-for-all matches on their own terms, without trying to reduce them to 1v1 matches, is the one in this Glickman paper (http://www.glicko.net/research/multicompetitor.pdf). The math there, however, is a little intimidating, and I couldn't bring myself to sit down and digest it properly just yet. In the meantime, I keep throwing approximations at it to see what sticks.)

Title: Re: Race strength estimation, revisited
Post by: Cas on January 03, 2021, 11:31:09 PM

Just a thought. If you consider a race to be defined by its final scoreboard, then indeed, your final lap time is your only degree of freedom, but I'm thinking that if a race is the period during which you can post replays and look at the scoreboard, download other people's replays, compare strategies, post again, decide when to post your replay taking into consideration how other pipsqueaks will react, etc., then definitely there is at least one more degree of freedom here. One might argue this is irrelevant because in chess, you can compute a player's rating based only on the final result of the match, even though during the game, there can be a reaction at every move, but the truth is a player does not always try his/her best and does not always choose the same strategy. If you have the ability to make your opponent change strategy during the game, then we could see that opponent as a "set of opponents" instead, for each of which we have a degree of freedom.

This may sound pretty crazy, but I'll extrapolate this in a purely philosophical way towards physics. Would it be accurate to call these extra degrees of freedom "internal"? And if so, can we, in the same way that we ignore internal forces within a floating object, or in which charges inside a Faraday cage remain isolated from the environment, just leave these freedoms out?

Title: Re: Race strength estimation, revisited
Post by: Duplode on January 05, 2021, 04:37:30 PM

Quote from: Cas on January 03, 2021, 11:31:09 PM
This may sound pretty crazy, but I'll extrapolate this in a purely philosophical way towards physics. Would it be accurate to call these extra degrees of freedom "internal"? And if so, can we, in the same way that we ignore internal forces within a floating object, or in which charges inside a Faraday cage remain isolated from the environment, just leave these freedoms out?

Yup, that makes sense. A bare Elo rating abstracts away all those internal degrees of freedom. To put it another way, if a strong pipsqueak only has time for five-minute listfillers for a whole season, the other pipsqueaks will eventually come to expect the listfillers, regardless of what could be possible in more favourable conditions. Perhaps there is a way to split ratings into a long-term baseline and a more volatile part by somehow accounting for "internal" factors such as pipsqueak activity; it's territory that has yet to be charted.

Title: Re: Race strength estimation, revisited
Post by: Duplode on January 16, 2021, 12:59:46 AM

I have implemented the NDCG metric from the paper dreadnaut had suggested (https://arxiv.org/pdf/2008.06787.pdf). (NDCG is one way to measure how close the results you might have predicted from the ratings before the race were to what actually happened, with a value of 1 meaning perfect agreement, and accuracy at higher positions on the scoreboard being given extra weight.)
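For readers unfamiliar with the metric, here is a minimal sketch of one common NDCG formulation. The linear gain (the winner gets N points of relevance, the last place gets 1) and the log2 discount are assumptions. The paper's exact definition may differ, so the absolute values this produces should not be compared with those in the charts.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: earlier slots in the list weigh more."""
    return sum(rel / math.log2(slot + 1)
               for slot, rel in enumerate(relevances, start=1))


def ndcg(predicted_order, actual_order):
    """NDCG of a predicted finishing order against the actual scoreboard.

    Both arguments list the same names, best first. Relevance is assumed
    to be linear in the actual position: N for the winner, 1 for last.
    """
    n = len(actual_order)
    relevance = {name: n - i for i, name in enumerate(actual_order)}
    predicted = [relevance[name] for name in predicted_order]
    ideal = sorted(relevance.values(), reverse=True)
    return dcg(predicted) / dcg(ideal)


actual = ["A", "B", "C", "D"]
perfect = ndcg(["A", "B", "C", "D"], actual)   # prediction matches the result
top_swap = ndcg(["B", "A", "C", "D"], actual)  # wrong about the winner
tail_swap = ndcg(["A", "B", "D", "C"], actual) # wrong about the last two
```

With this formulation `perfect` is exactly 1, and `top_swap` is lower than `tail_swap`, reflecting the extra weight given to accuracy at the top of the scoreboard.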

As one might expect, the NDCG for ZakStunts races fluctuates wildly:

(https://i.imgur.com/SXHpHJu.png)

An exponential smoothing of the time series, though, shows that for the last several years of ZakStunts the fluctuations have been around 0.55, which is in the same ballpark as the results in the paper:

(https://i.imgur.com/j5kA15S.png)
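For reference, simple exponential smoothing can be sketched as follows. The smoothing factor `alpha` and the sample values are assumptions; the thread does not state which factor was used for the chart.

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing; alpha in (0, 1], smaller means smoother."""
    smoothed = []
    for value in values:
        if not smoothed:
            smoothed.append(value)  # start from the first observation
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical per-race metric values, for illustration only.
series = [0.0, 1.0, 1.0, 0.0]
smooth = exponential_smoothing(series, alpha=0.5)
```

Each smoothed value blends the newest observation with the running estimate, so isolated spikes are damped while longer trends remain visible.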

One thing I have been trying to achieve with NDCG is using it to fine-tune the parameters of my Elo engine. While parameter adjustments lead to localised changes in NDCG here and there, looking at the timeline as a whole the overall effect is very small. I have also attempted to use predicted positions obtained from simulations instead of just using the order of the ratings (I suspect that may ultimately give more informative NDCG results), but as far as the influence of Elo parameters is concerned that made no difference. To make progress on this front, I may also have to turn to the questions Cas has been raising here and try to think of a metric for how reasonable the evolution of individual pipsqueak ratings is.

Title: Re: Race strength estimation, revisited
Post by: Cas on January 16, 2021, 07:15:27 AM

Besides individual pipsqueak evolution, looking at the first graph on your last post, with its fluctuations, makes me think about the track characteristics. When a measurement of a certain function of time f(t) against t looks this wild in a graph, my first thought is "maybe t is not my variable". In other words, these fluctuations suggest that races have qualitative aspects (or maybe quantitative too) that change from race to race, but that do not "evolve" with time, but instead, depend on something else. It may be not just interesting, but also perhaps even useful, to analyse the characteristics of the track for each race putting in different baskets the ones that gave high and respectively low values, both in just strength and in the precision of the measurement of strength. I think the odds are high that we find a pattern. It may be as simple as complex tracks vs simple tracks or long tracks vs short tracks.

It's 3:00am right now. Otherwise, instead of suggesting this, I would first go and check the tracks myself and then directly post my first impression :P

Title: Re: Race strength estimation, revisited
Post by: Duplode on January 16, 2021, 11:51:39 PM

Quote from: Cas on January 16, 2021, 07:15:27 AM
When a measurement of a certain function of time f(t) against t looks this wild in a graph, my first thought is "maybe t is not my variable".

Though I haven't gotten to the qualitative part yet, your suggestion made me have a closer look at a few selected points in the graph. That ultimately led to a course correction, so thanks for getting me back on track :) More specifically, I realised that:

Due to a programming mistake, my NDCG code only accounted for newbies from their second race onward, messing up the calculations for races in which there were debuts. Fixing that removed a lot of artifacts from the results (I have replaced the charts in my previous post; if you look at them now you'll see the fluctuations got noticeably tamer).
After the fix, doing comparative tests with varying Elo parameters has quickly shown the NDCG as specified in the paper is not as robust as we'd like it to be. The main problem is that predicting race results simply by putting ratings in order means we don't distinguish between a pipsqueak coming out ahead against a rating gap of, say, 1 point and doing the same against a 100 point gap. That being so, if ratings are close, negligible fluctuations can have a disproportionate effect on the NDCG.

Those observations have rekindled my interest in using simulations to obtain the predicted results for the NDCG calculations, as a way of dealing with the robustness issue (for instance, the average positions of two pipsqueaks over a batch of simulations will be much closer if they have a 1 point gap than in the case of a 100 point gap). Here are some charts, without and with smoothing:

(https://i.imgur.com/G4I5nmE.png)

(https://i.imgur.com/R1rOMRn.png)

Quite a bit better. I don't think this alternative strategy is flawless, either (for one, it probably underrates predictions for races in which ratings are on the whole close to each other), but I'm far more confident about the results now.
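The idea of taking predicted positions from simulations can be sketched as below. This is not the code behind the charts. In particular, drawing each performance as rating plus Gumbel noise is an assumption, chosen only because it makes the pairwise win probabilities match the usual Elo curve. The field, the number of runs and the function name are hypothetical as well.

```python
import math
import random


def simulate_average_positions(ratings, runs=5000, scale=400.0, seed=0):
    """Average finishing position of each name over a batch of simulated races."""
    rng = random.Random(seed)
    beta = scale / math.log(10)  # Gumbel scale that reproduces the Elo curve
    names = list(ratings)
    totals = {name: 0 for name in names}
    for _ in range(runs):
        performance = {}
        for name in names:
            # Clamp the uniform draw so both logarithms stay finite.
            u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
            performance[name] = ratings[name] - beta * math.log(-math.log(u))
        ranking = sorted(names, key=performance.get, reverse=True)
        for position, name in enumerate(ranking, start=1):
            totals[name] += position
    return {name: totals[name] / runs for name in names}


# B and C are 1 point apart; A and B are 100 points apart.
field = {"A": 1700, "B": 1600, "C": 1599, "D": 1500}
averages = simulate_average_positions(field)
```

In a run like this, the average positions of B and C end up almost identical, while A sits clearly ahead of B. A prediction built from these averages therefore treats a 1 point gap as the near coin flip it is, instead of as a firm ordering.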

Title: Re: Race strength estimation, revisited
Post by: Cas on January 17, 2021, 07:45:10 AM

Oh! Yes, I definitely see a difference now and even without the smoothing, I can see a t-based variation there! The noise is not so wild now, so most of it (or all) could be accounted for with just chance. That is, track qualities surely still have an impact, but it's not as significant as the first graphs led me to think. I can't help but see a very marked change around C75. The whole "attitude" of the curve changes from there on.... The "Paleo-ZakStunts" and the "Neo-ZakStunts" ???

Title: Re: Race strength estimation, revisited
Post by: Duplode on January 17, 2021, 03:54:44 PM

Quote from: Cas on January 17, 2021, 07:45:10 AM
I can't help but see a very marked change around C75. The whole "attitude" of the curve changes from there on.... The "Paleo-ZakStunts" and the "Neo-ZakStunts" ???

I can think of a few factors that may cause that, the main one perhaps being the large fields typical of 2002-2006 races. Position prediction errors are more likely with more pipsqueaks involved, and my alternative strategy for computing the NDCG probably exacerbates that. It might be possible to offset that to some extent by further increasing the weights given to higher positions in the NDCG.

Stunts Forum

Stunts - the Game => Stunts Chat => Topic started by: Duplode on December 27, 2020, 10:52:35 AM