Race strength estimation, revisited

Started by Duplode, December 27, 2020, 10:52:35 AM


Duplode

For a long while, one of my favourite Stunts investigation topics has been the evaluation of race strength, be it merely for the enjoyment of historians and pundits, or to inform some kind of spiritual successor to Mark L. Rivers' SWR Ranking. Some of you might even remember my 2012 thread on the matter. Now, after a long time with this project on the back burner, I have made enough progress to feel like posting about it again. So, without further ado, here is a plot of race strengths covering the 236 ZakStunts races so far:



(Attached below is the Excel file this chart belongs to, so you can have a closer look at the data.)

While this chart might look rather like the ones I showed you years ago, there is one major difference: this time, a clearer procedure to obtain the data has led to values that are meaningful on their own. For instance, consider the massive spike you see just left of the middle of the chart. That is ZCT100, whose strength is around 70. According to the model underpinning the calculations, that number means a pipsqueak of Elo rating 1500 (which generally amounts to lower midfield) would, if they joined ZCT100, have a 1 in 70 chance of reaching a top five result on the scoreboard.

The numbers here aren't definitive yet, as I still want to check whether there is any useful tuning of parameters to be done, as well as to figure out how to estimate some of the involved uncertainties. In any case, I believe they look fairly reasonable. Within each season, the ranking of races is generally very sensible. Comparing different eras of ZakStunts is, as one might expect, trickier. In particular, I feel the model might be overrating the 2010 races a little bit. Also, it is hard to tell whether the model underrates races from the first few seasons (2001-2004) as it moves towards a steadier state. Still, the chart does seem to capture the evolutionary arcs of ZakStunts: a steady increase in the level of the competition over the initial years, culminating in the 2005-2006 high plateau, followed by a sharp drop in 2007, and so forth.




I will now outline how this new strength estimation works. Some of what follows might be of interest beyond mere technical curiosity, for parts of the procedure can be useful for other investigations in Stunts analytics.

(By the way, you can check the source code of my program on GitHub, if you are so inclined.)

When I set about resuming this investigation early this year, I decided that, instead of rolling yet another quirky algorithm from scratch, I would start from well-understood building blocks, so that, if nothing else, I would get something intelligible at the end. Balancing that principle with the known limitations of my chosen methods (and there are quite a few of them), I eventually ended up with the following pipeline of computations:


  • From the ZakStunts results, compute Elo ratings at every race.
  • Obtain, from the Elo ratings, victory probabilities against a hypothetical 1800-rated pipsqueak, and use those probabilities to parameterise a rough performance model, which amounts to the probability distribution of lap times (relative to an ideal lap) for a pipsqueak.
  • Add a fictitious 1500-rated pipsqueak to the list of race entrants, and either:

    • Use the performance model to implement a race result simulator, which spits out possible outcomes when given a list of pipsqueaks and their ratings, and run the simulation enough times to be able to give a reasonable estimate of the likelihood of a top 5 finish by the fictitious pipsqueak; or
    • Numerically integrate the appropriately weighted probability density for the fictitious pipsqueak to obtain, as far as the model allows, an exact result for said likelihood (the integral is spelled out just after this list).

(Implicit in the above is that my code includes both an Elo rating calculator and a race result simulator, which can be put to use in other contexts with minimal effort.)
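
To spell the second option out a bit (roughly speaking, and ignoring ties): writing f for the fictitious pipsqueak's laptime density, O for the set of other entrants, and F_j for the laptime distribution function of entrant j (that is, the probability of entrant j being faster than a given laptime t), the quantity being integrated works out to

\[
P(\text{top-}n) \;=\; \int_0^{\infty} f(t) \sum_{k=0}^{n-1} \;\sum_{\substack{S \subseteq O \\ |S| = k}} \;\prod_{j \in S} F_j(t) \prod_{j \in O \setminus S} \bigl(1 - F_j(t)\bigr)\, dt
\]

and it is the inner sum over subsets S that grows combinatorially as n grows.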

Let's look at each step a little closer. When it comes to ratings of competitors, Elo ratings are a pretty much universal starting point. They are mathematically simple and very well understood, which was a big plus given the plan I had at the outset. For our current purposes, though, the Elo system has one major disadvantage: it is designed for one-versus-one matches, and not for races. While it is certainly possible to approach a race as if it were the collection of all N*(N-1)/2 head-to-head matchups among the involved pipsqueaks, doing so disregards how the actual head-to-head comparisons are correlated with each other, as they all depend on the N pipsqueak laptimes. (To put it another way: if you beat, say, FinRok in a race, that means you have achieved a laptime good enough to beat FinRok, and so such a laptime will likely be good enough to defeat most other pipsqueaks.) All that correlation means there will be a lot of redundant information in the matchups, the practical consequence being that a single listfiller or otherwise atypical result can cause wild swings in a pipsqueak's rating. Trying to solve the problem by discarding most of the matchups (say, by only comparing a pipsqueak with their neighbours on the scoreboard) doesn't work well either: since we only have ~12 races a year to get data out of, that approach would make the ratings evolve too slowly to be of any use. Eventually, I settled on a compromise: only using matchups up to six positions away on the scoreboard (in either direction), which at least curtails some of the worst distortions in races with 20+ entrants. Besides that, my use of the Elo system is pretty standard. While new pipsqueaks are handled specially over their initial five races for the sake of fairer comparisons and faster steadying of ratings, that is not outside the norm (for instance, chess tournaments generally take similar measures).
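
To make that scheme concrete, here is a minimal Python sketch of the matchup-based update (purely illustrative and not lifted from my actual code; the K factor of 24 is a placeholder, and the special treatment of new pipsqueaks is left out):

```python
# Elo updates over a single race scoreboard, treating the race as a set of
# head-to-head matchups restricted to nearby scoreboard positions.

MAX_GAP = 6   # only compare pipsqueaks up to six positions apart
K = 24        # placeholder K factor

def expected_score(r_a, r_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def update_ratings(scoreboard, ratings):
    """scoreboard: pipsqueak names ordered from first to last place.
    ratings: dict of current Elo ratings. Returns the updated ratings."""
    deltas = {name: 0.0 for name in scoreboard}
    for i, winner in enumerate(scoreboard):
        for loser in scoreboard[i + 1 : i + 1 + MAX_GAP]:
            e = expected_score(ratings[winner], ratings[loser])
            deltas[winner] += K * (1.0 - e)   # winner scored 1, expected e
            deltas[loser] -= K * (1.0 - e)    # loser scored 0, expected 1 - e
    return {name: r + deltas.get(name, 0.0) for name, r in ratings.items()}
```

For instance, `update_ratings(["FinRok", "Duplode", "dreadnaut"], ratings)` credits FinRok with two wins and dreadnaut with two losses, each scaled by how surprising the result was according to the current ratings.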

Elo ratings are not enough to simulate race results, precisely because of the distinction between a collection of matchups and a single race discussed above. A simulation requires a model of the pipsqueak performance, so that individual simulated results for each pipsqueak can be put together in a scoreboard. One workaround to bridge this gap relies on victory probabilities. It is possible, given Elo ratings for a pair of pipsqueaks, to calculate how likely one is to defeat the other in a matchup. Similarly, if you have the laptime probability distributions for a pair of pipsqueaks, you can calculate how likely it is for one of them to be faster than the other. A few seat-of-the-pants assumptions later, we have a way to conjure a laptime probability distribution that corresponds to an Elo rating. As for the distributions, the ones I am using look like this:



This is a really primitive model, perhaps the simplest thing that could possibly work. It is simple enough that there are victory probability formulas that can be calculated with pen and paper. There is just one pipsqueak-dependent parameter. As said parameter increases, the distribution is compressed towards zero (the ideal laptime), which implies laptimes that are typically faster and obtained more consistently (in the plot above, the parameter is 1 for the blue curve and 2 for the red one). While I haven't seriously attempted to validate the model empirically, the features it does have match some of the intuition about laptimes. (On the matter of empirical validation, one might conceivably drive five laps on Default every day for a month and see how the resulting laptimes are spread. That would be a very interesting experiment, though for our immediate purposes the differences between RH and NoRH might become a confounding factor.)
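
To give a flavour of it in code, here is what a model of this kind can look like (using, for the sake of a concrete example, an exponential distribution for the excess over the ideal lap; treat it as a stand-in with the right qualitative features rather than a faithful copy of my model):

```python
import numpy as np

# Illustrative one-parameter laptime model: the excess over the ideal lap is
# exponentially distributed with rate k. Larger k compresses the distribution
# towards zero, i.e. faster and more consistent laps.

def laptime_pdf(t, k):
    """Density of the excess laptime over the ideal lap (t >= 0)."""
    return k * np.exp(-k * t)

def victory_probability(k_a, k_b):
    """P(A is faster than B): a pen-and-paper result for exponentials."""
    return k_a / (k_a + k_b)

def rate_from_elo(rating, ref_rating=1800.0, ref_rate=1.0):
    """Choose k so that the model reproduces the Elo victory probability
    against the hypothetical 1800-rated reference pipsqueak (whose own
    rate is fixed, arbitrarily, at ref_rate)."""
    p = 1.0 / (1.0 + 10.0 ** ((ref_rating - rating) / 400.0))
    return ref_rate * p / (1.0 - p)
```

With a model of this shape, the Elo-to-parameter bridge described above becomes the one-liner `rate_from_elo`, and curves with parameters 1 and 2 are just `laptime_pdf(t, 1)` and `laptime_pdf(t, 2)`.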

Having the laptime distributions for all entrants in a race makes it possible to write down a formula that can be used to, in principle, numerically compute victory and top-n probabilities against that race's set of pipsqueaks. In practice, it turns out that victory probabilities aren't a good race strength metric, as the results tend to be largely determined by a small handful of pipsqueaks with very high ratings. To my eyes, the top-5 probabilities are at the sweet spot for strength estimations. I originally believed calculating the probabilities by numerical integration would be too computationally expensive (as the number of integrals to be numerically calculated grows combinatorially as the n in top-n grows), so I used the alternative strategy of simulating the races and afterwards checking how often top-5 results happen. The chart at the top of the post was generated with 100,000 runs per race, that is, 23,600,000 runs to cover all ZakStunts races, which took fifteen and a half minutes to perform on my laptop. Later, I figured out that, with sufficiently careful coding, the numerical method, which has the advantage of giving essentially exact results, is feasible for top-5 probabilities; accordingly, an alternative Excel file with those results is also attached. (The simulations remain useful for wider top-n ranges, or for quickly obtaining coarse results with 1,000 to 10,000 runs per race while tuning the analysis parameters.)
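
As a rough illustration of the simulation strategy (again with the exponential stand-in from the previous sketch, and none of the conveniences of the actual program), the core of it amounts to something like this:

```python
import numpy as np

def rate_from_elo(rating, ref_rating=1800.0, ref_rate=1.0):
    # Same Elo-to-parameter bridge as in the previous sketch.
    p = 1.0 / (1.0 + 10.0 ** ((ref_rating - rating) / 400.0))
    return ref_rate * p / (1.0 - p)

def top_n_probability(entrant_ratings, probe_rating=1500.0, n=5,
                      runs=100_000, seed=0):
    """Estimate, by simulation, how likely a probe pipsqueak with the given
    Elo rating is to finish in the top n against the listed entrants."""
    rng = np.random.default_rng(seed)
    rates = np.array([rate_from_elo(r) for r in entrant_ratings])
    probe_rate = rate_from_elo(probe_rating)
    # Sample excess laptimes for every run: exponential with mean 1/rate.
    field_laps = rng.exponential(1.0 / rates, size=(runs, len(rates)))
    probe_laps = rng.exponential(1.0 / probe_rate, size=runs)
    faster = (field_laps < probe_laps[:, None]).sum(axis=1)
    return (faster < n).mean()

# The strength figure used in the chart is the reciprocal of that probability,
# e.g. strength = 1.0 / top_n_probability([2100, 1950, 1900, 1850, 1700, 1650])
# for a made-up six-entrant race.
```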




To turn the discussion back to sporting matters, the troubles with wild rating swings and outlier results alluded to above brought me back to the question of listfillers, already raised by Bonzai Joe all those years ago. Left unchecked, a particularly weak listfiller in a busy race can wreak havoc upon the Elo rating of its unfortunate author. That ultimately compelled me to look for objective criteria according to which at least some of the obvious listfillers can be excluded. For the current purposes, I ultimately settled on the following three rules (the first of which is sketched in code just after the list):


  • Results above 300% of the winning time and more than two standard deviations away from the average of laptimes are to be excluded. (The Bonzai Joe rule.)
  • GAR and NoRH replays are only counted if the fastest lap on the parallel scoreboard they belong to is, or would be, above the bottom quarter (rounding towards the top) of the scoreboard. (The Marco rule.)
  • For our current purposes, a car is deemed "competitive" if it can be found above the bottom quarter (rounding towards the top) of the scoreboard, or if it was used to defeat a pipsqueak using a competitive car whose lap was not excluded according to the previous two rules. Only laps driven with competitive cars count. (The Alan Rotoi rule.)
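
For the record, here is a rough Python rendering of the first rule (illustrative only, not the code actually used; laptimes taken to be in seconds):

```python
import statistics

def bonzai_joe_filter(laptimes):
    """Drop laps that are both above 300% of the winning time and more than
    two standard deviations away from the average laptime.
    laptimes: dict mapping pipsqueak -> laptime in seconds."""
    winning = min(laptimes.values())
    mean = statistics.mean(laptimes.values())
    sdev = statistics.pstdev(laptimes.values())
    return {
        name: t
        for name, t in laptimes.items()
        if not (t > 3.0 * winning and abs(t - mean) > 2.0 * sdev)
    }
```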

These rules were applied to the full list of ZakStunts race entries that I'm using (that happened on one of those quarantine days back in May). Disqualified race results were also removed; there were a few curious findings in that respect I should write about one of these days. (By the way, ghosts are not included in the calculations, regardless of what their race entries look like.)

(A footnote: the bar for applying the first rule above looks, at first, incredibly low. I considered using a lower percentage, like 250%, and would rather not have needed the frankly bizarre additional standard deviation condition. It turns out, however, that ZCT029, a difficult dual-way full Vette PG track from 2003, had an extraordinarily broad spectrum of laptimes, including several pipsqueaks with non-listfiller laps beyond the 300% cutoff which would have been excluded without the standard deviation test. Faced with such a peculiar scoreboard, I opted to err on the side of circumspection.)

It remains a tall order to find objective criteria to discard listfillers that won't exclude too many proper competitive laps as a collateral effect. Ultimately, if we were to establish new pipsqueak rankings, I suspect different use cases would call for different kinds of ratings. An Elo-like ranking is appropriate for race strength estimations, simulations and predictions, when what is needed is a picture of how well someone is racing at a specific moment in time. For comparing performances within the last several months, though, a ranking of weighted (for instance, by recency or race strengths) race scores within a time window, in the style of SWR, might prove more appropriate. With this kind of ranking, it becomes reasonable to have, for instance, ZakStunts-style worst-results discards, which could definitely help with dealing with listfillers.

Anyway, by now I probably should stop rambling, at least for a little while :D Questions, comments, criticism, suggestions about the metric and ideas on cool stuff to do with those algorithms are all welcome!

dreadnaut

My main takeaway is that I won the races with the lowest scores :P

Duplode

Quote from: dreadnaut on December 27, 2020, 01:25:16 PM
My main takeaway is that I won the races with the lowest scores :P

I think you were looking one race ahead in that crowded chart. Let's zoom into 2017 and 2019:





According to the model, both Z194 and Z210 were the fifth strongest races in their respective seasons, with Z194 being the strongest race in the second half of 2017.

Cas

Wow!  This blew my mind!  I had been thinking about something like this... Correction, definitely not like this, not this elaborate!  But I had been thinking about the same matter in the past. I didn't get to the part of running experiments or doing any actual math, although I did have in my mind (usually while in the shower or while walking) a mathematical fight over how to determine two things. The analysis here is so good that what I'm going to say probably won't add much, but in case it might help in any way, here it is:

I wanted to give tracks and pipsqueaks numbers. In the case of the track, there should be a theoretical best possible lap, even in free style. The possibilities are so huge that we can't find this value, but we can know it has to exist. Now, for a human, theory suggests that while possible, this lap is unobtainable, because there are so many lower laps with higher probability. But if you go down, at some point, you'll find the next best thing: the lap the best pipsqueak is most likely to achieve after the duration of the race if he is allowed to try his best (and he does). Again, super hard to obtain... but this one could be estimated if we knew the proportion of how good each pipsqueak is relative to the "best pipsqueak" and how well they perform on the track when given the same time and when we know they are trying their best.

This leads to the other thing I wanted to calculate: how good a pipsqueak is, in numbers. This is complex because pipsqueak quality is dynamic and because tracks don't just vary in length, but some stunts may be easier for some pipsqueaks and others for others. But I think what has the greatest effect is the evolution of a pipsqueak. A pipsqueak may start participating and not get such good results, until he understands what the thing is about, becomes interested, and that's when he starts trying his best. This will not necessarily happen at the beginning. A pipsqueak could spend years just posting to fill the scoreboard and one day decide to take it seriously. When he tries his best, he'll start improving quickly and will get to a maximum. After some time, even his first try on a race will be much better than before this improvement, so even when he doesn't have much time, his posting will always be reasonable. From this point on, his laps will fluctuate depending on how much time or interest he had during that race, almost always staying above that lap time and below the maximum he obtained in the previous period. Of course, seeing that your lap time has been superseded by somebody else usually pushes you to try a better lap, unless the difference is too much and you don't think you can do it. So, in my opinion, the most stable point to qualify how good a pipsqueak is is not that of his best results, but that lower "mesa" after his peak. It seems to me that, to compare pipsqueaks, we should compare their first replays on a track, performed relaxed. When they try their best, it can vary much more.

It would be interesting to make a test in R4K... a completely "blind race", in which the results of replays are never posted and you only see the whole scoreboard at the end. As if every day were a quiet day. Then compare that to a similar track on a regular race. Another test could be to announce a race with time for everybody to get ready, but only give them a couple of days to post replays; the results there would perhaps be similar to those of Le Stunts races, or at least moderately proportional.

Alright, this is brainstorming as usual. I hope some of what I wrote will be of use. There's no math there, so it's not as robust, but when one speaks and speaks, something has to be useful.
Earth is my country. Science is my religion.

Duplode

Quote from: Cas on December 27, 2020, 07:50:50 PM
I wanted to give tracks and pipsqueaks numbers. In the case of the track, there should be a theoretical best possible lap, even in free style. The possibilities are so huge that we can't find this value, but we can know it has to exist. Now, for a human, theory suggests that while possible, this lap is unobtainable, because there are so many lower laps with higher probability. But if you go down, at some point, you'll find the next best thing: the lap the best pipsqueak is most likely to achieve after the duration of the race if he is allowed to try his best (and he does). Again, super hard to obtain... but this one could be estimated if we knew the proportion of how good each pipsqueak is relative to the "best pipsqueak" and how well they perform on the track when given the same time and when we know they are trying their best.

Ah, the mythical perfect lap... I had actually been thinking about this matter back in May, and now that I understand the problem a little better I will have another look at it. My plan would be roughly like this: suppose we have a race scoreboard with laptimes, plus the performance model for the pipsqueaks. If we trust the models (and that's a big if!) and can guess where the laptime of each pipsqueak falls in their probability distribution curve (maybe by speculating about how well they have raced that month by their standards), we can plot the laptimes against the positions along the curve and estimate the ideal laptime by doing a linear regression. Later I will find a ZakStunts race I know enough about to attempt that.
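
To sketch the mechanics with entirely made-up numbers (and once again leaning on an exponential-style curve as a stand-in for the performance model):

```python
import numpy as np

# Made-up example of the regression idea. The laptimes, the guessed quantiles
# (how well each pipsqueak raced that month, by their own standards) and the
# per-pipsqueak model parameters are all invented for illustration.
laptimes  = np.array([90.3, 91.1, 90.8, 93.1, 97.1])   # observed laps (s)
quantiles = np.array([0.15, 0.40, 0.30, 0.55, 0.70])   # guessed positions on the curve
rates     = np.array([2.0, 1.5, 1.2, 0.8, 0.5])        # model parameter per pipsqueak

# Under the exponential stand-in, the excess over the ideal lap at quantile q
# is -ln(1 - q) / k, so each laptime should be roughly
#   laptime = ideal + scale * modelled_excess
# and the intercept of a linear fit estimates the ideal lap.
modelled_excess = -np.log(1.0 - quantiles) / rates
scale, ideal = np.polyfit(modelled_excess, laptimes, 1)
print(f"estimated ideal lap: {ideal:.1f}s")
```

With these numbers the fit lands just below the fastest observed lap, which is at least the right kind of behaviour; whether the whole scheme survives contact with a real scoreboard is another matter.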

On the rankings, you are certainly right in that there are lots and lots of confounding factors, and some of them, like RH optimisation effort, are extremely difficult to quantify (unless the pipsqueak keeps a diary!). If you look closely at what happens on a track, other difficult questions appear (for instance: how meaningful is it to compare laptimes obtained using different racing lines?). Any attempt at an all-time ranking based upon scoreboard data will run against this sort of limitation.

If I were to propose an all-time ranking based on Elo ratings, taking each pipsqueak's highest rating looks appealing at first, because it is easy to figure out, and also because it reflects more than a single moment in time (while Elo ratings are primarily a measure of current performance levels, some information about past results does go into the number). Still, your thoughts about the "mesa" and steady performances are interesting. I wonder how those could be measured (moving averages? uncertainty of ratings?). Perhaps patterns along which ratings evolve are in themselves worthy of investigation. (This discussion, by the way, reminds me a bit of the fabulous F1Metrics model, which kind of takes a hybrid approach, by using average scores over the best three consecutive years to build an all-time ranking.)

Cas

We should make a study. We have abundant information from previous ZakStunts races and R4K is now starting to accumulate a reasonable amount as well. OWOOT vs FreeStyle is a very interesting comparison to make in this regard!

I am not very familiar with measurements and statistics outside Stunts and that really is a bad thing. I should be more educated on this so that I can add better value to the topic. Yet, I do have some instinct that saves me sometimes. My feeling is that we should start by analysing the fluctuations in the performance of a pipsqueak through time and do the same with other pipsqueaks and try to figure out structures like the ones I described that really are my guesses, my feelings. It'd be good to see how true or not they are and if there are others instead. With that knowledge, we could begin to analyse a race in a different way: knowing that each pipsqueak participating in it is not just a pipsqueak, but a pipsqueak at a certain point of its evolution, with that point added as a parameter.

Another thing that probably would be useful comes from what you just mentioned... changing the line. It'd be interesting to generate "scoreboards" for partial races. That is, a scoreboard at the first half minute, at the first minute, at the minute and a half, and so on. Or, instead of making it space-based, make it time-based: a scoreboard after the first 5 days of the race, another after 10 days, etc. There will be missing pipsqueaks in some of them, but we can complete the thing. We could also make the same experiment but counting time not from the beginning of the race, but from each pipsqueak's first posted lap. This assumes a pipsqueak will probably first post on the same day he first sits down to try the race.

This track/pipsqueak problem sounds to me like a two-unknown, two-equation system in which unknowns are hyper-complex vectors and the equations are non-linear.
Earth is my country. Science is my religion.

Daniel3D

I've been reading this with interest. And it is beautiful, I think I understand most of it. But it made me wonder. In Cas's Bliss there is a track analysis that can generate estimated lap times from a number of racers.

Can that be adapted to generate a difficulty modifier for the race strength estimation?
Edison once said,
"I have not failed 10,000 times,
I've successfully found 10,000 ways that will not work."
---------
Currently running over 20 separate instances of Stunts
---------
Check out the STUNTS resources on my Mega (globe icon)

Cas

Well, Duplode is much more educated than I am about statistics and data analysis and I must admit I understood a good portion of his work on this topic, but not all, so it all depends. As I understand it, "race strength" is a concept that has to do with the race, not the track, so for Bliss to be able to do anything like this, it would have to access scoreboard information for races. The Track Analysis menu option actually works with tracks, so it's a different monster.

In theory, I could add a completely new menu dedicated to different kinds of analysis about races and pipsqueaks or this could be a sub-menu of the Tournament menu, but doing this would add so much code to the project that it probably makes more sense to develop it separately. Besides, at this point, my insight is very vague on the topic. I would need a very strong and reliable knowledge and understanding of these statistics to be able to do something useful with them. But who knows?  Maybe if we first get to elaborate more on this and discover something new, eventually Bliss, or another tool, could fulfill this purpose :)
Earth is my country. Science is my religion.

Duplode

Quote from: Daniel3D on December 28, 2020, 08:52:28 AM
I've been reading this with interest. And it is beautiful, I think I understand most of it. But it made me wonder. In Cas's Bliss there is a track analysis that can generate estimated lap times from a number of racers.

Can that be adapted to generate a difficulty modifier for the race strength estimation?

Thank you  :) On the track analysis, it is pretty much as Cas says: as I'm currently doing it, the race strength analysis abstracts away all information about tracks, cars, racing lines and so forth. There might be ways to incorporate at least a little of that knowledge -- as an off-the-cuff example, a plausible experiment would be tweaking the Elo parameters so that, say, powergear or IMSA races have extra influence over the ratings. Still, in general it can be a challenge to define such variables in a way that isn't too subjective and then to quantify them in an effective way. Even something as intuitive as the notion of track difficulty can be tricky to translate to a race analysis context. For instance, ZCT086 is one of the simplest tracks ever raced in ZakStunts, and yet this very simplicity made figuring out places where tenths could be gained a challenge of its own, resulting in a pretty hard race with some very strong replays.

Quote from: Cas on December 28, 2020, 06:42:06 AM
My feeling is that we should start by analysing the fluctuations in the performance of a pipsqueak through time and do the same with other pipsqueaks and try to figure out structures like the ones I described that really are my guesses, my feelings. It'd be good to see how true or not they are and if there are others instead. With that knowledge, we could begin to analyse a race in a different way: knowing that each pipsqueak participating in it is not just a pipsqueak, but a pipsqueak at a certain point of its evolution, with that point added as a parameter.

I, for one, will try to keep an eye on that sort of thing when I look at the data. (While I didn't say much about the ratings in the opening post here, the program can already do quite a few things with the data, including extracting the rating history of a selected pipsqueak.)

(On the topic of uncertainty, it is worth mentioning that there are more sophisticated systems that try to keep track of it, so that, for instance, pipsqueaks that took part in a lot of recent races are given more stable ratings, on the grounds of uncertainty being presumably reduced by activity. I briefly tried Glicko, which is one of those systems, back in May, but either I didn't figure out how to tune its configuration parameters correctly or the quirks of the ZakStunts dataset -- the free-for-all problem I mentioned earlier, the relatively low number of events per year, the fluctuations in pipsqueak activity -- make it hard to effectively calculate uncertainties for that purpose.)

Cas

It does sound super complicated, and I figure Free-style makes it even harder, because on one hand, "track length" is a very ambiguous concept and cannot be utilised to estimate optimal laps out of real laps in this style. And on the other hand, the "virtual track length" as understood at a point of the race may suddenly change when a pipsqueak figures out a trick and posts a replay that's made public, or even worse, if it is not public, one or maybe a few pipsqueaks will know the trick and others won't, so there will be two different theoretical optimal laps. That is, there will be two races running in one scoreboard. The more I think of this, the harder it seems to analyse.

Needless to say, if it's useful to you, I could pack all the data from R4K and send it to you. That is, the scoreboards for all races, for example, in their native text format so that you can process them quickly.
Earth is my country. Science is my religion.

Duplode

Quote from: Cas on December 29, 2020, 03:55:13 AM
Needless to say, if it's useful to you, I could pack all the data from R4K and send it to you. That is, the scoreboards for all races, for example, in their native text format so that you can process them quickly.

Yup, please send them; it would be nice to feed them to the program and see what happens :) The races with mixed cars and rules shouldn't cause problems, as with the full scoreboard data we can figure out programmatically which results to include.

Cas

Alright. Here it is. It also contains the replays. What it does not include is the few most recent races, because those are only on the server so far (I still haven't backed them up on my computer), and at this moment KyLiE is upgrading the server, so the site will be offline briefly. But this is a lot of info already. I can add the other two races later.

Look at the thisrace.sb files. These contain the scoreboard. The format is probably not exactly what you were expecting, but when you see it, you'll understand it immediately. Very easy to parse. It includes links to replays that are not participating in the race, like superseded and rejected ones, so you may want to do some filtering when processing the data.
Earth is my country. Science is my religion.

Duplode

Thanks for the files, Cas; once I get to perform the calculations with the R4K data I will post the results here.

On another note, I realised that, contrary to what I thought at first, it is feasible, in terms of how long the computations take, to integrate the probabilities and obtain exact results (as far as the numerical integration allows), at least for top-5 strength. I have added a second Excel file to the initial post with those exact results. (I didn't replace the chart on the post because the results are qualitatively similar, so you won't see much difference unless you zoom in.)

Duplode

As promised, here is a strength chart for R4K (the spreadsheet it came from is attached):



A couple of notes:

  • These strengths are based on top 3 probabilities, rather than top 5 as in the ZakStunts charts, to better fit the typical field size.
  • R4K has combined RH+NoRH scoreboards. Accordingly, I have used the best lap of each pipsqueak regardless of modality to obtain the race classifications. Were we publishing the individual Elo ratings, it would arguably be fairer to separate RH and NoRH. In the context of a strength metric for a contest with combined scoreboards, though, I'd say it is not unreasonable to keep them together.

Overdrijf

Just knowing I have a Stunts Elo now makes me nervous about losing points off it.  ;)