Unstable replays at ZakStunts

Started by Duplode, April 10, 2023, 02:48:47 AM

Friker

Hm.. there is a good argument that what a person sees on the evaluation screen should be valid. Then again, if it is a result shown only on somebody's machine and not reproducible anywhere else, should it be valid? In a very extreme case (say, after memory manipulation), relying on it as a verification tool could end badly.
Because we do not know what is causing the inconsistencies, I would propose a validation procedure like the one Overdrijf proposed: if a replay finishes successfully in any way on (let's say) 3 independent machines (pipsqueaks from 3 different teams), it should be valid. And apply this only to questionable replays (someone brings one up, let's say, at most 3 days after the race).
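A minimal sketch of the rule proposed above, written in Python purely for illustration. The thresholds (3 teams, 3 days) come from the post; the names `Check`, `is_valid` and `needs_review` are hypothetical, and counting distinct teams is one assumed reading of "3 independent machines".

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Numbers taken from the proposal; all names are hypothetical.
REQUIRED_TEAMS = 3                    # successful runs from 3 different teams
CHALLENGE_WINDOW = timedelta(days=3)  # questions raised at most 3 days after the race


@dataclass(frozen=True)
class Check:
    tester: str
    team: str
    finished: bool  # did the replay reach the finish in any way on this machine?


def needs_review(race_end: date, challenged_on: date) -> bool:
    """Only replays questioned within the window go through the extra checks."""
    return challenged_on - race_end <= CHALLENGE_WINDOW


def is_valid(checks: list[Check]) -> bool:
    """Valid if successful runs come from at least REQUIRED_TEAMS distinct teams."""
    teams = {c.team for c in checks if c.finished}
    return len(teams) >= REQUIRED_TEAMS
```

Counting teams rather than testers means two successful runs by members of the same team only count once.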

How often does this happen? Once every year or two? Agreeing on a difficult validation process does not seem worth it to me.

Quote from: Frieshansen on April 10, 2023, 10:00:45 PM
My problem with DOSBox is that it plays some replays differently on different computers despite having the same version and config. That's why it's out of the question for me for a predictable system.
I've never heard about this. Do you have such a replay/config?

Quote from: Duplode on April 11, 2023, 11:53:16 AM
Firstly, the result of such a validator is not truer than the in-game experience of the pipsqueak: as per my argument above, the evaluation screen rules supreme, and the accuracy of a validator is a matter of how well it can reproduce it.
It is not any truer, but from what I have seen (and it seems others have experienced it too), it is no less useful than the in-game experience, and having something independent seems like a step forward.
Quote from: Duplode on April 11, 2023, 11:53:16 AM
Secondly, to the best of my knowledge a repldump validator will give the same results as the streamlined verification procedure that I'm proposing.
I would love to be sure that this is true. That's why I would like to see, understand and debug the code/program. I assume that when you start a new drive and start driving, you get the same result as when RHing and replaying the file with fast-forwarding. I would like to have that confirmed.

Because of that assumption, I think that when you press Play while replaying, something different is going on than when you start driving from the menu.

Daniel's comment about driving with opponents triggered a memory supporting my assumption: I think there were cases when re-Playing a replay (saving a replay, loading it, rewinding to the start and pressing Play) with an opponent ended in a premature crash. But I/we did not bother to rewind a little to see if it was "for real". Maybe that is another case where simply Playing a replay is different from driving. Also, I think the opponent's behavior is deterministic (as a developer, you would want it to be deterministic when you have a replay system), which is another thing it would be great to verify.



Duplode

Quote from: Friker on April 11, 2023, 01:06:00 PM
Hm.. there is a good argument that what a person sees on the evaluation screen should be valid. Then again, if it is a result shown only on somebody's machine and not reproducible anywhere else, should it be valid?

In practice, it is certain that if a replay shows up as complete with my proposed verification procedure (or with repldump), the pipsqueak saw the evaluation screen in the expected way. The odds against the opposite possibility are astronomical.

Quote from: Friker on April 11, 2023, 01:06:00 PM
How often does this happen? Once every year or two? Agreeing on a difficult validation process does not seem worth it to me.

My proposal is the simplest reliable validation procedure there is. It is done entirely in-game, and takes around 20 to 30 seconds per replay. (Doing it with repldump would be more complex, since it involves an external tool, but it could be even easier for the manager, as it can be automated.)

(Also, I'd say once per season is a lot for something that can remove people from races, and which ultimately is a fairness issue.)

alanrotoi

I don't want to make decisions based only on my preferences, but on what is best for the community.

On one hand, I'm usually affected by this rule, and I have found hours of work on a replay in the trash can. On the other hand, there must be a limit on which replays are admitted, so as not to give extra work to the manager. There should be a solution somewhere in between.

Friker

Quote from: Duplode on April 11, 2023, 05:37:04 PM
My proposal is the simplest reliable validation procedure there is. It is done entirely in-game, and takes around 20 to 30 seconds per replay.
I agree and fully support this, if (and only if) we can prove that it is the same thing as driving from the menu, or at least that it is the most coherent option in case everything (a new drive, re-Playing a replay, fast-forwarding a replay, RH handling) behaves differently. But first I would like to know for sure; I would like to check the code.

Repldump can still be a valuable tool, similar to RPLInfo, for verification before posting. In fact, Repldump validation could be a section in the RPLInfo output (another feature request maybe? ;) ). :)

Cas

About the rules
I really hadn't noticed, or didn't remember, that for unstable replays the view at the last frame when you just open one is normally correct. I kind of "remembered" that even when you just opened it, it looked crashed. Now, if it really is the case that this works well most of the time, then I definitely agree that we should try this as a new rule, because it is no doubt a very simple one to use. Because I'm not very skilled at PG and my replays aren't the fastest, I don't come across unstable replays of my own very often, but it has happened to me.

About "why" there are unstable replays: my thoughts
I have gone through the messages in this thread rather quickly, so please excuse me if there's something crucial I haven't seen. It caught my attention that DOSBox may sometimes play the same replay differently with the same version. I'd like to see that, because it is something I certainly cannot explain right now.

On the other hand, I can give you an educated guess on what I think causes unstable replays, from a programmer's perspective, assuming DOSBox (same version on the same computer) is deterministic. My view is that this is a problem of quantisation. We see this often in digital music, in MIDI in particular. Stunts runs a loop, and every iteration of that loop does a few things. When everything it has to do is complete, it waits until the frame time is used up (1/20th of a second minus the time already elapsed during the current iteration); then it proceeds to the next iteration.
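The loop structure described above can be sketched as follows. This is a hypothetical Python illustration of a fixed-timestep loop at 20 frames per second, not Stunts' actual code; `step`, `run_loop` and `remaining_wait` are made-up names.

```python
import time

FRAME = 1 / 20  # frame length from the post: 1/20th of a second


def remaining_wait(elapsed: float, frame: float = FRAME) -> float:
    """Idle time left after the active part of an iteration took `elapsed` seconds."""
    return max(0.0, frame - elapsed)


def run_loop(step, frames: int, frame: float = FRAME) -> None:
    """Call step(i) once per frame, then wait until the frame time is used up."""
    for i in range(frames):
        start = time.monotonic()
        step(i)  # input, physics, drawing, ...
        elapsed = time.monotonic() - start
        time.sleep(remaining_wait(elapsed, frame))
```

When the active part of an iteration takes longer than a frame, the wait drops to zero and the loop simply runs late, which is one way the speed of the machine can leak into timing.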

The problem here is that two crucial tasks occur during the first part of the loop: one is the moment when the input configuration is gathered, and the other is the moment when this information is stored. When the input comes from a replay, there should be consistency, because the same input configuration is assumed to rule during the whole frame. But when you're playing live, such as when you're recording a replay, the keys you're pressing will be recorded at some point that is neither the beginning nor the end of the iteration. When this occurs depends on how fast the computer is, because the length of the active portion of the iteration varies. But what if you pressed the braking key just a moment before it was recorded, or just before the end of the iteration?

Every time you convert a continuous flow of inputs into a discrete sequence, you quantise, and you add an error. This can't be prevented. What the game should do is use the quantised results, rather than the live input, to determine the physics. But it doesn't do that, presumably because most of the time the result is pretty similar.
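To make the quantisation hypothesis concrete, here is a toy Python model; all names and numbers are hypothetical. Physics and the recorder each read the keyboard at a different fraction of the iteration. If physics uses its own live reading, the recorded tape can disagree with what was actually simulated, and replaying the tape diverges. Whether Stunts really reads the key state twice like this is exactly what is disputed in this thread.

```python
def step(pos: int, speed: int, braking: bool) -> tuple[int, int]:
    """Toy physics: braking removes 2 speed units, otherwise the car gains 1."""
    speed = max(0, speed - 2) if braking else speed + 1
    return pos + speed, speed


def live_run(frames: int, press: float, read_at: float, record_at: float,
             physics_uses_recorded: bool) -> tuple[list[int], list[bool]]:
    """Drive live with the brake key pressed at continuous time `press` (in frames).

    read_at and record_at are the fractions of each iteration at which physics
    and the recorder look at the keyboard. Returns (positions, recorded tape).
    """
    pos, speed, path, tape = 0, 5, [], []
    for i in range(frames):
        recorded = i + record_at >= press  # what gets written to the replay
        live = i + read_at >= press        # what physics would see right now
        tape.append(recorded)
        pos, speed = step(pos, speed, recorded if physics_uses_recorded else live)
        path.append(pos)
    return path, tape


def replay_run(tape: list[bool]) -> list[int]:
    """When replaying, the recorded input is the only input."""
    pos, speed, path = 0, 5, []
    for braking in tape:
        pos, speed = step(pos, speed, braking)
        path.append(pos)
    return path
```

For example, with the key pressed at t = 3.5 frames, physics reading at 0.8 and the recorder at 0.2 of each iteration, the live run brakes one frame earlier than the tape says, so `replay_run(tape)` differs from the live path; with `physics_uses_recorded=True` the two always match.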

Now, this wouldn't explain why the replay looks good when you just open it. Here I have to guess that the "super-fast-forwarding" routine that places the pointer at the end of the replay when you open it has a loop that's not quite the same as the normal playback one (which is also the one used when you fast-forward with the controls). Maybe frames are quantised at the beginning of the iteration in one and at the end in the other? Anyway, although it's not the same situation as live recording, I'm pretty sure the problem must have to do with quantisation.

Repldump verification might or might not work, depending on whether the loop used by the repldump routine matches the one used for normal playback or the fast playback done when a replay is first loaded. We have to try and see. It'd be very interesting to know.

Overdrijf

In my ideal world, any replay you can spend hours on without ever realizing it might be crooked would be seen as valid. As such, my first suggestion for a rule about this would be "any replay that can be made to play in any way using Stunts 1.1 or a fully compatible program is counted as valid".

Other people have raised valid points. It would be nice if the race administrator didn't regularly have to spend an hour trying to get a replay to work. I've also learned from Argammon that, under some circumstances, there is apparently a way to use replay handling to intentionally reset to the exact moment of certain high-speed powergear crashes and continue from the frame of the crash. I didn't know that. Sure, that one borders on cheating, even if it is very situational and probably requires quite some driving skill to use well. I am therefore open to any compromise that reduces the administrator's time spent, reduces the possibility of anything that may be considered cheating*, or helps in any other way. But my basic stance on the matter is that this is RH Freestyle, we drive on the edge, and we regularly sort of break the game in various ways to get the fastest times. Breaking a replay while doing that is just the risk we take, and I'd be happy if as many questionable but ultimately fairly raced replays as possible could be counted as valid.



*(On the other hand, the Freestyle RH format was born because cheating through replay handling was undetectable. There is something to be said for allowing as many techniques as possible in the format, even if they look a lot like driving straight through a crash...)

dreadnaut

Quote from: Duplode on April 11, 2023, 06:21:08 AM
In particular, if there is a simple validation procedure that doesn't report invalid laps as valid, and reports valid laps as invalid far less often than the current procedure, we should switch to using it -- ergo, my proposal.

Props for defining unstable replays as "valid", and then arguing that we should not invalidate "valid" replays ;D Ready for politics!

But although well explained, this is not The Issue. The real problem is human: we don't like that pipsqueaks work hard on a replay, but it ends up in the bin. This is what this rule change is about.


Now, let's define unstable replays as replays that load in a crashed state, or load in an OK state but play to a crash. Both are a thing, and I don't have statistics on whether one kind is easier to produce than the other.

My position is that invalidating unstable replays affects only a few high level pipsqueaks, and rarely, but accepting unstable replays affects everyone else, today and tomorrow:

- it is difficult to rewind / replay an unstable replay in a way that reaches the end
- the above is most problematic with "broken at loading" replays, where we have to argue for validity, and is a cause of arguments and unhappiness (see past races)
- they cannot be replayed without knowing what to do, reducing their archival value, and...
- they are frustrating for beginners (powergear AND extra bugs!)
- you cannot make videos out of them
- (any automated tool we might create in the future will have to deal with them)

Pro pipsqueaks are already used to failure and repetition, and I thought they could deal with the extra difficulty. You can drive a high level powergear lap? Sorry, there's an extra validation for you. It's easy and you can do it on your computer. I have been that driver: powergear is the only trick that I can perform at 'gold' level.


Looking at the problem again (invalidating replays is cruel), I would say: yes, it's true! Sorry, the choice is between a cost for the individual and a cost for the community.

From this point of view, I am not interested in high tech solutions to validation. Spending time reviewing replays is not really a problem — I get to see a lot of cool stuff! What I'm interested in is "Can we reduce the cost for the community?" or "Is the community happy to accept the cost?"


Friker

Cas, I am sorry, but I completely disagree with your points. It would be a really big oversight if a variable (which keys are held down) that is used in at least two places (computing the car's next position and saving the replay string) could be modified in the middle of the process. If the variable is only modified outside this critical section, all is fine (because nothing is being computed during that portion of time). Furthermore, your points do not explain the discrepancy between fast-forwarding and playing a replay. But.. all of that text and Frieshansen's magic DOSBox behavior got me thinking..

And I did some testing. First of all, there is a sub-menu "Options" -> "Set graphic level" which contains some magic text and options I do not understand. I had tried playing with them in the past, with no change. But after reading the last comments, I tried lowering DOSBox's CPU speed to cycles=fixed 2500 (Stunts visibly stutters). Lo and behold, playing a replay with the Play button (or the Play Fast button) produces the same result as fast-forwarding!!! !!! !!! I was pleasantly surprised. Then I tried a pretty high cycle speed (150000), so my machine does not catch up: same result as at normal speed, de-sync and crash. Then I tried lowering the cycles gradually, and the magic happens: somewhere between 4000 and 5000 cycles there is a magic area where a replay plays "correctly" (that is, does not crash) or fails depending on the graphical settings. Higher graphical settings mean the simulation is not catching up and is skipping some drawing routines, maybe. That leads me to the conclusion that when there is a pause between frames, Stunts' drawing routine somehow messes with the "world's state (car rotation/car position/random number generator function/whatever)". Maybe it is enough for the drawing routine to finish once, maybe twice, maybe it corrupts the state after running 100 times (not to get into politics, but changing a government is also considered a good thing).
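The hypothesis above can be illustrated with a toy Python model. It assumes (unverified) that a drawing routine, run only when a frame leaves spare time, writes a rounded value back into the car's state; `Car`, `simulate` and the 16 sub-units per step are all hypothetical. The point is only that, under such a bug, the outcome depends on how often drawing ran, and therefore on emulated CPU speed and graphic settings rather than on the inputs alone.

```python
SUBUNITS = 16  # hypothetical fixed-point precision: sub-units per displayed angle step


class Car:
    def __init__(self) -> None:
        self.angle = 0  # heading, in sub-units

    def physics(self, steer: int) -> None:
        self.angle += steer

    def draw(self) -> None:
        # Hypothetical bug: the renderer snaps the heading down to whole display
        # steps for drawing and writes the snapped value back into the state.
        self.angle -= self.angle % SUBUNITS


def simulate(steering: list[int], drawn: list[bool]) -> int:
    """Run physics every frame; draw only on frames that left spare time."""
    car = Car()
    for steer, has_time in zip(steering, drawn):
        car.physics(steer)
        if has_time:
            car.draw()
    return car.angle
```

With steering of 6 sub-units per frame over 5 frames, a run that never draws ends at 30, one that draws every frame ends at 0, and one that draws only on the first frame ends at 24.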

These observations somehow fit my idea of what is/should be "a correct/valid replay". I am repeating myself, but I think the only "weird/incorrect" behavior is produced by playing a replay with the "Play" button (and, I can now add, at a high-enough cycle speed), and that 1. driving a new drive, 2. replaying a replay with fast-forward, 3. playing a replay with the Play button at a low-enough cycle speed (and 4. repldump) produce the same result (and should be accepted as valid).

Btw, while testing I always rewound my replay to around 1:33:10, so I assume that there is no accumulated error from the beginning; rather, the section from 1:38:00 to 1:46:10 produces something magical. Also, it is very consistent/deterministic, so I would assume that the drawing routine corruption is also pretty deterministic and that we could spot it in the code/debugger.

Friker

Quote from: dreadnaut on April 11, 2023, 11:51:48 PM
Now, let's define unstable replays as replays that load in a crashed state, or load in an OK state but play to a crash. Both are a thing
Ugh, do replays that load crashed but replay OK really exist? And can you continue driving before the finish line, and do they produce the evaluation screen? I want to see some, to analyze them with other settings. Still, it seems my last comment holds true: the problem is with the Play button. :)

Duplode

Quote from: Friker on April 12, 2023, 12:02:08 AM
I tried lowering DOSBox's CPU speed to cycles=fixed 2500 (Stunts visibly stutters). Lo and behold, playing a replay with the Play button (or the Play Fast button) produces the same result as fast-forwarding!!! !!! !!! I was pleasantly surprised. Then I tried a pretty high cycle speed (150000), so my machine does not catch up: same result as at normal speed, de-sync and crash. Then I tried lowering the cycles gradually, and the magic happens: somewhere between 4000 and 5000 cycles there is a magic area where a replay plays "correctly" (that is, does not crash) or fails depending on the graphical settings.

Thanks for doing these experiments, Friker! I have tried it with my unstable winning replay from ZCT099, and it goes exactly as you describe: it plays from the beginning to the finish with 2271 cycles, and veers off-track at 20000 cycles. I don't know how y'all see it, but I feel this being a DOSBox-induced (or maybe even hardware-induced, who knows) issue is a strong argument for erring on the side of leniency towards unstable replays.
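With two data points like these (completes at 2271, veers off at 20000), a bisection can narrow down where the behavior flips to within 100 cycles in about eight test runs. A hypothetical Python sketch follows: `completes` stands for one manual test run in DOSBox at a given cycles value, and the code assumes the outcome flips only once in the range, which the graphic-level observations suggest may not always hold.

```python
from typing import Callable


def find_threshold(completes: Callable[[int], bool], low: int, high: int,
                   tolerance: int = 100) -> tuple[int, int]:
    """Bisect for the cycles value at which a replay stops completing.

    Assumes completes(low) is True, completes(high) is False, and the outcome
    flips only once in between. Returns (highest value seen completing,
    lowest value seen crashing).
    """
    if not completes(low) or completes(high):
        raise ValueError("need a completing low bound and a crashing high bound")
    tolerance = max(1, tolerance)  # keeps low < mid < high inside the loop
    while high - low > tolerance:
        mid = (low + high) // 2
        if completes(mid):
            low = mid
        else:
            high = mid
    return low, high
```

Each step halves the remaining range, so the number of runs grows only logarithmically with the width of the interval being searched.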



@dreadnaut: There are a few things I'd like to reply to. What follows comes across as rather confrontational, and I'm sorry about that. It does seem that we're entering this discussion from very different positions, and that I haven't quite managed to make mine clear yet. That being so, I'll try to state my priorities more directly, rather like you have done. If where I'm coming from becomes clearer, that will already have been some progress.

Quote from: dreadnaut on April 11, 2023, 11:51:48 PM
But although well explained, this is not The Issue. The real problem is human: we don't like that pipsqueaks work hard on a replay, but it ends up in the bin. This is what this rule change is about.

I agree the issue is human, and wasted effort from pipsqueaks is a very important part of it. It is not the only one, though. If you'll allow me an absurd example, no one would care about the wasted effort if this were about someone spending fifty hours crafting a replay byte by byte in a hex editor. Things are different, though, if we are talking about a pipsqueak who drove their lap and had the game confirm it as complete. That being so, I see this as a matter of fairness, with the root issue being the one I have described earlier. The other human (or social) issues, chief among them the waste of effort and the joy-killing effects of result deletion, are very important consequences of that.

Quote from: dreadnaut on April 11, 2023, 11:51:48 PM
My position is that invalidating unstable replays affects only a few high level pipsqueaks, and rarely, but accepting unstable replays affects everyone else, today and tomorrow: [...]

Stated in this way, this looks like a straightforward utility calculus. However, I don't find it so simple. The main point of a Stunts competition is offering pipsqueaks the conditions for fair and enjoyable racing. Everything else follows from and depends on that. To my eyes, your stance amounts to sacrificing the gameplay experience and enjoyment of pipsqueaks for the sake of a number of meta concerns that are either hypothetical or secondary (we can ask Alan for confirmation, but I suppose we won't be bothered by having to capture a video in two separate parts once in a blue moon), and I don't see this as a good tradeoff. Things would be different if all unstable replays were like that one lap from Overdrijf, and every powergear race became a verification nightmare, but that's clearly not the case.

I'd like to comment specifically on the archival value point, given that, being the Competition Archive maintainer, I have some skin in the game. In my opinion, it is not an issue at all. Unstable replays have been known to exist for a very long time, and used to be broadly regarded as unproblematic by competitions -- just another peculiarity of RH powergear racing. As a consequence, the archives already feature unstable replays, and their value is no lower because of that. In fact, it is quite the opposite: the archives would be poorer without them. Stunts is a deeply weird game, and unstable replays are nothing but a tiny part of that weirdness. There's no point in outlawing bits of the gameplay, quirks and all, just to fit some ideal of clarity.

Quote from: dreadnaut on April 11, 2023, 11:51:48 PM
You can drive a high level powergear lap? Sorry, there's an extra validation for you. It's easy and you can do it on your computer.


It's not quite the walk in the park you make it sound. Since I don't want to bog down this reply in discussion of gameplay minutiae, I'll just point any curious readers to the shoutbox archive from April 8th and 9th, where Friker and I look at the ZCT260 in-game situation in some detail.

Quote from: dreadnaut on April 11, 2023, 11:51:48 PM
Looking at the problem again (invalidating replays is cruel), I would say: yes, it's true! Sorry, the choice is between a cost for the individual and a cost for the community.


No, this is not a choice between individual and community. We pipsqueaks are part of the community as well -- even the "pro" ones. When pipsqueaks are asked to jump through hoops or to give up some of their enjoyment in the name of some greater good, there is a cost for the community too, just in a different way.

Daniel3D

@dreadnaut and @Duplode.
Both of you speak on behalf of the community. There are a lot of factors involved and a lot of uncertainties.

Both opinions are valid.

However, unstable replays are apparently circumstance-dependent. What I mean by that is: there is no way of knowing whether (PG) replays that did verify are stable. They were just fine under the conditions set by Dreadnaut.
If changing the number of cycles has an influence, then I feel there is reason to investigate.

Ultimately I think that it is up to the community to state what they are more comfortable with.

For me a PG lap is magical, stable or not. Rejecting a replay based on stability doesn't matter to me; the second or third best is still way up there. It only affects the PG drivers themselves.
With that, I think Dreadnaut's community argument doesn't apply to me.

I do feel the disappointment of other players and that influences me more....

dreadnaut

@Duplode Confrontation happens; it's part of discussing things one cares about. How we react to and resolve the confrontation is the important bit 👍

Things are much clearer, I think. It seems that I have discounted the effort (or likelihood?) of checking one's own replays. For me it quickly became a habit: if I go through a tough PG section, I save-load-play to check that I came out of it "clean". In a team, it happens naturally if you share your replays to discuss them: the first team mate who looks at your replay will spot the issue. Alan did not [have time to] share his last attempts in ZCT260, and we didn't get a chance.

The disappointment of discovering that my replay is wobbly, at most a few hours after driving it, is definitely smaller than the disappointment of those who find out after the race is over. I thought the simple test would allow everyone to check early, but apparently it did not catch on. I feel a bit like we are relaxing a rule because pipsqueaks do not follow it and are upset that it still applies (Friker excluded). That's the feeling, but I know the problem is elsewhere. Was the rule unclear? Was the test not clear? How can we advertise things better? Is it possible to improve the situation without removing the rule? Or does the rule have to go, or be relaxed, for this to ever work?


Quote from: Daniel3D on April 12, 2023, 09:09:29 AM
If changing the number of cycles has an influence, then I feel there is reason to investigate.

Maybe we need to ask Stan or Marco to play samples of unstable replays (play to crash, load to crash) on an old 386-486 PC, and see how they behave on original hardware 🤔
 

Frieshansen

Quote from: Duplode on April 11, 2023, 06:21:08 AM
(Still, the issue of replays playing differently across DOSBox-running computers does sound worth investigating. Do you have any example replays at hand?)

Attached are two of my replays with this problem from ZCT251. I tried to collect all available data - thanks to @dreadnaut and @alanrotoi for the support. All information is from 06/2022.

1-17-50.rpl      
dreadnaut:
<= 6000 cycles: completes
>= 7000 cycles: crashes in the cork r/l just before the finish line
alanrotoi:
completes
frieshansen:
completes

1-17-45.rpl      
dreadnaut:
<= 6000 cycles: completes
>= 7000 cycles: crashes into the tunnel just after the cork l/r
alanrotoi:
completes
frieshansen:   
completes

Systems

dreadnaut:
DOSBox staging 0.76.0, DOSBox staging 0.78.1 -> all on Linux (not 100% sure about the OS anymore)
with the attached config dosbox-0.76-stunts.conf

alanrotoi:
DOSBox 0.74-3 with cpu speed 20000 cycles

frieshansen:   
DOSBox 0.74-3, DOSBox staging 0.76.0, DOSBox staging 0.78.0 -> all on Windows 10
DOSBox staging 0.76.0 -> on Linux (in a VM on Windows)
all with config file from dreadnaut


I tried a lot of things back then to somehow recreate the problem on my system, but everything failed. At that point I lost confidence in DOSBox playing replays reliably, and thought that the final decision about a valid replay should be made with the help of the most accurate possible instance of the game itself. I have no problem with DOSBox per se, and it is certainly the easiest way for us to drive, but for validation in extreme cases, I think it is poorly suited.
But I'm probably going a bit overboard here. If even Stunts gurus like Duplode hardly know of this problem, it is probably very rare. I also can't say for sure if it's really DOSBox; it's just a guess for which I have no proof.
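Reproducing this kind of cycles-dependent behavior means running the same replay at several fixed CPU speeds. A small, hypothetical Python helper can write one config fragment per cycles value. It assumes the classic `[cpu]` section with `cycles=fixed N`, as used by DOSBox 0.74; newer DOSBox Staging releases may name the setting differently, so check the documentation for your version.

```python
from pathlib import Path


def cycles_conf(cycles: int) -> str:
    """Config fragment that pins the emulated CPU speed (classic [cpu] syntax)."""
    return f"[cpu]\ncycles=fixed {cycles}\n"


def write_sweep(directory: Path, values: list[int]) -> list[Path]:
    """Write one fragment per cycles value, e.g. cycles-6000.conf."""
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for cycles in values:
        path = directory / f"cycles-{cycles}.conf"
        path.write_text(cycles_conf(cycles))
        paths.append(path)
    return paths
```

For instance, `write_sweep(Path("sweep"), [5000, 6000, 6500, 7000, 8000])` brackets the 6000/7000 boundary reported above. Each fragment can then be combined with a full config such as dreadnaut's, since DOSBox accepts more than one `-conf` option; how overlapping settings are merged is worth checking for the version in use.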

Argammon

Wow, I have missed a lot of posts while I was away. Without further ado, here is some more research:

I found that the problematic frame in Friker's replay is 1.46.00, in which the car completes the sharp turn. All pictures below are generated from the same replay:

Picture 1: Press play button from 1.46.00 or before
Picture 2: Load the replay from options / Press the play button or the 0.05 forward button from 1.46.05 or later
Picture 3: Press the 0.05 forward button from 1.46.00 or before

So there are actually 3 different states  ::)
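The timestamps in these posts advance in 0.05 s steps, matching the 1/20 s frame mentioned earlier in the thread, so each one can be mapped to a frame number. A small hypothetical Python helper, assuming frame 0 sits at 0.00 and that times are written as minutes, seconds and hundredths separated by dots or colons:

```python
import re


def frame_index(timestamp: str, fps: int = 20) -> int:
    """Convert '1.46.05' or '1:46:05' (min, sec, hundredths) to a frame number."""
    minutes, seconds, hundredths = (int(p) for p in re.split(r"[.:]", timestamp))
    total = (minutes * 60 + seconds) * 100 + hundredths  # hundredths of a second
    step = 100 // fps  # hundredths per frame: 5 at 20 fps
    if total % step:
        raise ValueError(f"{timestamp} does not fall on a frame boundary")
    return total // step
```

Under these assumptions, 1.46.00 is frame 2120 and 1.46.05 is frame 2121, so the 0.05 forward button advances exactly one frame, and the boundary between Pictures 1 and 2 is a single frame.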


Overdrijf

3 states, of which 2 don't crash.

Presumably only one state eventually finishes, so what does the Friker of the third timeline do? Park at Joe's for a cup of coffee?

I need to write a Stunts based alternative universes story.