Uniquely identifying tracks and replays

Started by Cas, March 02, 2019, 11:44:58 PM

Cas

What we have
This is about one particular point that could later help for the global registry, but also for many other things. As I had said before in another thread, we don't currently have a way to uniquely identify tracks, let alone replays. For tracks, in ZakStunts, there's the ZCT number, which does a very good job, but it's not perfect, since one track may have more than one ZCT number and of course, tracks from other tournaments or well-known tracks that don't particularly belong to a tournament, don't have to have a ZCT number. Also, fantasy-names, which Bliss usually refers to as "track-titles" are good too, but are long to type and may contain non-ASCII characters and besides, many tracks don't have them, so they are good for looks, but not for general identification.

What could be done
I was thinking of a general track identification number system. There are two approaches to this: one would be to assign numbers to as many tracks as possible up front, maybe even reserving number ranges depending on where the track originated or how it's been used. The other is to just start from zero or one and assign numbers as tracks are registered. The first option is more orderly, although the second one has the advantage of leaving no "holes". Furthermore, there's the question of how to format these track IDs. A decimal number?  An alphanumeric string?  Maybe a hex number?  It could also be a leading letter and a number. Has anybody thought about something like this before?

Estimations
I reckon we have a few thousand tracks, counting what each of us has on their hard drive, plus old tracks from old championships, etc. Four decimal digits should be enough, but could be exceeded with some effort. Five or six should definitely be good. But usually, we're interested in about 500 tracks only. Something like #00921 could look good for a general track number, or maybe like 14-0122, where the first two digits would identify the source or origin of the track. I kind of prefer the first.
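To make the two formats concrete, here is a rough Python sketch. The function names, the five-digit width and the two-digit source code are only assumptions for illustration; nothing here is an agreed scheme.

```python
def format_plain_id(number: int, width: int = 5) -> str:
    """Format a sequential track number as e.g. '#00921' (width is an assumption)."""
    if number < 0 or number >= 10 ** width:
        raise ValueError(f"track number must fit in {width} digits")
    return f"#{number:0{width}d}"


def format_sourced_id(source: int, number: int) -> str:
    """Format as 'SS-NNNN': two digits for the (hypothetical) source, four for the track."""
    if not (0 <= source <= 99 and 0 <= number <= 9999):
        raise ValueError("source must fit in 2 digits and number in 4")
    return f"{source:02d}-{number:04d}"


def parse_sourced_id(text: str) -> tuple[int, int]:
    """Split an 'SS-NNNN' ID back into (source, number)."""
    source, number = text.split("-")
    return int(source), int(number)
```

With these, format_plain_id(921) gives "#00921" and format_sourced_id(14, 122) gives "14-0122". The plain form leaves no holes; the sourced form reserves a range per origin.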

On replays
We currently have no general ID for replays, I think, other than them being saved in folders with their corresponding race in tournaments. A replay ID number could contain the track ID as part of it, but could also be completely general or be based on the date and time when it was first registered anywhere (such as when it was uploaded to a championship). In that case, it'd be complicated to work out the numbers for the thousands of replay files we already have.

On track hashes and circuit hashes
As I was taking a walk, I thought: it's great to be able to quickly compare tracks by using hashes based on all of the bytes in the file. Then, when they match, if we want, we may go ahead and compare the whole thing byte-by-byte to make sure it's the same, or not. But, depending on what we're looking for, sometimes it could make more sense to compare not the whole contents, but only what's on the accessible path of the track. A sequential follow-up of the track and its branches could result in a parallel circuit-hash that could be used to tell if two tracks are very similar or just versions of the same track. We might even be able to detect a rotation. Of course, the general track hash is still very important. I'm just saying the second one could be a plus.
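The "hash first, then byte-by-byte" check on whole files could look roughly like this in Python. SHA-1 is just one possible choice of hash, and the function names are made up for the example; the circuit-hash idea is not covered, since it would need to walk the track layout.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Hash every byte of a track file (SHA-1 chosen only as an example)."""
    return hashlib.sha1(data).hexdigest()


def same_track(a: bytes, b: bytes) -> bool:
    """Cheap hash comparison first, then a full byte comparison to rule out a collision."""
    if file_hash(a) != file_hash(b):
        return False  # different hashes: the files certainly differ
    return a == b     # same hash: confirm byte by byte
```

In practice the hashes would be computed once and stored, so that only candidate matches need the full comparison.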

Finally
Thanks, guys, for reading my frequent brainstorming. Most people don't like reading long posts, but I think it's a good idea to just drop this here in case, now or anytime, anybody would like to think about it and add to it, or maybe I spark a new idea in one of you that may be much better than mine. My grain of sand.
Earth is my country. Science is my religion.

dreadnaut

I think different identification systems might work better for different tasks.

For example, books are classified in libraries with different schemes: you have an author and a title, an aisle and a shelf number to find the physical book, a code following the Dewey Classification, but also an ISBN.

Each way to "address" a book is actually answering a question: who wrote this book and what is it about? Where do I find it? How can I identify it and order it from the publisher? Which questions do we have about Stunts tracks and replays?

In ZakStunts I need to answer the questions "which of the known tracks does this replay belong to?" and "have I seen this replay before?". For this, I have been using a full-file SHA1 hash, which has good distribution guarantees and makes collisions unlikely. Given the small space of the existing tracks and replays, you could keep the first 7-8 characters of the hash and still avoid collisions. But these identifiers are opaque and difficult to remember; they are not meant for humans.
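As a rough illustration of the shortened-hash idea, this Python sketch finds the smallest prefix length that still keeps a given set of hex digests distinct. The helper names are made up for the example and this is not the actual ZakStunts code.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Full-file SHA-1 as a 40-character hex string."""
    return hashlib.sha1(data).hexdigest()


def shortest_unique_prefix(hashes: list[str]) -> int:
    """Smallest number of leading characters that keeps all distinct hashes distinct."""
    unique = set(hashes)  # identical files share a hash; that is not a collision
    for length in range(1, 41):
        if len({h[:length] for h in unique}) == len(unique):
            return length
    return 40
```

Running shortest_unique_prefix over the digests of every known track and replay would show whether 7 or 8 characters are really enough for the current collection.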

This is also a strict file comparison, while other "similarity" measures are of course possible. One could consider only the track, minus scenery elements—but sometimes a well placed building can make a difference. Maybe there's a good "checksum" based on adding the "values" of each tile in the track, beginning with the start line and following all paths. And going full data-science, if you consider the map of a track as an image, you can apply feature recognition techniques to compare tracks. But what question does this answer? Can it map tracks on a space which gives us useful insights, or is it merely a numeric exercise, and looking at the maps would be faster?

If the question instead is about storing files on disc, the ZCTnnnn format is not bad, and could be extended to other competitions. However, I've recently been working on splitting the concepts for "track" and "race", at least in the back-end.

A track is just a track file, paired with a record (title, author, preferred 8-character name). A race has a ZCT number, a track, start and end dates, extra rules, etc. So ZCT079 is a race that ran between the 1st and the 31st of December 2007, on the DEFAULT track, which was created by Distinctive Software in 1990. The split can of course be confusing, so when you download the track it's still called ZCT079.TRK.
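One way to picture the split is as two record types, sketched here in Python. The class and field names are guesses made for the example, not the real ZakStunts back-end, and the file contents and title are placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Track:
    file_bytes: bytes  # the track file itself (placeholder content below)
    title: str
    author: str
    short_name: str    # preferred 8-character name


@dataclass(frozen=True)
class Race:
    zct_number: int
    track: Track
    start: date
    end: date

    def download_name(self) -> str:
        """The file is still served under the race's ZCT name."""
        return f"ZCT{self.zct_number:03d}.TRK"


# The example from the post: ZCT079 ran in December 2007 on the DEFAULT track.
default_track = Track(b"", "DEFAULT", "Distinctive Software", "DEFAULT")
zct079 = Race(79, default_track, date(2007, 12, 1), date(2007, 12, 31))
```

Here zct079.download_name() returns "ZCT079.TRK", while the same Track object could be attached to any number of other races.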

While for competition tracks we could use a "competition + counter" id, I can't think of a good way to classify non-competition tracks, which also abound. The reason might be that we don't have corridors and shelves to explore, and that's what we need to find out first, because it will inform our classification system.

Maybe it's by "track type" (flat, aerial, ...), or by "number of minutes for a Lambo track", or... The number of numbering systems is infinite, but each one will suffer from the mismatch between universality and the subjectivity of defining it. For this reason, I am not too interested in finding the perfect naming pattern, and I'll stick to unreadable hashes. On top of that, we can build facets and tags and any sort of partial classification, each a different view of the track-space, answering different questions :)


Cas

Quote from: Dreadnaut
Which questions do we have about Stunts tracks and replays?

Yes, this is a very good way of seeing it. Inside a tournament, there are, of course, internal questions that don't require anything universal. But if it is about the general user, what would they be looking for?  I think there are two kinds of questions. One is of the sort "I have this track or replay and I want to know more about it". In that case, the user would give a whole file, and the registry or any other system would use its own hash methods, which wouldn't need to match the methods of other systems, and bring up all the info it has about the track or replay that's relevant to the system. The other type would be "I'm looking for a track or replay that has these characteristics". There could be many answers. Now, for neither of these does it seem necessary to agree on a numbering system. It would, however, be necessary to internally identify the item uniquely with a non-changing ID so that, for example, one can grab a link from one site and put it on another, and it will direct people to the page regarding that item. Things like that.

Quote from: Dreadnaut
But what question does this answer? Can it map tracks on a space which gives us useful insights, or is it merely a numeric exercise, and looking at the maps would be faster?

Well, this is a thing separate from the numbering idea. What I envisioned does not necessarily have to be calculated the way I described, but the question it should answer is whether a track or a replay was (probably) created from another, if it's a version of another. The utility in this would be to associate items when the match is not perfect. Say somebody drove a replay on a well-known track, but first made a couple of touches. Some time later, another person is looking up the replay but can't find it because it turns out the track is not the same. Things like that. For example, my track "Abusái" was slightly modified before entering ZakStunts, which is OK. But I do have the original version. Again, this is an idea and a reason why it could be useful. I'm not saying that it's a requirement in any way for things to work. But consider also replays that are identical up to a certain byte, meaning they were created by "continue driving" from a point. I figure it'd be useful to tell that's the case (which brings me to a question I've had for a long time and I'll ask it at the end of this post, ha, ha).
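For the "identical up to a certain byte" case, a minimal Python sketch would be to measure the shared prefix of two input streams. It assumes the driving inputs have already been extracted from the replay files as raw bytes (how to do that is not shown, and any file header would need to be skipped first); the function name is made up.

```python
def common_prefix_length(a: bytes, b: bytes) -> int:
    """Number of leading bytes that two replay input streams have in common."""
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            return i      # first point where the two drives diverge
    return limit          # one stream is a prefix of the other
```

A long shared prefix followed by a divergence would suggest one replay was made by "continue driving" from the other, though it can't prove it.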

Quote from: Dreadnaut
I've recently been working on splitting the concepts for "track" and "race", at least in the back-end.

I believe this is a very good approach. It does make a lot more sense that way. Of course "season x, race y" also solves the problem, but the ZCT system is more orderly in my opinion, especially since races in ZakStunts are not aligned with a day of the month, but weekly, so it's not always easy to say "October's race".

Quote from: Dreadnaut
I can't think of a good way to classify non-competition tracks

If we follow the example of the ZCT system, then I think the natural expansion would be to have a leading code such as "NTB" (non-tournament-based), or each author could have a leading ID and local numbers. The thing is, as you said before, the ZCT method is better at defining races than tracks, and the same thing happens with these other possibilities. It could make sense, for some reason, for one track to have more than one number in the same classification or even to be present in several classifications.

Now to the question I've had for a long time

I've always assumed that, in ZakStunts, once you've posted a replay, it's not correct to post another that's based on the one you've already posted. That is, I can do all the RH I want, but once the replay is out, I have to start from scratch for the next one. Yet, I have to admit, I've never read that rule anywhere, so I don't know if my assumption is correct. So... is it?  And does ZakStunts check for this at all?  As I've never done that, I haven't tested it.

Overdrijf

Quote from: Cas on March 03, 2019, 08:30:54 PM
Now to the question I've had for a long time

I've always assumed that, in ZakStunts, once you've posted a replay, it's not correct to post another that's based on the one you've already posted. That is, I can do all the RH I want, but once the replay is out, I have to start from scratch for the next one. Yet, I have to admit, I've never read that rule anywhere, so I don't know if my assumption is correct. So... is it?  And does ZakStunts check for this at all?  As I've never done that, I haven't tested it.

I don't think so. It never occurred to me to interpret the rules that way. I've done plenty of "I don't have much time, I'll just redo the last part" improvements.

In fact, I've learned a time-hiding trick from CTG: if your replay is much faster than the current record, create several versions of it by slowing down before the finish line. You post the slowest one and keep the faster one(s) as a backup. It seemed like a logical extension of replay handling. (It's a bit of a trolly gimmick, but fun to use every now and then.)