initial cep for repodata state #46

Open: wants to merge 4 commits into base: main

@wolfv wolfv commented Jan 9, 2023

Created a quick CEP for the new repodata state format (cc @dholth as we had some discussions about this. Happy to list you as author!).

Also, regarding the spec, I'm happy to make changes. My hope is just that mamba and conda can both share the same format.

Contributor

@baszalmstra baszalmstra left a comment

Looks good! Looking forward to it!

{
// we ensure that state.json and .json files are in sync by storing the file
// last modified time in the state file, as well as the file size
"file_mtime": {
Contributor

Maybe you can store this in an ISO standard DateTime format instead? Similar to the has_zst.last_checked date. Or is that not precise enough when dealing with file times?

Contributor

IMO the fractional time from os.stat().st_mtime would not be a problem; otherwise, mtime_ns would be a decent name.

Contributor Author

I would have to check how exactly the fractional time is computed though.

Contributor

In [51]: datetime.datetime.fromtimestamp(1666095612162394000/1000000000).isoformat()
Out[51]: '2022-10-18T08:20:12.162394'

🤔

Contributor

@dholth dholth Jan 9, 2023

Python's datetime has microsecond resolution. You would always have to convert st_mtime to a datetime, and then compare the two datetimes instead of trying to convert the stored datetime to .timestamp() and compare with st_mtime. datetime.isoformat shows an offset '2022-11-12T00:39:55.564608+00:00' and if you want Z that's string manipulation. But it is very nice that an iso date is human readable, compared to a very long number.

In [39]: datetime.fromtimestamp(Path('Dockerfile').stat().st_mtime, tz=zoneinfo.ZoneInfo('UTC')).resolution
Out[39]: datetime.timedelta(microseconds=1)
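
To make the resolution point concrete, here is a minimal sketch showing that a nanosecond mtime loses its last three digits when it is stored as a datetime. The value `mtime_ns` is a made-up example with non-zero nanoseconds; nothing here is taken from conda or mamba.

```python
from datetime import datetime, timezone

# Hypothetical st_mtime_ns value with non-zero nanosecond digits.
mtime_ns = 1666095612162394123

# Split with integer arithmetic to avoid float rounding.
seconds, nanos = divmod(mtime_ns, 1_000_000_000)

# datetime can only hold microseconds, so the last three digits are dropped.
dt = datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=nanos // 1000)

# Convert the datetime back to integer nanoseconds.
round_trip_ns = (seconds * 1_000_000 + dt.microsecond) * 1000

print(dt.isoformat())
print(mtime_ns - round_trip_ns)  # 123 nanoseconds lost
```

Comparing the raw integer with a value that went through a datetime would therefore report a mismatch on any filesystem that records sub-microsecond times.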

Contributor Author

I think the fractional time is not great because it makes it appear as if it was nanosecond resolution while it isn't. And then we'd have to match the same behavior in C++ or any other language that wants to share the cache.

It seems more natural to store seconds / nanoseconds since UNIX epoch for the mtime since that is what Python does and we don't need to deal with timezones.

However, I am open to other suggestions as long as we can agree on the same behavior.
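
A minimal sketch of the seconds / nanoseconds idea in Python, assuming the value comes from `os.stat().st_mtime_ns`. The helper name `file_mtime_entry` and the dictionary layout are illustrative, not a confirmed part of the spec.

```python
import os
import tempfile


def file_mtime_entry(path):
    """Return the mtime as integer seconds and nanoseconds since the UNIX epoch."""
    # st_mtime_ns is an integer, so no float rounding and no timezone handling.
    seconds, nanoseconds = divmod(os.stat(path).st_mtime_ns, 1_000_000_000)
    return {"seconds": seconds, "nanoseconds": nanoseconds}


# Usage with a throwaway file.
with tempfile.NamedTemporaryFile(suffix=".json") as tmp:
    entry = file_mtime_entry(tmp.name)

print(entry)
```

Both parts are plain integers, so an implementation in C++ or another language can reproduce the same pair from its own stat call without parsing a date string.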

Contributor

I don't know why a . is a problem; however, a single nanoseconds-since-epoch number would be tolerable.

Contributor Author

Contributor

@dholth dholth Jan 9, 2023

There's something extremely pedantic to be said about timestamps; on the other hand comparing two st_mtime_ns is really simple. I'm also happy to see Python's time.time_ns()

// The header values as before
"url": STRING,
"etag": STRING,
"mod": STRING,
Contributor

Maybe change this to last_modified similar to the header name.

Contributor

These are the old names without a leading _

Contributor Author

I am open to change them, though.

Contributor

Have implemented (nicer) non-underscored names in a conda branch.

@dholth
Contributor

dholth commented Jan 9, 2023

One thing about this is that when you start using alternative formats (.zst, .jlap) the remote headers last-modified, etag, cache-control come from the alternate file.

@dholth
Contributor

dholth commented Jan 9, 2023

An example of the current jlap branch's .state.json.

have is the nominal hash (what the hash of the original repodata.json was according to jlap)

have_hash is the actual hash on disk since we don't serialize with exactly the same sorting, formatting as conda-index. Could be used instead of mtime (if file on disk doesn't match have_hash, then it doesn't correspond to this state.json)

jlap includes too many headers from the jlap request, an intermediate hash iv corresponding to pos- bytes in the file, and the last line of the jlap file.

{
 "_url": "https://repo.anaconda.com/pkgs/main/osx-arm64/repodata.json",
 "_mod": "Fri, 06 Jan 2023 20:09:20 GMT",
 "_etag": "W/\"0134da379063a17831ef4ed73d3489dd\"",
 "_cache_control": "public, max-age=30",
 "have": "e45656091705b1be55b72a3b48068520cc3309560ca0d37fb96f2b2ea559c81f",
 "have_hash": "e45656091705b1be55b72a3b48068520cc3309560ca0d37fb96f2b2ea559c81f",
 "mtime": 1673274986.4032788,
 "jlap": {
  "headers": {
   "date": "Mon, 09 Jan 2023 14:36:25 GMT",
   "content-type": "text/plain",
   "transfer-encoding": "chunked",
   "connection": "keep-alive",
   "x-amz-id-2": "81qDgERSlA/bEpQQeL/YBn3BniAaB37uUkbD5ZySYC/h9JWb+8Sbg1ik70ufAvNtzTeGHqiwZHI=",
   "x-amz-request-id": "Q1HHT9KCY69TQXQX",
   "last-modified": "Fri, 06 Jan 2023 20:09:20 GMT",
   "x-amz-version-id": "BaggXYx0RmtOxe4B6PIl3IDX_G8Ryt5X",
   "etag": "W/\"0134da379063a17831ef4ed73d3489dd\"",
   "cf-cache-status": "MISS",
   "expires": "Mon, 09 Jan 2023 14:36:55 GMT",
   "cache-control": "public, max-age=30",
   "set-cookie": "__cf_bm=YfYijeZ0Y8xc_XBZebA1UA1bX9uz47v67b3guqZFfY0-1673274985-0-Ab9B1pIhPfM0fQkGk5rTS9A5vvc3tPD37jnV+pmXSI2C82sgdxdKaBCB3zhj4wQ6P1yVmaYRpioaARDTmt6H5s0=; path=/; expires=Mon, 09-Jan-23 15:06:25 GMT; domain=.anaconda.com; HttpOnly; Secure; SameSite=None",
   "vary": "Accept-Encoding",
   "server": "cloudflare",
   "cf-ray": "786de7b3ffe9244d-ATL",
   "content-encoding": "gzip"
  },
  "iv": "2f598f0587410d455c8370bef19759fd7a25b4aad4d65c3b3b1be7c7422a938c",
  "pos": 1714976,
  "footer": {
   "url": "repodata.json",
   "latest": "e45656091705b1be55b72a3b48068520cc3309560ca0d37fb96f2b2ea559c81f"
  }
 }
}

@dholth
Contributor

dholth commented Jan 10, 2023

In the existing format "_url": "https://conda.anaconda.org/conda-forge/osx-arm64", doesn't contain the filename. May want to continue doing that especially since different repodata variants matter. Or ignore the field.

@wolfv
Contributor Author

wolfv commented Jan 10, 2023

Any chance you have time to make a PR against my branch with your change suggestions? Or I can also try to give you edit rights, if you want :)

@dholth
Contributor

dholth commented Jan 10, 2023

@wolfv do you mean wolfv#1?

@dholth
Contributor

dholth commented Jan 11, 2023

If we are going to play with nanoseconds, let's go ahead and replace all timestamps (except those that are web server headers) with those numbers. e.g. last_checked.

We will quickly release a conda with the main last_modified, cache_control, etag state but it will take us a few more releases to get to "last checked zstd"

Do we standardize how environment locking works?

@wolfv
Contributor Author

wolfv commented Jan 12, 2023

Environment locking as in conda-lock, or as in filesystem lockfiles to prevent overwriting things?

@wolfv
Contributor Author

wolfv commented Jan 12, 2023

My reasoning for the different formats is that for the mtime checking it is "precise" since we actually want to match the file on disk.

For the "last time checked zst" we just want a timestamp so we can check that it's been more than 2 weeks and it doesn't need to be precise. We had this function around to create a RFC3339 string representation so I just used that. We could also use nanoseconds but this one doesn't need to be precise.
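
A minimal sketch of that coarse check, assuming the timestamp is stored as an RFC 3339 string in UTC with second precision and a literal `Z` suffix. The function name `needs_recheck`, the sample timestamps, and the exact 14-day threshold are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

RECHECK_AFTER = timedelta(days=14)  # "more than 2 weeks" from the comment above


def needs_recheck(last_checked, now=None):
    """Return True if the RFC 3339 UTC timestamp is older than RECHECK_AFTER."""
    # strptime with a literal "Z" works on all Python 3 versions;
    # the parsed value is naive, so attach UTC explicitly.
    checked = datetime.strptime(last_checked, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return now - checked > RECHECK_AFTER


now = datetime(2023, 2, 1, tzinfo=timezone.utc)
print(needs_recheck("2023-01-27T02:09:53Z", now))  # False: less than two weeks old
print(needs_recheck("2023-01-01T00:00:00Z", now))  # True: a month old
```

Second precision is plenty here, because the comparison is against a window of days rather than against a file on disk.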

@dholth
Contributor

dholth commented Jan 25, 2023

I tried out micromamba's January release, and it downloads repodata.json.zst very quickly. It produces the state file included below.

In Python it is easier to store nanoseconds as a single number, time.time_ns().bit_length() is only 61 bits today.

In [5]: datetime.datetime.fromtimestamp(2**64//1e9)
Out[5]: datetime.datetime(2554, 7, 21, 19, 34, 33)

In the jlap branch I store "jlap_unavailable" as a timestamp, assuming you check alternative formats in a known order of preference unless you know they are 404's.

{
    "cache_control": "public, max-age=30",
    "etag": "\"a9c77cc4c9b1a947375d53326f1604de\"",
    "file_mtime": {
        "nanoseconds": 960132000,
        "seconds": 1674678293
    },
    "file_size": 6974249,
    "has_zst": {
        "last_checked": "2023-01-27T02:09:53Z",
        "value": true
    },
    "mod": "Wed, 25 Jan 2023 19:54:35 GMT",
    "url": "https://repo.anaconda.com/pkgs/main/osx-arm64/repodata.json.zst"
}
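
As an illustration of how a reader might use the `file_mtime` and `file_size` fields from a state file like the one above, here is a hedged sketch. The function name `state_matches` is hypothetical, and treating missing keys as a mismatch is an assumption rather than documented micromamba behavior.

```python
import json
import os
import tempfile


def state_matches(state, repodata_path):
    """Return True if the recorded size and mtime match the file on disk."""
    mtime = state.get("file_mtime")
    if mtime is None or "file_size" not in state:
        return False  # missing keys: treat the state as unusable
    st = os.stat(repodata_path)
    recorded_ns = mtime["seconds"] * 1_000_000_000 + mtime["nanoseconds"]
    return st.st_size == state["file_size"] and st.st_mtime_ns == recorded_ns


# Usage with a throwaway file standing in for the cached repodata.json.
with tempfile.TemporaryDirectory() as cache_dir:
    path = os.path.join(cache_dir, "repodata.json")
    with open(path, "w") as f:
        json.dump({"packages": {}}, f)
    st = os.stat(path)
    seconds, nanoseconds = divmod(st.st_mtime_ns, 1_000_000_000)
    state = {
        "file_size": st.st_size,
        "file_mtime": {"seconds": seconds, "nanoseconds": nanoseconds},
    }
    fresh = state_matches(state, path)
    stale = state_matches({**state, "file_size": st.st_size + 1}, path)

print(fresh, stale)  # True False
```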

@dholth
Contributor

dholth commented Feb 1, 2023

Am working on a "lock byte 21" implementation like mamba; where we lock the .state.json before doing anything else (even before reading or stat'ing cached repodata.json), and .state.json is the only lockfile. Then we always try to keep the lock for as short a time as possible, e.g. download repodata.json to a temp file, then stat it, then move it on top of the cache filename. This locking is only for the integrity of the repodata.json cache, not per-directory locking to prevent package cache overwrites etc.

On Windows, it might be more appropriate to lock and overwrite repodata.json, instead of the unix style of atomically moving a tempfile on top of the desired file (on Windows, you cannot move a file on top of an existing file; you have to delete the existing file first)

@wolfv
Contributor Author

wolfv commented Feb 2, 2023

You might also run into issues with atomic renames if your temporary file is on a different fs. /tmp is often a different fs :) So I'd stat it after moving.

@dholth
Contributor

dholth commented Feb 2, 2023

The temporary file is in the same cache folder.
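
The write-then-move pattern discussed above can be sketched roughly as follows. This is not the conda or mamba implementation: the helper name `replace_cached_file` is hypothetical, locking is left out entirely, and the sketch stats the file after the move, as suggested in the reply above, rather than before it.

```python
import os
import tempfile


def replace_cached_file(cache_dir, name, data):
    """Write data to a temp file in cache_dir, move it into place, return its stat."""
    # Creating the temp file inside cache_dir keeps source and destination
    # on the same filesystem, which the rename needs in order to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        final_path = os.path.join(cache_dir, name)
        os.replace(tmp_path, final_path)  # overwrites an existing destination
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # do not leave a stray temp file behind
        raise
    return os.stat(final_path)  # stat after the move


with tempfile.TemporaryDirectory() as cache_dir:
    replace_cached_file(cache_dir, "repodata.json", b"{}")
    st = replace_cached_file(cache_dir, "repodata.json", b'{"packages": {}}')
    size = st.st_size
    leftovers = sorted(os.listdir(cache_dir))

print(size, leftovers)  # 16 ['repodata.json']
```

Python's `os.replace` is documented to overwrite an existing destination file on both POSIX and Windows, although on Windows the call can still fail if another process has the destination open, so the Windows concern raised above is not fully addressed by this sketch.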

@baszalmstra
Contributor

@dholth Does your implementation also produce a <hash>.lock file (like mamba), or does the <hash>.state.json file also function as a lock file? I guess that would also make a lot of sense. Would you be able to formalize that in the CEP @wolfv ?

> In the existing format "_url": "https://conda.anaconda.org/conda-forge/osx-arm64", doesn't contain the filename. May want to continue doing that especially since different repodata variants matter. Or ignore the field.

I agree with this. I would keep it though to ensure hash collisions don't form an issue. I also like that I can find the original URL from the hash. Can we maybe formalize that in the CEP @wolfv ?

> Have implemented (nicer) non-underscored names in a conda branch.

@dholth What did you end up calling these? I especially think the mod could just be named last_modified, similar to the HTTP header from where it comes. Also, shouldn't a few of these fields be optional, given that not all HTTP servers return these headers? WDYT @wolfv ?

@dholth
Contributor

dholth commented Feb 14, 2023

For example, I've been working in this branch; the link should take you to conda/gateways/repodata/__init__.py, which has code to handle the ISO timestamps. The RepodataState class should clearly show the current format.

From my reading of the mamba code, it locks a certain byte in the repodata.json / repodata.state.json files. I didn't notice it creating a .lock file, although that is maybe a more old-school technique, and necessary for locking a complete directory. I haven't attempted to lock complete directories in my branch. So far I am only trying to maintain the integrity of the repodata.json cache and don't do anything to prevent e.g. parallel conda processes downloading it twice; they simply won't corrupt the cache. Feedback & PRs against PRs welcome.


This is not an ideal approach as it modifies the `repodata.json` file and corrupts e.g. the hash of the file. Also, the repodata files have gotten increasingly large, and parsing these state values can require parsing a large `json` file.

Therefore we propose to store the metadata in a secondary file called `.state.json` next to the repodata.
Member

I've been reviewing the conda implementation in conda/conda#12425 and have stumbled over the state term there a few times.

I don't think "state" correctly describes what's in the proposed file as its content is mostly derived from stat-like calls -- and stat for better or worse refers to "status" and not really the ambiguous term "state". This might be confusing for future maintainers.

I know this sounds like nitpicking, but before we're shipping this in conda, I'd rather mention it before we regret it.

Contributor

We have size and mtime_ns from stat (short for status) calls; last-modified and etag headers for caching; "is some number of alternative remote files available", possibly a content hash of the full file; an intermediate hash and position to start fetching an update to the remote jlap file if that is being used; and the time that the remote server was last checked for new content.

Contributor

I'm fine either way. 👍

Contributor

IMO we are using 5 meaningless letters "state", we could make it even more meaningless ".s.json" (and nicely shorter), we could roll the dice and call it >>> os.urandom(2).hex() 'ec5a'.json

Contributor

What about something that does make sense? .cache-info.json perhaps?

Contributor

plain info.json would be short

Contributor

{
// we ensure that state.json and .json files are in sync by storing the file
// last modified time in the state file, as well as the file size

Member

I strongly suggest adding a version entry to the format, to be able to evolve it if needed.

Suggested change
// version of the repodata status format
"version": 1,

Contributor

@dholth dholth Mar 2, 2023

If I had to release a version 2, I would say all files without a "version" key are version 1.

Contributor

That's also fine for me!

Contributor

I think I agree with Jannis to add version: 1, and an older conda should ignore newer .state.json
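
One way a reader could combine the two suggestions above (a missing "version" key means version 1, and an older client ignores newer files) is sketched below. The names `SUPPORTED_VERSION` and `usable_state` are hypothetical, and returning an empty dict for unknown versions is an assumption, not agreed behavior.

```python
SUPPORTED_VERSION = 1  # highest format version this hypothetical reader understands


def usable_state(state):
    """Return the state dict if this reader understands it, else an empty dict."""
    version = state.get("version", 1)  # no "version" key: assume version 1
    if not isinstance(version, int) or version > SUPPORTED_VERSION:
        return {}  # newer or malformed: behave as if no state file existed
    return state


print(usable_state({"etag": "abc"}))                # kept, implicit version 1
print(usable_state({"version": 1, "etag": "abc"}))  # kept
print(usable_state({"version": 2, "etag": "abc"}))  # {}
```

Falling back to an empty state simply means the client re-fetches repodata as if the cache metadata were absent.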

@dholth
Contributor

dholth commented Mar 2, 2023

@baszalmstra our implementation assumes all the keys are optional, same as if the file was missing. We also treat the keys as missing if state.json doesn't match the cached repodata.json file. We could formally add that to the CEP.

@baszalmstra
Contributor

Our state data structure in rattler looks like this.

You may note that besides checking the timestamp and the size, we also check the repodata.json against a blake2 hash. If the timestamp and the size don't match but the blake2 hash still matches, we consider the data to be up-to-date.

I like the idea of having all keys be optional. Currently, the mtime_ns, size, and url in our implementation are not optional.

We also create an extra lockfile (.lock) that guards both the repodata.json and the state.json file. I think this is also what mamba is doing by observing its behavior. But looking at the code mamba is creating several lock files throughout the process on several different files.
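
The hash fallback described above might look roughly like this in Python. This is not rattler's code: the key name `blake2_hash`, the helper names, and the choice of BLAKE2b with a 256-bit digest are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def blake2_256(path):
    """Hex BLAKE2b digest (32 bytes) of a file, read in chunks."""
    h = hashlib.blake2b(digest_size=32)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def cache_is_current(state, path):
    """Cheap stat check first; fall back to hashing only when it fails."""
    st = os.stat(path)
    if state.get("mtime_ns") == st.st_mtime_ns and state.get("size") == st.st_size:
        return True
    expected = state.get("blake2_hash")  # hypothetical key name
    return expected is not None and blake2_256(path) == expected


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "repodata.json")
    with open(path, "wb") as f:
        f.write(b"{}")
    # mtime_ns=0 never matches the real file, so both cases reach the hash check.
    touched = {"mtime_ns": 0, "size": 2, "blake2_hash": blake2_256(path)}
    changed = {"mtime_ns": 0, "size": 2, "blake2_hash": "0" * 64}
    results = (cache_is_current(touched, path), cache_is_current(changed, path))

print(results)  # (True, False)
```

The hash is only computed when the stat values disagree, so the common case stays as cheap as a single stat call.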

@dholth
Contributor

dholth commented Mar 6, 2023

I was preparing to add the hash as well, at least for jlap. I have two hash fields, NOMINAL_HASH = "nominal_hash" and ON_DISK_HASH = "actual_hash". nominal_hash is the hash according to the .jlap file, and actual_hash is the hash after we json.dumps the updated data.

@dholth
Contributor

dholth commented Mar 6, 2023

I assume mamba is also trying to lock whole directories.

The development conda implementation also does the trick of writing the new repodata.json to a new file, stat'ing the temporary name, and moving it on top of the old file.

We are always using BLAKE2(256), whose digests are the same length as sha-256 hashes, but "by default" blake2 produces an overkill 512-bit hash.
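
For reference, the digest lengths can be checked with Python's hashlib. This assumes "BLAKE2(256)" means BLAKE2b with `digest_size=32`; the input bytes are arbitrary.

```python
import hashlib

data = b"example repodata bytes"

blake2_256 = hashlib.blake2b(data, digest_size=32).hexdigest()  # 256-bit digest
blake2_default = hashlib.blake2b(data).hexdigest()              # default: 512-bit
sha256 = hashlib.sha256(data).hexdigest()

# Hex length is two characters per byte.
print(len(blake2_256), len(sha256), len(blake2_default))  # 64 64 128
```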

@baszalmstra
Contributor

baszalmstra commented Mar 6, 2023

> The development conda implementation also does the trick of writing the new repodata.json to a new file, stat'ing the temporary name, and moving it on top of the old file.

Yeah Rattler does the same. (On windows we use some Win32 API to achieve this).

> We are always using BLAKE2(256) which are the same length as sha-256 hashes, but "by default" blake2 produces an overkill 512-bit hash.

Still, it might be nice to include the algorithm used in either the key or value. It doesn't add overhead but makes it easier for others to deduce what's going on. We could also do "on_disk_hash" : "blake2:blabla..."?

@dholth
Contributor

dholth commented Mar 6, 2023

For anything crypto-adjacent I'd let the version # fix the exact hash used.

@baszalmstra
Contributor

In that case, since we can just change keys when changing versions anyway, let's name the key something with blake2. At least then it's clear from reading the file. You also have this in the repodata with sha2 or md5.

@dholth
Contributor

dholth commented Mar 8, 2023
