initial cep for repodata state #46
base: main
Conversation
Looks good! Looking forward to it!
cep-repodata-state.md (outdated)
{
  // we ensure that state.json and .json files are in sync by storing the file
  // last modified time in the state file, as well as the file size
  "file_mtime": {
Maybe you can store this in an ISO standard DateTime format instead? Similar to the has_zst.last_checked date. Or is that not precise enough when dealing with file times?
IMO the fractional time from os.stat().st_mtime would not be a problem; otherwise mtime_ns would be a decent name.
I would have to check how exactly the fractional time is computed though.
In [51]: datetime.datetime.fromtimestamp(1666095612162394000/1000000000).isoformat()
Out[51]: '2022-10-18T08:20:12.162394'
🤔
Python's datetime has microsecond resolution. You would always have to convert st_mtime to a datetime and then compare the two datetimes, instead of trying to convert the stored datetime to .timestamp() and comparing with st_mtime. datetime.isoformat shows an offset ('2022-11-12T00:39:55.564608+00:00'), and if you want Z that's string manipulation. But it is very nice that an ISO date is human readable, compared to a very long number.
In [39]: datetime.fromtimestamp(Path('Dockerfile').stat().st_mtime, tz=zoneinfo.
...: ZoneInfo('UTC')).resolution
Out[39]: datetime.timedelta(microseconds=1)
I think the fractional time is not great because it makes it appear as if it was nanosecond resolution while it isn't. And then we'd have to match the same behavior in C++ or any other language that wants to share the cache.
It seems more natural to store seconds / nanoseconds since UNIX epoch for the mtime since that is what Python does and we don't need to deal with timezones.
However, I am open to other suggestions as long as we can agree on the same behavior.
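For illustration, a minimal Python sketch of what storing and comparing raw epoch nanoseconds could look like (the mtime_ns / size field names here are placeholders, not the proposed spec):
import os

st = os.stat("repodata.json")
state = {"mtime_ns": st.st_mtime_ns, "size": st.st_size}  # plain integers, no timezones

# Later: the cached file is considered in sync with the state file iff the
# integers match exactly -- no float rounding or datetime conversion needed.
st_now = os.stat("repodata.json")
in_sync = state["mtime_ns"] == st_now.st_mtime_ns and state["size"] == st_now.st_size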
I don't know why a . would be a problem; however, a single nanoseconds-since-epoch number would be tolerable.
The reason is that we lose precision and a comparison will not be straightforward anymore:
https://docs.python.org/3/library/os.html#os.stat_result.st_ctime_ns
There's something extremely pedantic to be said about timestamps; on the other hand, comparing two st_mtime_ns values is really simple. I'm also happy to see Python's time.time_ns().
  // The header values as before
  "url": STRING,
  "etag": STRING,
  "mod": STRING,
Maybe change this to last_modified, similar to the header name.
These are the old names without a leading _
I am open to changing them, though.
Have implemented (nicer) non-underscored names in a conda branch. One thing about this is that when you start using alternative formats (.zst, .jlap), the remote headers last-modified, etag, cache-control come from the alternate file.
An example of the current jlap branch's state file:
{
"_url": "https://repo.anaconda.com/pkgs/main/osx-arm64/repodata.json",
"_mod": "Fri, 06 Jan 2023 20:09:20 GMT",
"_etag": "W/\"0134da379063a17831ef4ed73d3489dd\"",
"_cache_control": "public, max-age=30",
"have": "e45656091705b1be55b72a3b48068520cc3309560ca0d37fb96f2b2ea559c81f",
"have_hash": "e45656091705b1be55b72a3b48068520cc3309560ca0d37fb96f2b2ea559c81f",
"mtime": 1673274986.4032788,
"jlap": {
"headers": {
"date": "Mon, 09 Jan 2023 14:36:25 GMT",
"content-type": "text/plain",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"x-amz-id-2": "81qDgERSlA/bEpQQeL/YBn3BniAaB37uUkbD5ZySYC/h9JWb+8Sbg1ik70ufAvNtzTeGHqiwZHI=",
"x-amz-request-id": "Q1HHT9KCY69TQXQX",
"last-modified": "Fri, 06 Jan 2023 20:09:20 GMT",
"x-amz-version-id": "BaggXYx0RmtOxe4B6PIl3IDX_G8Ryt5X",
"etag": "W/\"0134da379063a17831ef4ed73d3489dd\"",
"cf-cache-status": "MISS",
"expires": "Mon, 09 Jan 2023 14:36:55 GMT",
"cache-control": "public, max-age=30",
"set-cookie": "__cf_bm=YfYijeZ0Y8xc_XBZebA1UA1bX9uz47v67b3guqZFfY0-1673274985-0-Ab9B1pIhPfM0fQkGk5rTS9A5vvc3tPD37jnV+pmXSI2C82sgdxdKaBCB3zhj4wQ6P1yVmaYRpioaARDTmt6H5s0=; path=/; expires=Mon, 09-Jan-23 15:06:25 GMT; domain=.anaconda.com; HttpOnly; Secure; SameSite=None",
"vary": "Accept-Encoding",
"server": "cloudflare",
"cf-ray": "786de7b3ffe9244d-ATL",
"content-encoding": "gzip"
},
"iv": "2f598f0587410d455c8370bef19759fd7a25b4aad4d65c3b3b1be7c7422a938c",
"pos": 1714976,
"footer": {
"url": "repodata.json",
"latest": "e45656091705b1be55b72a3b48068520cc3309560ca0d37fb96f2b2ea559c81f"
}
}
}
In the existing format …
Any chance you have time to make a PR against my branch with your change suggestions? Or I can also try to give you edit rights, if you want :) |
If we are going to play with nanoseconds, let's go ahead and replace all timestamps (except those that are web server headers) with those numbers, e.g. last_checked. We will quickly release a conda with the main last_modified, cache_control, etag state, but it will take us a few more releases to get to "last checked zstd". Do we standardize how environment locking works?
Environment locking as in conda-lock, or as in filesystem lockfiles to prevent overwriting things?
My reasoning for the different formats is that the mtime check is "precise" since we actually want to match the file on disk. For "last time checked zst" we just want a timestamp so we can check that it's been more than 2 weeks, and it doesn't need to be precise. We had this function around to create an RFC3339 string representation, so I just used that. We could also use nanoseconds, but this one doesn't need to be precise.
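For reference, a sketch of how such an RFC3339 last_checked value and the "more than 2 weeks" check could look in Python (the helper names are made up, not part of the CEP):
from datetime import datetime, timedelta, timezone

def now_rfc3339():
    # Second precision is enough here; swap the "+00:00" offset for a "Z" suffix.
    return datetime.now(timezone.utc).isoformat(timespec="seconds").replace("+00:00", "Z")

def is_stale(last_checked, max_age=timedelta(weeks=2)):
    # Accept the "Z" suffix by translating it back into an explicit offset.
    checked = datetime.fromisoformat(last_checked.replace("Z", "+00:00"))
    return datetime.now(timezone.utc) - checked > max_age

# e.g. is_stale("2023-01-27T02:09:53Z") compares against the current UTC time.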
I tried out micromamba's January release, and it downloads repodata.json.zst very quickly, producing the state file included below. In Python it is easier to store nanoseconds as a single number.
In the jlap branch I store "jlap_unavailable" as a timestamp, assuming you check alternative formats in a known order of preference unless you know they are 404's.
{
"cache_control": "public, max-age=30",
"etag": "\"a9c77cc4c9b1a947375d53326f1604de\"",
"file_mtime": {
"nanoseconds": 960132000,
"seconds": 1674678293
},
"file_size": 6974249,
"has_zst": {
"last_checked": "2023-01-27T02:09:53Z",
"value": true
},
"mod": "Wed, 25 Jan 2023 19:54:35 GMT",
"url": "https://repo.anaconda.com/pkgs/main/osx-arm64/repodata.json.zst"
}
Am working on a "lock byte 21" implementation like mamba, where we lock the file at a fixed byte offset. On Windows, it might be more appropriate to lock and overwrite repodata.json, instead of the Unix style of atomically moving a tempfile on top of the desired file (on Windows, you cannot move a file on top of an existing file; you have to delete the existing file first).
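A sketch of what locking a single byte at a fixed offset could look like on POSIX, assuming the offset 21 from the wording above (Windows would need LockFileEx or msvcrt.locking instead; none of this is the final mechanism):
import fcntl
from contextlib import contextmanager

@contextmanager
def lock_byte(path, offset=21):
    # Advisory exclusive lock on exactly one byte; cooperating processes
    # using the same convention will block until it is released.
    with open(path, "a+b") as f:
        fcntl.lockf(f, fcntl.LOCK_EX, 1, offset)
        try:
            yield
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN, 1, offset)

# with lock_byte("repodata.state.json"):
#     ...rewrite repodata.json and its state file...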
You might also run into issues with atomic renames if your temporary file is on a different fs.
The temporary file is in the same cache folder.
@dholth Does your implementation also produce a
I agree with this. I would keep it, though, to ensure hash collisions don't become an issue. I also like that I can find the original URL from the hash. Can we maybe formalize that in the CEP, @wolfv?
@dholth What did you end up calling these? I especially think the …
For example, I've been working in this branch; the link should take you to conda/gateways/repodata/__init__.py with code to handle the ISO timestamps. The RepodataState class should clearly show the current format. From my reading of the mamba code, it locks a certain byte in the repodata.json / repodata.state.json files. I didn't notice it creating a separate lock file.
This is not an ideal approach as it modifies the `repodata.json` file and corrupts e.g. the hash of the file. Also, the repodata files have gotten increasingly large, and parsing these state values can require parsing a large `json` file.
Therefore we propose to store the metadata in a secondary file called `.state.json` next to the repodata.
I've been reviewing the conda implementation in conda/conda#12425 and have stumbled over the state term there a few times. I don't think "state" correctly describes what's in the proposed file, as its content is mostly derived from stat-like calls -- and stat, for better or worse, refers to "status", not the rather ambiguous term "state". This might be confusing for future maintainers. I know this sounds like nitpicking, but before we're shipping this in conda, I'd rather mention it than regret it later.
We have size and mtime_ns from stat (short for status) calls; last-modified and etag headers for caching; "is some number of alternative remote files available"; possibly a content hash of the full file; an intermediate hash and position to start fetching an update to the remote jlap file, if that is being used; and the time that the remote server was last checked for new content.
I'm fine either way. 👍
IMO we are using 5 meaningless letters, "state". We could make it even more meaningless (and nicely shorter) with ".s.json", or we could roll the dice and call it os.urandom(2).hex() == 'ec5a', plus .json.
What about something that does make sense? .cache-info.json, perhaps?
Plain info.json would be short.
https://quotesondesign.com/phil-karlton/
That's fine by me!
{
  // we ensure that state.json and .json files are in sync by storing the file
  // last modified time in the state file, as well as the file size
I strongly suggest adding a new version entry to the format, to be able to evolve it if needed.
  // version of the repodata status format
  "version": 1,
If I had to release a version 2, I would say all files without a "version" key are version 1.
That's also fine with me!
I think I agree with Jannis to add version: 1, and an older conda should ignore a newer .state.json.
@baszalmstra our implementation assumes all the keys are optional, same as if the file was missing. We also treat the keys as missing if the recorded mtime/size no longer match the file on disk.
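A sketch of that tolerant reading behaviour under the assumptions discussed here: a missing "version" key means version 1, every key is optional, and the whole state is discarded when the recorded stat values no longer match the cached file (field names follow the micromamba example above and are not final):
import json
import os

def load_state(state_path, repodata_path):
    # An unreadable or unparsable state file is treated the same as a missing one.
    try:
        with open(state_path) as f:
            state = json.load(f)
    except (OSError, ValueError):
        return {}
    if state.get("version", 1) != 1:  # newer, unknown version: ignore it
        return {}
    try:
        st = os.stat(repodata_path)
    except OSError:
        return {}
    mtime = state.get("file_mtime", {})
    recorded_ns = mtime.get("seconds", 0) * 1_000_000_000 + mtime.get("nanoseconds", 0)
    if recorded_ns != st.st_mtime_ns or state.get("file_size") != st.st_size:
        return {}  # out of sync: treat every key as missing
    return state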
Our state data structure in rattler looks like this: … You may note that besides checking the timestamp and the size, we also check the hash. I like the idea of having all keys be optional. Currently, the … We also create an extra lockfile (…).
I was preparing to add the hash as well, at least for jlap. I have two hash fields called …
I assume mamba is also trying to lock whole directories. The development conda implementation also does the trick of writing the new repodata.json to a new file, stat'ing the temporary name, and moving it on top of the old file. We are always using BLAKE2(256) hashes, which are the same length as sha-256 hashes, but "by default" blake2 produces an overkill 512-bit hash.
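For illustration, a 256-bit BLAKE2 digest in Python only needs an explicit digest_size, since hashlib's blake2b defaults to 64 bytes (512 bits); the helper name here is made up:
import hashlib

def blake2_256(path, chunk_size=2 ** 18):
    # 32-byte digest: the hex string is then the same length as a sha-256 digest.
    h = hashlib.blake2b(digest_size=32)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()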
Yeah, Rattler does the same (on Windows we use some Win32 API to achieve this).
Still, it might be nice to include the algorithm used in either the key or value. It doesn't add overhead but makes it easier for …
For anything crypto-adjacent, I'd let the version # fix the exact hash used.
In that case, since we can just change keys when changing versions anyway, let's name the key something with the algorithm in it.
Parametrize important key names: https://github.com/conda/conda/pull/12461/files#diff-813ca3bd61f56355fb3ea7c560d18b892d42d62a07fa0e78666dcfa50c5fda13R64
Created a quick CEP for the new repodata state format (cc @dholth, as we had some discussions about this; happy to list you as author!).
Also, regarding the spec, I'm happy to make changes. My hope is just that mamba and conda can both share the same format.