initial cep for repodata state #46

wolfv · 2023-01-09T09:20:09Z

Created a quick CEP for the new repodata state format (cc @dholth as we had some discussions about this. Happy to list you as author!).

Also, I'm happy to make changes to the spec. My hope is just that mamba and conda can both share the same format.

baszalmstra

Looks good! Looking forward to it!

baszalmstra · 2023-01-09T12:26:35Z

cep-repodata-state.md

+{
+    // we ensure that state.json and .json files are in sync by storing the file
+    // last modified time in the state file, as well as the file size
+    "file_mtime": {


Maybe you can store this in an ISO standard DateTime format instead? Similar to the has_zst.last_checked date. Or is that not precise enough when dealing with file times?

IMO the fractional time from os.stat().st_mtime would not be a problem; otherwise, mtime_ns would be a decent name.

I would have to check how exactly the fractional time is computed though.

In [51]: datetime.datetime.fromtimestamp(1666095612162394000/1000000000).isoformat()
Out[51]: '2022-10-18T08:20:12.162394'

🤔

Python's datetime has microsecond resolution. You would always have to convert st_mtime to a datetime, and then compare the two datetimes instead of trying to convert the stored datetime to .timestamp() and compare with st_mtime. datetime.isoformat shows an offset '2022-11-12T00:39:55.564608+00:00' and if you want Z that's string manipulation. But it is very nice that an iso date is human readable, compared to a very long number.

In [39]: datetime.fromtimestamp(Path('Dockerfile').stat().st_mtime, tz=zoneinfo.ZoneInfo('UTC')).resolution
Out[39]: datetime.timedelta(microseconds=1)

I think the fractional time is not great because it makes it appear as if it was nanosecond resolution while it isn't. And then we'd have to match the same behavior in C++ or any other language that wants to share the cache.

It seems more natural to store seconds / nanoseconds since UNIX epoch for the mtime since that is what Python does and we don't need to deal with timezones.

However, I am open to other suggestions as long as we can agree on the same behavior.

I don't know why a . is a problem; however, a single nanoseconds-since-epoch number would be tolerable.

The reason is that we lose precision and a comparison will not be straightforward anymore:

https://docs.python.org/3/library/os.html#os.stat_result.st_ctime_ns

https://ciaranm.wordpress.com/2009/11/15/this-week-in-python-stupidity-os-stat-os-utime-and-sub-second-timestamps/

There's something extremely pedantic to be said about timestamps; on the other hand comparing two st_mtime_ns is really simple. I'm also happy to see Python's time.time_ns()
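To make the precision argument concrete, here is a small sketch. The timestamp is the one from the session above with three arbitrary nanosecond digits appended, and the helper name `same_mtime` is only illustrative. It suggests that a float `st_mtime` cannot tell neighboring nanosecond values apart, while integer `st_mtime_ns` values compare exactly.

```python
import os
import tempfile

# Example nanosecond timestamp; the trailing digits are made up for illustration.
ns = 1666095612162394123

# Floats near 1.6e9 seconds are spaced roughly 238 ns apart, so many distinct
# nanosecond timestamps collapse onto the same float value.
distinct_floats = {n / 1e9 for n in range(ns, ns + 1000)}


def same_mtime(path, stored_mtime_ns):
    """Exact integer comparison against the file's current mtime."""
    return os.stat(path).st_mtime_ns == stored_mtime_ns


with tempfile.NamedTemporaryFile() as f:
    stored = os.stat(f.name).st_mtime_ns   # what a state file would record
    unchanged = same_mtime(f.name, stored)  # no conversion, no rounding
```

A thousand consecutive nanosecond values map to only a handful of distinct floats here, which is the loss of precision being discussed; the integer comparison avoids the issue entirely.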

baszalmstra · 2023-01-09T12:30:43Z

cep-repodata-state.md

+    // The header values as before
+    "url": STRING,
+    "etag": STRING,
+    "mod": STRING,


Maybe change this to last_modified similar to the header name.

These are the old names without a leading _

I am open to change them, though.

Have implemented (nicer) non-underscored names in a conda branch.

dholth · 2023-01-09T14:17:34Z

One thing about this is that when you start using alternative formats (.zst, .jlap) the remote headers last-modified, etag, cache-control come from the alternate file.

dholth · 2023-01-09T14:41:55Z

An example of the current jlap branch's .state.json.

have is the nominal hash (what the hash of the original repodata.json was according to jlap)

have_hash is the actual hash on disk since we don't serialize with exactly the same sorting, formatting as conda-index. Could be used instead of mtime (if file on disk doesn't match have_hash, then it doesn't correspond to this state.json)

jlap includes too many headers from the jlap request, an intermediate hash iv corresponding to pos- bytes in the file, and the last line of the jlap file.

{
 "_url": "https://repo.anaconda.com/pkgs/main/osx-arm64/repodata.json",
 "_mod": "Fri, 06 Jan 2023 20:09:20 GMT",
 "_etag": "W/\"0134da379063a17831ef4ed73d3489dd\"",
 "_cache_control": "public, max-age=30",
 "have": "e45656091705b1be55b72a3b48068520cc3309560ca0d37fb96f2b2ea559c81f",
 "have_hash": "e45656091705b1be55b72a3b48068520cc3309560ca0d37fb96f2b2ea559c81f",
 "mtime": 1673274986.4032788,
 "jlap": {
  "headers": {
   "date": "Mon, 09 Jan 2023 14:36:25 GMT",
   "content-type": "text/plain",
   "transfer-encoding": "chunked",
   "connection": "keep-alive",
   "x-amz-id-2": "81qDgERSlA/bEpQQeL/YBn3BniAaB37uUkbD5ZySYC/h9JWb+8Sbg1ik70ufAvNtzTeGHqiwZHI=",
   "x-amz-request-id": "Q1HHT9KCY69TQXQX",
   "last-modified": "Fri, 06 Jan 2023 20:09:20 GMT",
   "x-amz-version-id": "BaggXYx0RmtOxe4B6PIl3IDX_G8Ryt5X",
   "etag": "W/\"0134da379063a17831ef4ed73d3489dd\"",
   "cf-cache-status": "MISS",
   "expires": "Mon, 09 Jan 2023 14:36:55 GMT",
   "cache-control": "public, max-age=30",
   "set-cookie": "__cf_bm=YfYijeZ0Y8xc_XBZebA1UA1bX9uz47v67b3guqZFfY0-1673274985-0-Ab9B1pIhPfM0fQkGk5rTS9A5vvc3tPD37jnV+pmXSI2C82sgdxdKaBCB3zhj4wQ6P1yVmaYRpioaARDTmt6H5s0=; path=/; expires=Mon, 09-Jan-23 15:06:25 GMT; domain=.anaconda.com; HttpOnly; Secure; SameSite=None",
   "vary": "Accept-Encoding",
   "server": "cloudflare",
   "cf-ray": "786de7b3ffe9244d-ATL",
   "content-encoding": "gzip"
  },
  "iv": "2f598f0587410d455c8370bef19759fd7a25b4aad4d65c3b3b1be7c7422a938c",
  "pos": 1714976,
  "footer": {
   "url": "repodata.json",
   "latest": "e45656091705b1be55b72a3b48068520cc3309560ca0d37fb96f2b2ea559c81f"
  }
 }
}

dholth · 2023-01-10T13:54:06Z

In the existing format "_url": "https://conda.anaconda.org/conda-forge/osx-arm64", doesn't contain the filename. May want to continue doing that especially since different repodata variants matter. Or ignore the field.

wolfv · 2023-01-10T14:45:51Z

Any chance you have time to make a PR against my branch with your change suggestions? Or I can also try to give you edit rights, if you want :)

dholth · 2023-01-10T15:01:37Z

@wolfv do you mean wolfv#1

Cep repodata state

dholth · 2023-01-11T22:31:57Z

If we are going to play with nanoseconds, let's go ahead and replace all timestamps (except those that are web server headers) with those numbers. e.g. last_checked.

We will quickly release a conda with the main last_modified, cache_control, etag state but it will take us a few more releases to get to "last checked zstd"

Do we standardize how environment locking works?

wolfv · 2023-01-12T09:19:47Z

environment locking as in conda-lock or as in filesystem lockfiles to prevent overwriting things?

wolfv · 2023-01-12T09:21:52Z

My reasoning for the different formats is that for the mtime checking it is "precise" since we actually want to match the file on disk.

For the "last time checked zst" we just want a timestamp so we can check that it's been more than 2 weeks, and it doesn't need to be precise. We had this function around to create an RFC 3339 string representation, so I just used that. We could also use nanoseconds, but this one doesn't need to be precise.
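A rough sketch of that "has it been more than 2 weeks" check, assuming the timestamp is stored as an RFC 3339 string in UTC with a trailing `Z`. The function names and the sample timestamps in the usage note are illustrative, not taken from mamba or conda.

```python
from datetime import datetime, timedelta, timezone

RECHECK_AFTER = timedelta(days=14)  # the "2 weeks" from the comment above


def parse_rfc3339_utc(text):
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def should_recheck(last_checked, now=None):
    """True when the stored check time is older than RECHECK_AFTER."""
    now = now or datetime.now(timezone.utc)
    return now - parse_rfc3339_utc(last_checked) > RECHECK_AFTER
```

For example, `should_recheck("2023-01-27T02:09:53Z", now=datetime(2023, 2, 15, tzinfo=timezone.utc))` would ask for a re-check, since 19 days have passed. Second-level precision is plenty for this purpose.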

dholth · 2023-01-25T20:40:17Z

I tried out micromamba's January release, and it downloads repodata.json.zst very quickly, producing the state file included below.

In Python it is easier to store nanoseconds as a single number, time.time_ns().bit_length() is only 61 bits today.

In [5]: datetime.datetime.fromtimestamp(2**64//1e9)
Out[5]: datetime.datetime(2554, 7, 21, 19, 34, 33)

In the jlap branch I store "jlap_unavailable" as a timestamp, assuming you check alternative formats in a known order of preference unless you know they are 404's.

{
    "cache_control": "public, max-age=30",
    "etag": "\"a9c77cc4c9b1a947375d53326f1604de\"",
    "file_mtime": {
        "nanoseconds": 960132000,
        "seconds": 1674678293
    },
    "file_size": 6974249,
    "has_zst": {
        "last_checked": "2023-01-27T02:09:53Z",
        "value": true
    },
    "mod": "Wed, 25 Jan 2023 19:54:35 GMT",
    "url": "https://repo.anaconda.com/pkgs/main/osx-arm64/repodata.json.zst"
}
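As a sketch of how a reader of this file might decide whether it still describes the cached repodata, assuming the `file_mtime` / `file_size` layout shown above (the helper names `describe_file` and `state_matches_file` are hypothetical):

```python
import os
import tempfile

NS_PER_S = 1_000_000_000


def describe_file(path):
    """Build the file_mtime / file_size part of a state dict from a stat call."""
    st = os.stat(path)
    seconds, nanoseconds = divmod(st.st_mtime_ns, NS_PER_S)
    return {
        "file_mtime": {"seconds": seconds, "nanoseconds": nanoseconds},
        "file_size": st.st_size,
    }


def state_matches_file(state, path):
    """Return True only if the recorded mtime and size match the file on disk."""
    mtime = state.get("file_mtime")
    size = state.get("file_size")
    if mtime is None or size is None:
        return False  # missing keys are treated like a missing state file
    recorded_ns = mtime["seconds"] * NS_PER_S + mtime["nanoseconds"]
    st = os.stat(path)
    return st.st_mtime_ns == recorded_ns and st.st_size == size


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "repodata.json")
    with open(path, "wb") as f:
        f.write(b"{}")
    state = describe_file(path)
    fresh = state_matches_file(state, path)
    stale = state_matches_file({**state, "file_size": state["file_size"] + 1}, path)
```

Splitting into seconds and nanoseconds keeps both numbers small, while recombining them into one integer makes the comparison against `st_mtime_ns` a single equality test.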

dholth · 2023-02-01T17:11:48Z

Am working on a "lock byte 21" implementation like mamba; where we lock the .state.json before doing anything else (even before reading or stat'ing cached repodata.json), and .state.json is the only lockfile. Then we always try to keep the lock for as short a time as possible, e.g. download repodata.json to a temp file, then stat it, then move it on top of the cache filename. This locking is only for the integrity of the repodata.json cache, not per-directory locking to prevent package cache overwrites etc.

On Windows, it might be more appropriate to lock and overwrite repodata.json, instead of the unix style of atomically moving a tempfile on top of the desired file (on Windows, you cannot move a file on top of an existing file; you have to delete the existing file first)

wolfv · 2023-02-02T07:22:26Z

you might also run into issues with atomic renames if your temporary file is on a different fs. /tmp is often a different fs :) So I'd stat it after moving.

dholth · 2023-02-02T12:43:59Z

The temporary file is in the same cache folder.
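The write-then-move sequence discussed above could look roughly like this. It is a sketch under assumptions, not the conda or mamba code: the function name is made up, locking is left out, and the temporary file is created in the cache folder itself so the move stays on one filesystem. Python documents `os.replace` as overwriting an existing destination file, which is why it is used here instead of a delete-then-rename.

```python
import os
import tempfile


def replace_cached_file(cache_dir, name, data):
    """Write data to a temp file in cache_dir, stat it, then move it into place."""
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, prefix=name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        st = os.stat(tmp_path)  # stat before the move, as described above
        os.replace(tmp_path, os.path.join(cache_dir, name))
    except BaseException:
        # Do not leave a stray temp file behind if anything failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return st.st_mtime_ns, st.st_size


with tempfile.TemporaryDirectory() as cache_dir:
    replace_cached_file(cache_dir, "repodata.json", b"{}")
    mtime_ns, size = replace_cached_file(cache_dir, "repodata.json", b'{"packages": {}}')
    final = os.stat(os.path.join(cache_dir, "repodata.json"))
    leftover = [n for n in os.listdir(cache_dir) if n.endswith(".tmp")]
```

Because a rename within one filesystem does not rewrite the file, the values stat'ed on the temporary name are expected to still describe the file after the move.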

baszalmstra · 2023-02-14T14:49:38Z

@dholth Does your implementation also produce a <hash>.lock file (like mamba) or does the <hash>.state.json file also function as a lock file? I guess that would also make a lot of sense. Would you be able to formalize that in the CEP @wolfv ?

> In the existing format "_url": "https://conda.anaconda.org/conda-forge/osx-arm64", doesn't contain the filename. May want to continue doing that especially since different repodata variants matter. Or ignore the field.

I agree with this. I would keep it though to ensure hash collisions don't form an issue. I also like that I can find the original URL from the hash. Can we maybe formalize that in the CEP @wolfv ?

> Have implemented (nicer) non-underscored names in a conda branch.

@dholth What did you end up calling these? I especially think the mod could just be named last_modified, similar to the HTTP header from where it comes. Also, shouldn't a few of these fields be optional, given that not all HTTP servers return these headers. WDYT @wolfv ?
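One way to picture optional header fields is the sketch below. The key names `last_modified`, `etag`, and `cache_control` follow the names mentioned earlier in this thread, but treat them as assumptions until the CEP fixes them; both helper functions are hypothetical.

```python
HEADER_TO_KEY = (
    ("etag", "etag"),
    ("last-modified", "last_modified"),
    ("cache-control", "cache_control"),
)


def state_from_headers(url, headers):
    """Record only the caching headers the server actually sent."""
    lowered = {k.lower(): v for k, v in headers.items()}
    state = {"url": url}
    for header, key in HEADER_TO_KEY:
        if header in lowered:
            state[key] = lowered[header]
    return state


def conditional_headers(state):
    """Build request headers for a conditional GET from whatever is stored."""
    headers = {}
    if "etag" in state:
        headers["If-None-Match"] = state["etag"]
    if "last_modified" in state:
        headers["If-Modified-Since"] = state["last_modified"]
    return headers
```

A server that sends no `ETag` simply yields a state without that key, and the next request then falls back to `If-Modified-Since` alone, or to an unconditional request if neither value is known.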

dholth · 2023-02-14T14:58:23Z

For example, I've been working in this branch; the link should take you to conda/gateways/repodata/__init__.py, which has code to handle the ISO timestamps. The RepodataState class should clearly show the current format.

From my reading of the mamba code, it locks a certain byte in the repodata.json / repodata.state.json files. I didn't notice it creating a .lock file although that is maybe a more old-school technique? and necessary for locking a complete directory - I haven't attempted to lock complete directories in my branch. So far I am only trying to maintain the integrity of the repodata.json cache and don't do anything to prevent e.g. parallel conda's downloading it twice, they simply won't corrupt the cache. Feedback & PR's against PR's welcome.
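For readers unfamiliar with byte-range locks, a minimal POSIX-only sketch follows. It assumes the byte offset 21 mentioned earlier in this thread and uses Python's `fcntl.lockf`; the context manager name is made up, and a Windows implementation would need a different API (for instance `msvcrt.locking`), which is not shown.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

LOCK_BYTE = 21  # offset taken from the discussion; treat it as an assumption


@contextmanager
def locked_state(path):
    """Hold an exclusive lock on one byte of the state file while the block runs."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Lock a single byte at LOCK_BYTE; the byte does not need to exist yet.
        fcntl.lockf(fd, fcntl.LOCK_EX, 1, LOCK_BYTE, os.SEEK_SET)
        try:
            yield fd
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN, 1, LOCK_BYTE, os.SEEK_SET)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as d:
    state_path = os.path.join(d, "example.state.json")
    with locked_state(state_path) as fd:
        held = isinstance(fd, int)
    created = os.path.exists(state_path)
```

The lock only protects cooperating processes that take the same byte-range lock; it does nothing to stop a program that ignores it, and it says nothing about locking whole directories.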

jezdez · 2023-03-02T17:16:33Z

cep-repodata-state.md

+
+This is not an ideal approach as it modifies the `repodata.json` file and corrupts e.g. the hash of the file. Also, the repodata files have gotten increasingly large, and parsing these state values can require parsing a large `json` file.
+
+Therefore we propose to store the metadata in a secondary file called `.state.json` file next to the repodata.


I've been reviewing the conda implementation in conda/conda#12425 and have stumbled over the state term there a few times.

I don't think "state" correctly describes what's in the proposed file as its content is mostly derived from stat-like calls -- and stat for better or worse refers to "status" and not really the ambiguous term "state". This might be confusing for future maintainers.

I know this sounds like nitpicking, but before we're shipping this in conda, I'd rather mention it before we regret it.

We have size and mtime_ns from stat (short for status) calls; last-modified and etag headers for caching; "is some number of alternative remote files available", possibly a content hash of the full file; an intermediate hash and position to start fetching an update to the remote jlap file if that is being used; and the time that the remote server was last checked for new content.

I'm fine either way. 👍

IMO we are using 5 meaningless letters "state", we could make it even more meaningless ".s.json" (and nicely shorter), we could roll the dice and call it >>> os.urandom(2).hex() 'ec5a'.json

What about something that does make sense? .cache-info.json perhaps?

plain info.json would be short

https://quotesondesign.com/phil-karlton/

That's fine by me!

jezdez · 2023-03-02T17:18:07Z

cep-repodata-state.md

+{
+    // we ensure that state.json and .json files are in sync by storing the file
+    // last modified time in the state file, as well as the file size
+


I strongly suggest adding a new version entry to the format, to be able to evolve it if needed.

Suggested change

// version of the repodata status format

"version": 1,

If I had to release a version 2, I would say all files without a "version" key are version 1.

That's also fine for me!

I think I agree with Jannis to add version: 1, and an older conda should ignore newer .state.json
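The versioning rule proposed in this thread (files without a "version" key count as version 1, and an older reader ignores newer files) might be sketched as follows. `load_state` and `SUPPORTED_VERSION` are illustrative names, and returning an empty dict for anything unreadable is an assumption chosen to match the "same as if the file was missing" behavior.

```python
import json

SUPPORTED_VERSION = 1


def load_state(text):
    """Parse a state file; return {} when it is unreadable or too new for us."""
    try:
        state = json.loads(text)
    except json.JSONDecodeError:
        return {}
    if not isinstance(state, dict):
        return {}
    # Files written before the key existed are treated as version 1.
    version = state.get("version", 1)
    if not isinstance(version, int) or version > SUPPORTED_VERSION:
        return {}
    return state
```

With this shape, a future version 2 file makes an older client fall back to a full, unconditional download rather than misreading fields whose meaning may have changed.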

dholth · 2023-03-02T18:20:16Z

@baszalmstra our implementation assumes all the keys are optional, same as if the file was missing. We also treat the keys as missing if state.json doesn't match the cached repodata.json file. We could formally add that to the CEP.

baszalmstra · 2023-03-06T21:03:51Z

Our state data structure in rattler looks like this.

You may note that besides checking the timestamp and the size, we also check the repodata.json against a blake2 hash. If the timestamp and the size don't match but the blake2 hash still matches, we consider the data to be up-to-date.

I like the idea of having all keys be optional. Currently, the mtime_ns, size, and url in our implementation are not optional.

We also create an extra lockfile (.lock) that guards both the repodata.json and the state.json file. From observing its behavior, I think this is also what mamba does. But looking at the code, mamba creates several lock files on several different files throughout the process.

dholth · 2023-03-06T21:06:02Z

I was preparing to add the hash as well, at least for jlap. I have two hash fields called NOMINAL_HASH = "nominal_hash" and ON_DISK_HASH = "actual_hash". nominal_hash is the hash according to the .jlap file and actual_hash is the hash after we json.dumps the updated data.

dholth · 2023-03-06T21:16:32Z

I assume mamba is also trying to lock whole directories.

The development conda implementation also does the trick of writing the new repodata.json to a new file, stat'ing the temporary name, and moving it on top of the old file.

We are always using BLAKE2 with a 256-bit digest, which is the same length as a sha-256 hash, but "by default" blake2 produces an overkill 512-bit hash.
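The difference between the nominal and the on-disk hash, and the 256-bit digest size, can be illustrated with Python's `hashlib.blake2b`. This is only a sketch: the sample JSON and the `indent=2, sort_keys=True` serialization are invented for the example and are not claimed to be what conda or conda-index actually use.

```python
import hashlib
import json


def blake2_256(data: bytes) -> str:
    # digest_size=32 gives a 256-bit digest; hashlib.blake2b defaults to 64 bytes.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


# Bytes as the server would have produced them (made-up sample content).
served = b'{"packages":{},"info":{"subdir":"noarch"}}'
nominal_hash = blake2_256(served)

# The same data after a local parse and re-serialization with other formatting.
reserialized = json.dumps(json.loads(served), indent=2, sort_keys=True).encode("utf-8")
actual_hash = blake2_256(reserialized)
```

Both byte strings decode to the same JSON document, yet the hashes differ, which is why a client that rewrites the file locally needs to track the hash of what is actually on disk separately from the hash the remote advertises.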

baszalmstra · 2023-03-06T21:24:07Z

> The development conda implementation also does the trick of writing the new repodata.json to a new file, stat'ing the temporary name, and moving it on top of the old file.

Yeah Rattler does the same. (On windows we use some Win32 API to achieve this).

> We are always using BLAKE2(256) which are the same length as sha-256 hashes, but "by default" blake2 produces an overkill 512-bit hash.

Still, it might be nice to include the algorithm used in either the key or value. It doesn't add overhead but makes it easier for others to deduce what's going on. We could also do "on_disk_hash" : "blake2:blabla..."?

dholth · 2023-03-06T21:25:25Z

For anything crypto-adjacent I'd let the version # fix the exact hash used.

baszalmstra · 2023-03-06T21:28:25Z

In that case, since we can just change keys when changing versions anyway, let's name the key something with blake2. At least then it's clear from reading the file. You also have this in the repodata with sha2 or md5.

dholth · 2023-03-08T16:22:46Z

Parametrize important key names https://github.com/conda/conda/pull/12461/files#diff-813ca3bd61f56355fb3ea7c560d18b892d42d62a07fa0e78666dcfa50c5fda13R64

wolfv added 2 commits January 9, 2023 10:18

initial cep for repodata state

e00bef5

fix syntax highlighting

33180fc

baszalmstra reviewed Jan 9, 2023

proposed edits

d34ea8d

Merge pull request #1 from dholth/cep-repodata-state

c016968

Cep repodata state

dholth mentioned this pull request Jan 12, 2023

Repodata state format per draft CEP conda/conda#12241

Closed

3 tasks

wolfv mentioned this pull request Jan 17, 2023

repodata.zst – improvements mamba-org/mamba#2231

Open

2 tasks

baszalmstra mentioned this pull request Feb 16, 2023

feat: download and cache repodata.json conda/rattler#55

Merged

3 tasks

dholth mentioned this pull request Mar 2, 2023

Features/repodata jlap minus additional test changes conda/conda#12425

Closed

3 tasks

jezdez reviewed Mar 2, 2023

jaimergp mentioned this pull request Mar 6, 2023

libmamba Could not parse state file: Could not load cache state conda/conda-libmamba-solver#145

Closed

2 tasks

