initial cep for repodata state #46

Open
wants to merge 4 commits into
base: main
84 changes: 84 additions & 0 deletions cep-repodata-state.md
<table>
<tr><td> Title </td><td> .state.json files for repodata metadata </td></tr>
<tr><td> Status </td><td> Draft </td></tr>
<tr><td> Author(s) </td><td> Wolf Vollprecht &lt;[email protected]&gt;</td></tr>
<tr><td> Created </td><td> Jan 09, 2023</td></tr>
<tr><td> Updated </td><td> Jan 09, 2023</td></tr>
<tr><td> Discussion </td><td> https://conda.slack.com/archives/C017F7C0VM3/p1672669131100819 </td></tr>
<tr><td> Implementation </td><td> https://github.com/mamba-org/mamba/pull/2113 </td></tr>
</table>

## Abstract

This CEP changes how conda and mamba store metadata about `repodata.json` downloads.

### Motivation

When conda currently downloads `repodata.json` files from the internet, it stores metadata "inside" the file by adding some JSON keys:

- `_url`: The URL that was requested
- `_etag`: ETag returned from server
- `_mod`: Last-Modified header from server
- `_cache_control`: Cache-Control header from server

These are stored as four string values.

This is not an ideal approach, as it modifies the `repodata.json` file and therefore changes, for example, the hash of the file. Also, the repodata files have grown increasingly large, and reading these state values can require parsing a large JSON file.

Therefore, we propose to store the metadata in a secondary file called `.state.json`, next to the repodata.

Review comment (Member): I've been reviewing the conda implementation in conda/conda#12425 and have stumbled over the state term there a few times.

I don't think "state" correctly describes what's in the proposed file as its content is mostly derived from stat-like calls -- and stat for better or worse refers to "status" and not really the ambiguous term "state". This might be confusing for future maintainers.

I know this sounds like nitpicking, but before we're shipping this in conda, I'd rather mention it before we regret it.

Review comment (Contributor): We have size and mtime_ns from stat (short for status) calls; last-modified and etag headers for caching; "is some number of alternative remote files available", possibly a content hash of the full file; an intermediate hash and position to start fetching an update to the remote jlap file if that is being used; and the time that the remote server was last checked for new content.

Review comment (Contributor): I'm fine either way. 👍

Review comment (Contributor): IMO we are using 5 meaningless letters "state", we could make it even more meaningless ".s.json" (and nicely shorter), we could roll the dice and call it >>> os.urandom(2).hex() 'ec5a'.json

Review comment (Contributor): What about something that does make sense? .cache-info.json perhaps?

Review comment (Contributor): plain info.json would be short

Another motivating factor is that the `jlap` proposal requires (repeatedly) computing the hash of the `repodata.json` file; this only gives correct results directly when the metadata is stored outside the file.
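
The effect on the hash can be illustrated with a small sketch. The payload, URL, and ETag below are made-up placeholders, not values from any real channel; the point is only that injecting `_`-prefixed keys changes the bytes on disk, while a separate metadata file leaves them untouched.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-in for the bytes a server would deliver.
served = json.dumps({"info": {"subdir": "noarch"}, "packages": {}}).encode()

# Legacy approach: cache metadata is injected into the document itself,
# so the cached file no longer has the same bytes as the served file.
legacy = json.loads(served)
legacy["_url"] = "https://example.invalid/noarch/repodata.json"  # placeholder
legacy["_etag"] = 'W/"abc"'  # placeholder
legacy_bytes = json.dumps(legacy).encode()

served_hash = sha256_hex(served)
legacy_hash = sha256_hex(legacy_bytes)

# Proposed approach: the cached file is stored unmodified, so hashing it
# reproduces the hash of the served content.
sidecar_hash = sha256_hex(served)

print(served_hash == legacy_hash)   # False
print(served_hash == sidecar_hash)  # True
```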

Both mamba and conda currently use the same cache folder. If they continue to share the repodata cache without implementing the same storage strategy, the result would be frequent cache busting.

### Specification

```json5
{
// we ensure that state.json and .json files are in sync by storing the file
// last modified time in the state file, as well as the file size

// Review comment (Member): I strongly suggest adding a new version entry to the
// format, to be able to evolve it if needed. Suggested change:
//   // version of the repodata status format
//   "version": 1,
// Review comment (Contributor, @dholth, Mar 2, 2023): If I had to release a
// version 2, I would say all files without a "version" key are version 1.
// Review comment (Contributor): That's also fine for me!
// Review comment (Contributor): I think I agree with Jannis to add version: 1,
// and an older conda should ignore newer .state.json

// nanoseconds since the UNIX epoch (1970-01-01)
"mtime_ns": INTEGER,
"size": INTEGER, // file size in bytes

// The header values as before
"url": STRING,
"etag": STRING,
"mod": STRING,
// Review comment (Contributor): Maybe change this to last_modified similar to the header name.
// Review comment (Contributor): These are the old names without a leading _
// Review comment (Contributor Author): I am open to change them, though.
// Review comment (Contributor): Have implemented (nicer) non-underscored names in a conda branch.
"cache_control": STRING,

// these are alternative encodings of the repodata.json that
// can be used for faster downloading
// both `has_zst` and `has_jlap` keys are optional but should be kept
// even if the other data times out or `mtime_ns` does not match
"has_zst": {
// UTC RFC 3339 timestamp of when we last checked whether the file is available or not
// in this case the `repodata.json.zst` file
// Note: same format as conda TUF spec
// Python's time.time_ns() would be convenient?
"last_checked": "2023-01-08T11:45:44Z",
// false = unavailable, true = available
"value": BOOLEAN
},
"has_jlap": {
// same format as `has_zst`
},

"jlap": { } // unspecified additional state for jlap when available
}
```
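
As a rough illustration of the layout above, the following sketch writes a state file for a cached repodata file. It is not the conda or mamba implementation: the function names, the `repodata.state.json` file name, and the header values are assumptions made for this example, and the key names follow the draft as written here (`mod` rather than a renamed `last_modified`).

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def utc_timestamp() -> str:
    """UTC timestamp with second precision, e.g. 2023-01-08T11:45:44Z."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def write_state(repodata_path: Path, state_path: Path,
                headers: dict, has_zst: bool) -> dict:
    """Record stat info, header values, and zst availability (hypothetical helper)."""
    st = os.stat(repodata_path)
    state = {
        "mtime_ns": st.st_mtime_ns,  # used to detect an out-of-sync file
        "size": st.st_size,
        "url": headers["url"],
        "etag": headers["etag"],
        "mod": headers["mod"],
        "cache_control": headers["cache_control"],
        "has_zst": {"last_checked": utc_timestamp(), "value": has_zst},
    }
    state_path.write_text(json.dumps(state, indent=2))
    return state


with tempfile.TemporaryDirectory() as tmp:
    repodata = Path(tmp) / "repodata.json"
    state_file = Path(tmp) / "repodata.state.json"  # assumed naming
    repodata.write_text('{"packages": {}}')
    placeholder_headers = {  # made-up values for the example
        "url": "https://example.invalid/noarch/repodata.json",
        "etag": 'W/"abc"',
        "mod": "Sun, 08 Jan 2023 11:45:44 GMT",
        "cache_control": "public, max-age=30",
    }
    written = write_state(repodata, state_file, placeholder_headers, has_zst=True)
    reloaded = json.loads(state_file.read_text())
```

Reading the state back only requires parsing this small file, independent of the size of the repodata itself.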

If the `mtime_ns` or `size` recorded in the `.state.json` file does not match the actual modification time or size of the `.json` file on disk, the header values are discarded. However, the `has_zst` and `has_jlap` values are kept, as they are independent of the validity of the repodata on disk.
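
The discard rule can be sketched as follows. This is a minimal illustration under assumptions, not the reference implementation: `load_state`, `HEADER_KEYS`, and the `repodata.state.json` file name are hypothetical, and the header values are placeholders.

```python
import json
import os
import tempfile
from pathlib import Path

# Header-derived keys that are only valid while the files are in sync.
HEADER_KEYS = ("url", "etag", "mod", "cache_control")


def load_state(repodata_path: Path, state_path: Path) -> dict:
    """Load the state file, dropping header values if the repodata changed."""
    try:
        state = json.loads(state_path.read_text())
    except (OSError, ValueError):
        return {}  # missing or unreadable state file
    if not isinstance(state, dict):
        return {}
    try:
        st = os.stat(repodata_path)
        in_sync = (state.get("mtime_ns") == st.st_mtime_ns
                   and state.get("size") == st.st_size)
    except OSError:
        in_sync = False  # repodata file is missing
    if not in_sync:
        for key in HEADER_KEYS:
            state.pop(key, None)
    # `has_zst` / `has_jlap` are kept either way.
    return state


with tempfile.TemporaryDirectory() as tmp:
    repodata = Path(tmp) / "repodata.json"
    state_file = Path(tmp) / "repodata.state.json"  # assumed naming
    repodata.write_text('{"packages": {}}')
    st = os.stat(repodata)
    state_file.write_text(json.dumps({
        "mtime_ns": st.st_mtime_ns,
        "size": st.st_size,
        "etag": 'W/"abc"',  # placeholder value
        "has_zst": {"last_checked": "2023-01-08T11:45:44Z", "value": True},
    }))
    fresh = load_state(repodata, state_file)

    # Another process rewrites the repodata; its size no longer matches.
    repodata.write_text('{"packages": {}, "extra": 1}')
    stale = load_state(repodata, state_file)
```

After the rewrite, `stale` no longer carries `etag`, so the client would make an unconditional request, while the `has_zst` entry is still available.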

If the client is tracking `repodata.json.zst` or `repodata.jlap` instead of
`(current_)?repodata.json`, then `etag`/`mod`/`cache_control` will correspond to
those remote files, instead of `repodata.json`.

### Backward compatibility

Older clients that try to reuse the existing cache will not be able to make use of the cached repodata as they do not know about the state (since it's not written to the same location). That means they will redownload the repodata.

## Copyright

All CEPs are explicitly [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/).