3.24

@alex-aizman alex-aizman released this 27 Sep 19:36

Version 3.24 arrives nearly 4 months after the previous one and contains more than 400 commits that fall into several main categories, topics, and sub-topics:

1. Core

1.1 Observability

We improved and optimized stats-reporting logic and introduced multiple new metrics and new management alerts.

There's now an easy way to observe per-backend performance and errors, if any. Instead of (or rather, in addition to) a single combined counter or latency, the system separately tracks requests that utilize AWS, GCP, and/or Azure backends.

For latencies, we additionally added cumulative "total-time" metrics:

  • "GET: total cumulative time (nanoseconds)"
  • "PUT: total cumulative time (nanoseconds)"
  • and more

Together with respective counters, those total-times can be used to compute precise latencies and throughputs over arbitrary time intervals - either on a per-backend basis or averaged across all remote backends, if any.
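
As a rough illustration of that computation, the sketch below derives average GET latency and throughput from two readings of cumulative metrics. The `Snapshot` fields are hypothetical names chosen for this example, not the actual AIS metric names, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One reading of cumulative metrics (field names are illustrative)."""
    timestamp_s: float  # time of the reading, in seconds
    get_count: int      # cumulative number of GET requests
    get_total_ns: int   # cumulative GET time, in nanoseconds
    get_bytes: int      # cumulative GET payload size, in bytes


def interval_stats(prev: Snapshot, curr: Snapshot) -> dict:
    """Average GET latency and throughput between two readings."""
    d_count = curr.get_count - prev.get_count
    d_ns = curr.get_total_ns - prev.get_total_ns
    d_bytes = curr.get_bytes - prev.get_bytes
    d_time = curr.timestamp_s - prev.timestamp_s
    if d_count <= 0 or d_time <= 0:
        # No requests (or no elapsed time) in the interval: nothing to average.
        return {"avg_latency_ms": 0.0, "requests_per_s": 0.0, "bytes_per_s": 0.0}
    return {
        "avg_latency_ms": d_ns / d_count / 1e6,  # ns per request -> ms
        "requests_per_s": d_count / d_time,
        "bytes_per_s": d_bytes / d_time,
    }


# Two made-up readings taken 10 seconds apart.
prev = Snapshot(timestamp_s=0.0, get_count=1_000,
                get_total_ns=2_000_000_000, get_bytes=10_000_000)
curr = Snapshot(timestamp_s=10.0, get_count=1_500,
                get_total_ns=3_000_000_000, get_bytes=15_000_000)
stats = interval_stats(prev, curr)
print(stats)
```

Because both the counter and the total-time are cumulative, the same subtraction works for any pair of readings, regardless of how far apart they were taken.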

New management alerts include keep-alive, tls-certificate-will-soon-expire (see next section), low-memory, low-capacity, and more.

Build-wise, aisnode with StatsD will now require the corresponding build tag.
Prometheus is effectively the default; for details, see related:

1.2 HTTPS; TLS

HTTPS deployment implies (and requires) that each AIS node (aisnode) has a valid TLS (X.509) certificate.

Every TLS certificate eventually expires: the standard-defined maximum lifetime is 13 months, or roughly 397 days.

AIS v3.24 automatically reloads updated certificates, tracks expiration times, and reports any inconsistencies between certificates in a cluster:

Associated Grafana and CLI-visible management alerts:

| alert | comment |
| --- | --- |
| tls-cert-will-soon-expire | Warning: less than 3 days remain until the current X.509 cert expires |
| tls-cert-expired | Critical (red) alert (as the name implies) |
| tls-cert-invalid | ditto |
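
The expiration-related alerts can be pictured as a simple comparison against the certificate's not-after time. The following is a minimal sketch, not the AIS implementation: the function name `tls_alert` is hypothetical, only the 3-day threshold comes from the alerts listed above, and `tls-cert-invalid` (which would require parsing and validating the certificate) is out of scope.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Warning threshold, as stated for tls-cert-will-soon-expire.
SOON = timedelta(days=3)


def tls_alert(not_after: datetime, now: datetime) -> Optional[str]:
    """Map a certificate's not-after time to an alert name, or None."""
    if now >= not_after:
        return "tls-cert-expired"
    if not_after - now < SOON:
        return "tls-cert-will-soon-expire"
    return None


now = datetime(2024, 9, 27, 12, 0, tzinfo=timezone.utc)
print(tls_alert(now + timedelta(days=30), now))  # healthy certificate
print(tls_alert(now + timedelta(days=2), now))   # less than 3 days remain
print(tls_alert(now - timedelta(hours=1), now))  # already expired
```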

Finally, there's a brand-new management API and ais tls CLI.

1.3 Filesystem Health Checker (FSHC)

FSHC component detects disk faults, raises associated alerts, and disables degraded mountpaths.

AIS v3.24 comes with a major (version 2) FSHC update, with new capabilities that include:

  • detect mountpath changes at runtime;
  • differentiate in-cluster IO errors from network and remote backend errors;
  • support associated configuration (section "API changes; Config changes" below);
  • resolve (mountpath, filesystem) to disk(s), and handle:
    • no-disks exception;
    • disk loss, disk fault;
    • new disk attachments.

1.4 Keep-Alive; Primary Election

The in-cluster keep-alive mechanism (a.k.a. heartbeat) was generally micro-optimized and improved. In particular, if an AIS node fails to ping the primary via the intra-cluster control network, it will now use its public network, if available.

And vice versa.
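
Conceptually, this fallback amounts to trying the available networks in order of preference. The sketch below is only an illustration of that idea: `reach_primary`, the network labels, and the `ping` callable are all hypothetical and do not correspond to AIS internals.

```python
from typing import Callable, Optional, Sequence


def reach_primary(networks: Sequence[str],
                  ping: Callable[[str], bool]) -> Optional[str]:
    """Try each network in order; return the first one that reaches the primary."""
    for network in networks:
        try:
            if ping(network):
                return network
        except OSError:
            # Connection-level failure on this network: fall back to the next.
            continue
    return None


def control_down(network: str) -> bool:
    """Fake ping that simulates an unreachable control network."""
    if network == "control":
        raise ConnectionError("control network unreachable")
    return True


# Preferred order: intra-cluster control first, public second.
used = reach_primary(["control", "public"], control_down)
print(used)
```

Reversing the list gives the "vice versa" case, where the public network is tried first and the control network serves as the fallback.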

As an aside, AIS does not require provisioning 3 different networks at deployment time. This has always been and remains a recommended option. But our experience running Kubernetes clusters in production environments proves that it is, well, highly recommended.

1.5 Rebalance; Erasure Coding: Intra-Cluster streams

Needless to say, erasure coding produces a lot of in-cluster traffic. For all those erasure-coded slice-sending-receiving transactions, AIS targets establish long-living peer-to-peer connections dubbed streams.

Long story short, any operation on an erasure-coded bucket requires streams. But there's also the motivation not to keep those streams open when there's no erasure coding. The associated overhead (expectedly) grows proportionally with the size of the cluster.

In AIS v3.24, we solve this problem, or part of this problem, by piggybacking on keep-alive messages that provide timely updates. Closing EC streams is a lazy process that may take several extra minutes, which is still preferable given that AIS clusters may run for days and weeks at a time with no EC traffic at all.

1.6 List Virtual Directories

Unlike hierarchical POSIX, object storage is flat, treating forward slash ('/') in object names as simply another symbol.

But that's not the entire truth. The other part of it is that users may want to operate on (i.e., list, load, shuffle, copy, transform, etc.) a subset of objects in a dataset that, for lack of a better word, looks exactly like a directory.
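
To make the idea concrete, here is a minimal sketch of how a flat namespace can be presented as one level of a directory tree. It is not the AIS implementation or API; `list_virtual_dir` and the object names are made up for this example.

```python
def list_virtual_dir(names, prefix=""):
    """Return (subdirectories, objects) found directly under `prefix`."""
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    subdirs, objects = set(), []
    for name in names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if not rest:
            continue  # the name equals the prefix itself
        head, sep, _ = rest.partition("/")
        if sep:
            # A slash remains: the object lives in a nested virtual directory.
            subdirs.add(prefix + head + "/")
        else:
            objects.append(name)
    return sorted(subdirs), sorted(objects)


names = [
    "train/a/1.jpg",
    "train/a/2.jpg",
    "train/b/1.jpg",
    "train/readme.txt",
    "val/1.jpg",
]
print(list_virtual_dir(names, "train"))
print(list_virtual_dir(names))
```

The first call reports `train/a/` and `train/b/` as subdirectories and `train/readme.txt` as the only object at that level; the second lists the top level, where only `train/` and `val/` appear.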

For details, please refer to:

1.7 API changes; Config changes

Including:

  • "[API change] show TLS certificate details; add top-level 'ais tls' command" 091f7b0
  • "[API change]: extend HEAD(object) to check remote metadata" c1004dd
  • "[config change]: FSHC v2: track and handle total number of soft errors" a2d04da
  • and more

1.8 Performance Optimization; Bug fixes; Improvements

Including:

  • "new RMD not to trigger rebalance when disabled in the config" 550cade20
  • "prefetch/copy/transform: number of concurrent workers" a5a30247d, 8aa832619
  • "intra-cluster notifications: reduce locking, mem allocations" b7965b7be
  • and much more

2. Initial Sharding (ishard); Distributed Shuffle (dsort)

Initial Sharding utility (ishard) is intended to create well-formed WebDataset-formatted shards from the original dataset.

Goes without saying: original ML datasets will have an arbitrary structure, a massive number of small files and/or very large files, and deeply nested directories. Notwithstanding, there's almost always the need to batch associated files (that constitute computable samples) together and maybe pre-shuffle them for immediate consumption by a model.
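
The batching step can be sketched as follows. This is an illustration of the general technique rather than of ishard itself: it assumes the WebDataset convention that files sharing a path up to the first dot in the base name form one sample, and it packs whole samples into shards under a size limit. The function names, file names, and sizes are hypothetical, and no tar archives are actually written.

```python
import posixpath


def sample_key(path):
    """Path up to the first dot of the base name identifies the sample."""
    directory, base = posixpath.split(path)
    stem = base.split(".", 1)[0]
    return posixpath.join(directory, stem)


def pack_shards(files, max_shard_bytes):
    """Group files into samples, then pack samples into size-limited shards.

    `files` maps path -> size in bytes. A sample is never split across
    shards; a sample larger than the limit gets a shard of its own.
    """
    samples = {}
    for path in sorted(files):
        samples.setdefault(sample_key(path), []).append(path)

    shards, current, current_size = [], [], 0
    for key in sorted(samples):
        members = samples[key]
        size = sum(files[p] for p in members)
        if current and current_size + size > max_shard_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.extend(members)
        current_size += size
    if current:
        shards.append(current)
    return shards


files = {
    "cats/001.jpg": 600, "cats/001.cls": 10,
    "cats/002.jpg": 500, "cats/002.cls": 10,
    "dogs/001.jpg": 700, "dogs/001.cls": 10,
}
shards = pack_shards(files, max_shard_bytes=1200)
for i, shard in enumerate(shards):
    print(i, shard)
```

With these inputs the two `cats` samples fit into the first shard and the `dogs` sample starts a second one; each image stays next to its label file.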

Hence, ishard:

3. Authentication; Access Control

Other than code improvements and micro-optimizations (as in continuous refactoring) of the AuthN codebase, the most notable updates also include:

| topic | what changed |
| --- | --- |
| CLI | improved token handling; user-friendly (and improved) error management; easy-to-use configuration that entails admin credentials, secret keys, and tokens |
| Configuration | notable (and related) environment variables: AIS_AUTHN_SECRET_KEY, AIS_AUTHN_SU_NAME, AIS_AUTHN_SU_PASS, and AIS_AUTHN_TOKEN |
| AuthN container image | (new) tailored specifically for Kubernetes deployments - for seamless integration and easy setup in K8s environments |

4. CLI

Usability improvements across the board, including:

  • "add 'ais tls validate-certificates' command" 0a2f25c
  • "'ais put --retries ' with increasing timeout, if need be" 99b7a96
  • "copy/transform: add '--num-workers' (number of concurrent workers) option" 2414c68
  • "extend 'show cluster' - add 'alert' column" 40d6580df
  • "show configured backend providers" ba492a1
  • "per-backend cumulative 'total' latencies"
  • and much more

5. Python: SDK (AIStore, AuthN); PyTorch DataLoader; Tools

| topic | what changed |
| --- | --- |
| SDK | compatibility with Python 3.8 and later versions; support retries via urllib3.Retry; add object group prefixes; improved dataset management for PyTorch |
| AuthN | add AuthN sub-package with Python APIs to manage users, permissions, roles, tokens, and clusters; add ObjectFile |
| PyTorch | dynamic sampling; support for multiple workers; integration with WebDataset. Also, included progress bars, improved error handling |
| Tools | add ShardReader; [Google Colab](https://aistore.nvidia.com/blog/2024/09/18/google-colab-aistore); pyaisloader to support ETL |

6. Build; Lint; Continuous Integration (CI)

| topic | what changed |
| --- | --- |
| CI | upgrade GitHub and GitLab CI configurations; include support for Python 3.8+; improve AuthN testing; fix various CI workflows; add PyTorch integration tests to CI; improve error handling during minikube deployments |
| Build | update Open Source Software (OSS) packages; standardize Dockerfile configurations; make Prometheus default; address security vulnerabilities (e.g., CVE fixes for google-protobuf and rexml) |
| Lint | enable more golangci-lint linters; clean up linter configurations |
| Deployment | improve deployment scripts and Makefiles; standardize container builds |

7. Documentation and Tests

| topic | what changed |
| --- | --- |
| new and updated references | virtual directories; AuthN SDK examples; TLS certificate management; Python SDK examples; loading, reloading, and generating certificates; switching cluster between HTTP and HTTPS; streaming ObjectFile examples; and more |
| tests | new ETL tests for concurrent transformations with varying object sizes; improved Python ETL setup for Kubernetes, with fixes for the mock cloud backend; stress tests for initial sharding (ishard); enhancements to race condition handling and minikube logging |

8. Blog

Finally, there are new technical blogs added during this v3.24 development iteration: