Releases: HewlettPackard/swarm-learning
Community version of Swarm Learning - v2.2.0
We are happy to announce the Swarm Learning 2.2.0 community release.
In this release, we have delivered key UI/UX enhancements that significantly improve the user experience, including experiment tracking for an easier “bird's-eye” view of past training rounds, parallel Swarm installation on multiple hosts, and Podman support via SLM-UI. We have also added powerful features to the Swarm manageability framework for better management of user ML workloads.
Customers can download product bits and documentation from My HPE Software Center.
Features
• Targeted SWOP command to run a task on a specific SWOP node.
  - Dynamic addition of peers to an ongoing task execution.
  - Retrying a failed task on a SWOP node.
• WITH ALL PEERS command to trigger a task execution on all available peers.
• UI/UX enhancements
  - Experiment tracking support to display the training attributes for multiple training rounds.
  - Parallel Swarm installation: option to add multiple hosts simultaneously.
  - View SWOP profile and task YAML.
  - Podman support.
• Swarm support for SPIRE as certificate manager.
  - Added CLI-based SPIRE example (spire/cifar10).
• Real-world NIH example: new example showcasing a Swarm use case with a real-world NIH dataset.
• Documentation enhancements
Defect fixes
• Fixed a stale SL Admin node getting stuck waiting for quorum while a new Admin is selected.
• Enabled non-default APLS port support from SLM-UI.
• Fixed issues when restarting the SLM-UI container while a training is running.
See the updated documentation for details on all new features and defect fixes.
For help or clarifications, reach out on Slack: https://hpe-external.slack.com/archives/C02PWRJPWVD
Community version of Swarm Learning - v2.1.0
In this release, we have delivered key enhancements to SLM-UI (model training metrics, easy browsing of ML logs, and a centralized Swarm log collector) that significantly improve the user experience. For advanced Swarm Learning users, we have provided a couple of additional options for merge algorithms, which help optimize training convergence for different customer workloads. Further, we have enabled persistence for blockchain data, which benefits customers through offline analysis of training-related data and faster restart of the Swarm Network (SN).
Product bits and documentation can be downloaded from My HPE Software Center (https://myenterpriselicense.hpe.com/cwp-ui/auth/login).
Here are the key contents of this release:
Features:
• Persistent data in SN
o Make the SN blockchain persist on disk
• UI/UX Features
o Model training metrics – Accuracy, Loss etc. at SL node and global Swarm level
o Browse through ML container logs
o Centralized Swarm log collector for faster diagnostic collection
o Seamless Product upgrade
• New merge methods for Swarm merge process
o Co-ordinate Median, Geometric Median
o Configurable merge through I/O or Memory optimized modes
• Swarm on Podman (alternative for Docker)
o Support Podman container runtime
o Run Swarm containers with rootless privileges
o Added support for SELinux with Podman on RHEL
• Enhanced diagnostics for SWOP and SN
• Containerized License Server (APLS)
• Documentation and example updates
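The Coordinate Median and Geometric Median merge methods listed above are robust alternatives to plain averaging. The following is a minimal, standard-library sketch of the two techniques on flattened parameter vectors, not Swarm Learning's actual merge implementation. The function names, the `peer_updates` data, and the use of Weiszfeld's iteration for the geometric median are all assumptions made for illustration.

```python
import math
from statistics import median


def coordinate_median(updates):
    """Coordinate-wise median: take the median of each parameter independently."""
    return [median(column) for column in zip(*updates)]


def geometric_median(updates, iterations=100, eps=1e-9):
    """Approximate the geometric median with Weiszfeld's iteration."""
    dim = len(updates[0])
    # Start from the arithmetic mean of the updates.
    estimate = [sum(u[i] for u in updates) / len(updates) for i in range(dim)]
    for _ in range(iterations):
        # Each update is weighted by the inverse of its distance to the
        # current estimate; eps guards against division by zero.
        weights = [1.0 / max(math.dist(u, estimate), eps) for u in updates]
        total = sum(weights)
        new_estimate = [
            sum(w * u[i] for w, u in zip(weights, updates)) / total
            for i in range(dim)
        ]
        moved = math.dist(new_estimate, estimate)
        estimate = new_estimate
        if moved < eps:
            break
    return estimate


# Hypothetical flattened weights: three similar peers and one outlier.
peer_updates = [
    [1.0, 2.0],
    [1.1, 2.1],
    [0.9, 1.9],
    [50.0, -40.0],
]
mean = [sum(column) / len(column) for column in zip(*peer_updates)]
cm = coordinate_median(peer_updates)
gm = geometric_median(peer_updates)
print("mean:", mean)
print("coordinate median:", cm)
print("geometric median:", gm)
```

In this toy data the single outlier drags the mean far from the other three peers, while both medians stay close to them. That robustness is the usual motivation for offering median-based merges.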
Defect fixes:
• Defect fixes in SN restart path
• Corrected ‘LIST NODES’ to display only active nodes
• Swarm components exit with proper diagnostics if certificates have expired
• Swarm Learning Topology updated to reflect active nodes
• Reverse proxy updates to consider the port number along with service name
RC 2 for 2.1.0 release
Release Candidate 2 for 2.1.0 release - Not meant for production
- List node fix
- Example changes for fraud detection using biased data, plus display of training metrics.
- UI improvements
Release Candidate 1 for 2.1.0 release - Not meant for production
Merge pull request #188 from iArpanPatel/perfdata_examples: Example application code updates to support Swarm Callback additional parameters
Community version of Swarm Learning - v2.0.0
We’re excited to announce the Swarm Learning 2.0.0 community release!
This release contains the following updates:
- High availability for SN
  Handling Sentinel node failure.
  Any SN node can act as sentinel while adding a new node.
  Supports mesh topology of the SN network.
- High availability for SL leader
  Electing a new merge leader when a leader failure is detected.
  Handles stale leader recovery.
- Swarm Learning Management UI (SLM-UI)
  Swarm product installation through SLM-UI.
  Deploy and manage Swarm Learning through SLM-UI.
- Swarm client library
  Extend Swarm Learning for new ML platforms.
- Improved diagnostics and utility script for log collection.
Community version of Swarm Learning - v1.2.0
We’re excited to announce the Swarm Learning 1.2.0 community release!
This release contains new features and important bug fixes.
- Reverse Proxy – Feature to reduce number of ports to be opened for Swarm training
- AMD GPU – Feature to enable workload training on AMD GPUs
- Provision to opt out of leader election for any SL node. If the user doesn’t want a slow node (one with less compute power, network bandwidth, etc.) to become merge leader, ‘SL_MAKE_ME_ADMIN’ can be set to ‘False’ for that node.
- SL-UID mapping in GET TASKRUNNER PEER STATUS command for better debuggability
- Improved List nodes command to display the list of Swarm nodes that have registered and are currently active
- SWCI aborts if a task fails while the ‘exit on error’ command is in effect
- Corrected SWCI error reporting
- Corrected run-swci script to handle swci_init file
- Version agnostic container info to run SWOP container across Linux distributions
- Improved SWCI inline HELP
- Improved plotTopology() in SWCI web API
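To illustrate the leader-election opt-out described above, here is a minimal sketch of how candidate filtering based on ‘SL_MAKE_ME_ADMIN’ could look. It is not Swarm Learning's actual election logic. The helper names, the node names, and the assumption that an unset variable means the node participates are all hypothetical.

```python
def is_leader_candidate(env):
    """Return False only when SL_MAKE_ME_ADMIN is explicitly set to 'False'."""
    # Assumption: a node takes part in leader election unless it opts out.
    value = env.get("SL_MAKE_ME_ADMIN", "True")
    return value.strip().lower() != "false"


def eligible_leaders(nodes):
    """Keep the names of nodes that have not opted out of leader election."""
    return [name for name, env in nodes.items() if is_leader_candidate(env)]


# Hypothetical node names and per-node environments.
nodes = {
    "sl-fast-1": {},
    "sl-fast-2": {"SL_MAKE_ME_ADMIN": "True"},
    "sl-slow-1": {"SL_MAKE_ME_ADMIN": "False"},
}
candidates = eligible_leaders(nodes)
print(candidates)
```

Only the node that explicitly sets the variable to ‘False’ is excluded. It still takes part in training but is never considered for the merge leader role in this sketch.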
Thank you all for your support! Please let us know if you have any feedback or queries.
Community Version of Swarm Learning - v1.1.0
This release contains the following features:
• SWOP Docker logs provide more information if user, SL, or SWOP containers exit due to an error.
• Fixed navigation errors in the web GUI.
• Enhanced logging and descriptive error messages in the web GUI.
• Added configurable SWCI_TASK_MAX_WAIT_TIME, which specifies the wait time for the WAIT FOR TASKRUNNER command.
• Added a new SWCI command (sleep).
• User ML containers run with non-root privileges.
• SWOP_KEEP_CONTAINERS environment variable is externalized.
• Enhanced Swarm Learning components to work with private Docker registry path patterns.
• Enhanced documentation.
Community Version of Swarm Learning - v0.3.0 TOT
First Community / Eval version of Swarm Learning v0.3.0
This version is no longer supported. Users who already have this version can refer to the old documentation on GitHub if needed.
We encourage all customers to move to the latest version.
New customers should use the latest version of the product.
Community Version of Swarm Learning - v1.0.0
Community release 1.0.0 of Swarm Learning.
This release has the following features:
• Swarm core functionality and user ML workloads are not tied to each other in the same Docker image. This enables you to run workloads on any version of the ML platform of your choice (Keras, TensorFlow, or PyTorch).
• Swarm Command Interface (SWCI) to create and manage training environments.
• Programmatic interface to SWCI.
• Swarm Operator (SWOP) to build and execute ML workflows in a decentralized way.
• Support for Nvidia GPUs.
• Web UI Installer for Windows, Linux, and Mac platforms.