All notable changes to this project will be documented in this file.
- No change.
- Fix for an old cron test. Date was in the past.
- Changed metrics store to store data per day to avoid Mongo doc size limits.
- Updated utilizations and reports to take advantage of the new metrics store layout.
- No change.
- Added a new Actor Configs feature with API endpoint for managing configuration shared across several actors, including "secrets" which are encrypted at rest in the Abaco database.
- A new `encryption_key` config within the `[web]` stanza of the abaco.conf file has been added and is required (this is used by the Actor Configs feature). All existing Abaco deployments must add such an encryption key to their config file before starting up the platform with the new version.
- Workers now have a `create_time` attribute which is populated when their id is initially assigned (at request time) so that we can detect situations such as a worker being in REQUESTED status for a long period of time.
- Updated the way worker health checks work so that they i) use unidirectional messages from health to worker (i.e., worker no longer replies) and ii) workers update their own `last_health_check_time`. This significantly simplifies the architecture and more easily handles the case where workers are no longer responsive. There is also a new `hard_delete_worker()` function in the health module which is used in cases where the worker is unresponsive. This should mitigate some issues we have seen with workers getting "stuck" in various statuses and not progressing, particularly when we see very high latency in the network.
- Increased the frequency with which health checks run to once every 30 seconds (was previously once every 10 minutes). This should also help with the aforementioned "stuck" workers issue.
- The worker channel thread for workers now runs as a daemon thread, so that it automatically exits if the worker's main thread exits (i.e., the worker crashes).
- Updated the `abaco/core` software, including the Docker image, to use Python 3.9 and updated the version of several dependent packages, including cloudpickle, cryptography, pycrypto and pyzmq.
- Fixed a bug in the autoscaler that was computing a worker's `last_execution_time` incorrectly.
- The test code has been moved to the `abaco/core` image for simplicity. The `abaco/testsuite` image should be considered deprecated.
- Updated the code in several places to use f-strings instead of `.format()`.
- Updates to the Makefile.
- The `abaco/testsuite` image has been deprecated. All test code is now bundled in the `abaco/core` image.
- Code cleanup including removal of unreachable code lines and removal of a student's personal tmp directory that was inadvertently added to the Dockerfile.
- No change.
- Fixed a bug which stopped cron jobs from creating a working access_token for actor usage.
- No change.
- No change.
- Actors are no longer put into ERROR state when unrecognized exceptions occur during the starting of actor containers. Most of the time, these exceptions are due to internal system errors, such as not being able to talk to RabbitMQ or getting socket timeouts from docker. These are not the fault of the actor, and putting it (but not other actors who simply didn't happen to be executing at the time) in ERROR state is confusing to users and leads to actors not processing messages until the user notices and intervenes.
- Fixed an issue where attempts to tear down the results channel associated with an execution could fail and cause an actor to be put into the ERROR state.
- Fixed issue where actor could be set to the ERROR state even after it was deleted.
- No change.
- Each actor now has a `revision` number property, a monotonically increasing integer that updates every time the actor's image is updated (including updates with `force=True`). Workers are also started with the current revision number and stop processing messages once their revision number is less than the actor's current revision.
- The autoscaler algorithm has been updated to be more resilient to runtime exceptions and other issues.
- A bug has been fixed that caused the status of an execution to remain in RUNNING state even after the actor was put in ERROR state.
- A bug has been fixed that prevented the actor's mailbox queue in RabbitMQ from being deleted when the actor is deleted.
- The channels module has been modified to make more use of the BasicTaskQueue class to decrease the RabbitMQ footprint of the system. Additionally, we have improved some handling of queues by more aggressively deleting them.
- No change.
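The revision check described for this release can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Actor` and `Worker` classes, the `update_image` and `should_process` names, and the starting revision of 1 are hypothetical and are not taken from the Abaco code base.

```python
from dataclasses import dataclass


@dataclass
class Actor:
    """Hypothetical stand-in for an actor record."""
    image: str
    revision: int = 1  # starting value is an assumption

    def update_image(self, image: str) -> None:
        # Every image update bumps the revision, including forced
        # updates that reuse the same image name.
        self.image = image
        self.revision += 1


@dataclass
class Worker:
    """Hypothetical stand-in for a worker started at a given revision."""
    revision: int

    def should_process(self, actor: Actor) -> bool:
        # A worker stops processing once its revision is older
        # than the actor's current revision.
        return self.revision >= actor.revision


actor = Actor(image="abacosamples/wc")
worker = Worker(revision=actor.revision)
before = worker.should_process(actor)   # worker is current
actor.update_image("abacosamples/wc")   # forced update, same image name
after = worker.should_process(actor)    # worker is now stale
```

Comparing integers keeps the check cheap enough to run before every message, which is presumably why a counter is used rather than comparing image names.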
- Added support for a cron scheduling feature for actors. The cron schedule feature allows users to instruct Abaco to automatically execute actors based on a schedule provided by the user. More information available from the docs (https://tacc-cloud.readthedocs.io/projects/abaco/en/latest/technical/messages.html#cron-schedule).
- Added support for configuring Abaco with a DockerHub credential to be used when pulling images from DockerHub. In particular, Abaco can be configured with the credentials of a licensed account with increased pull quota to avoid "toomanyrequests: You have reached your pull rate limit" errors from the Docker daemon.
- No change.
- No change.
- Added the `GET /actors/search/{search_type}?{search_terms}` endpoint for mongo database full-text search and matching.
- Added abaco_metrics_store to store long term information about actors executions
- Added tests for search feature
- Added developer docs for the new search endpoint and Mongo conversion.
- Added ReadTheDocs docs for the new search endpoint and its abilities.
- Added search endpoint to Mongo specs.
- Added testing (not complete) to the makefile
- Added maxfails for pytest in entry.sh for testing. Configurable through the Makefile; allows pytest to end after a set number of failures for debugging.
- Added tenant and actor_id fields to logs for better compatibility within search.
- Added `total_actors_all_with_executions` and `total_actors_existing_with_executions` to reports.py to account for the fact that actors without executions are not accounted for in the executions_store after the flattening.
- Converted Redis databases to Mongo for database simplification. This includes the actors_store, the workers_store, the alias_store, the nonce_store, and pregen_clients.
- Changed all time instances in Mongo from unix strings to datetime objects.
- Modified the Mongo store class to allow for missing features from Redis and allow for recursive use of any function that gets, sets, or deletes. Additionally added full_update, a function that allows for full use of the Mongo update_one function. Additionally added aggregation and indexing functions to use pymongo native functions.
- Modified any functions that needed updating to new Mongo database structure.
- Updated pymongo and other requirements for additional features.
- Flattened the workers and execution stores for better search results. This also allows for uninterrupted execution documents without an actor size limit.
- Modified the atomicity of some logic in abaco, namely setting worker status and setting nonce use. Rewritten to allow for atomic queries/sets without needing to implement transactions like when using Redis.
- Modified the cleaning functions of the makefile.
- Fixed abaco_core_test.py by deleting previously undeleted testing tenant.
- Modified reports.py to work with new mongo conversion.
- Eliminated any Redis calls, the Redis stores, Redis objects, any reference to Redis, Redis images, etc.
- Batching executions. This was previously meant to keep execution documents from going over Mongo document size limits and is no longer needed with the flattening of the executions_store.
- No change.
- Actors are no longer put into ERROR state when OAuth (APIM) client generation fails.
- The AgaveClientsService.create() method now tries to delete a "partially created" client for which credential generation failed.
- Fixed bug in the Execution model class methods for updating an execution which could cause exceptions to be thrown if the time to update the database exceeded certain pre-defined thresholds.
- Fixed an issue with workers not exiting cleanly when handling internal exceptions in the main thread.
- No change.
- No change.
- Compiled with an update to agaveflask core lib which adds support for the portals-api and 3dem tenants.
- No change.
- No change.
- Add second check of the globals.keep_running sentinel in the main worker thread (thread 1) to shrink the time window between a worker receiving a shutdown signal (in thread 2) and relaying that to thread 1, particularly after a new actor message was received (in thread 1). The previous, larger time window resulted in a race condition that could cause an actor message to get "partially" processed while the worker was being shut down. In particular, refreshing the token in thread 1 could fail if thread 2 had already removed the oauth client.
- Add retry logic to oauth client generation for a new worker; try up to 10 times before giving up and putting the actor in an error state.
- No change.
- No change.
- Fixed a bug resulting in an exception and possibly setting an actor to ERROR state when truncating an execution log.
- Physically delete worker records from workers store in spawner when a previous or concurrent error prevents the spawner from ever creating/starting the worker containers.
- Workers now try to refresh the access token up to 10 times (with a 2 second sleep between each attempt) before giving up and putting the actor into an ERROR state.
- Fixed an exception (which previously was only logged and swallowed) in the metrics_utils module caused by trying to access a variable that had not been defined under a certain code path.
- Added additional logging in spawner and worker modules.
- No change.
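The token refresh behavior noted in this release (up to 10 attempts with a 2 second sleep between them) follows a common bounded-retry pattern. A minimal sketch, assuming a hypothetical `refresh` callable and a hypothetical `refresh_with_retry` helper; neither name comes from the Abaco source:

```python
import time
from typing import Callable, Optional


def refresh_with_retry(
    refresh: Callable[[], str],
    attempts: int = 10,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call refresh() until it succeeds or the attempts are used up."""
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return refresh()
        except Exception as exc:  # a real worker would catch narrower errors
            last_error = exc
            if attempt < attempts:
                sleep(delay_seconds)  # no sleep after the final attempt
    # At this point the caller would put the actor into the ERROR state.
    raise RuntimeError(f"token refresh failed after {attempts} attempts") from last_error


# Usage: a refresh function that fails twice and then succeeds.
calls = {"n": 0}


def flaky_refresh() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "new-token"


token = refresh_with_retry(flaky_refresh, sleep=lambda _: None)
```

Passing `sleep` as a parameter is a design choice for this sketch only; it lets the retry loop be exercised without real delays.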
- Added an endpoint `PUT /actors/aliases/{alias}` for updating the definition of an alias. Requires `UPDATE` permission for the alias as well as for the actor to which the alias should be assigned.
- Fixed a bug where nonces defined for aliases would not be honored when using the alias in the URL (they were only honored when using the actor id assigned to the alias).
- Fixed issue where autoscaler did not properly scale down worker pools for actors with the `sync` hint. They are now scaled down to 1.
- The permission check on all `/aliases/{alias}` endpoints has been updated to require UPDATE on the associated `actor_id`.
- Fixed issue where the actor's token attribute was not being processed correctly, causing tokens to be generated even for actors for which the attribute was false.
- Fixed issue where hyperlinks in response model for executions were not generated correctly, showing the actor's internal database id instead of the human readable id.
- Fixed error messaging when using a nonce and the API endpoint+HTTP verb combination does not exist.
- The admin role is now recognized when checking access to certain objects in some edge cases, including when a nonce is used.
- It is no longer possible to create an alias nonce for permission level UPDATE.
- Added `hints` attribute to the actor data model, a list of strings representing metadata about an actor. "Official" Abaco hints will be added over time to provide automatic configuration of actors.
- Added support for the `sync` official hint: when an actor is registered with hint "sync", the Abaco autoscaler will leave at least one worker in the actor's worker pool up to a tenant-specific period of idle time. This idle time is configured using the `sync_max_idle_time` within the `[workers]` stanza of the `abaco.conf` file.
- Added a "utilization" endpoint, `GET /actors/utilization`, which returns basic utilization data about the Abaco cluster.
- Changed the way Abaco generates OAuth tokens that it injects into actors by prefixing the username associated with the token by its userstore's id. This change fixes an issue where other Tapis services (such as profiles) would not work properly when hit with the token because the associated JWT was not generated properly by WSO2. Otherwise, this change should be transparent to the end user.
- Fixed an issue where the `PUT /actors/{actor_id}` endpoint did not default the actor's `token` attribute to the tenant default. Now, if the `token` attribute is missing from the `PUT` message body, Abaco will use the default value for the tenant or instance.
- An actor's executions list is now initialized when the actor is created to prevent a race condition that can occur when multiple client threads try to add the very first execution (i.e., send the first message).
- The `DELETE /actors/aliases/{alias}` endpoint now returns a 404 not found if the alias `{alias}` does not exist.
- Fixed an issue with `GET /actors/{actor_id}/nonces` where nonces created before the 1.1.0 release (which introduced nonces associated with aliases) were not properly serialized in the response, causing random ids to be generated for the nonces instead of returning their actual ids.
- No change.
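The scale-down rule for the `sync` hint in this release can be illustrated with a small sketch. The function name `min_workers_to_keep`, its parameters, and the strict `<` comparison at the idle-time boundary are assumptions made for illustration and do not reflect the actual autoscaler code.

```python
from typing import Iterable


def min_workers_to_keep(
    hints: Iterable[str],
    idle_seconds: float,
    sync_max_idle_time: float,
) -> int:
    """Return how many workers scale-down should leave in an actor's pool."""
    # Actors with the "sync" hint keep one warm worker until they have
    # been idle for the configured period; all others can go to zero.
    if "sync" in hints and idle_seconds < sync_max_idle_time:
        return 1
    return 0


keep_sync_recent = min_workers_to_keep(["sync"], idle_seconds=30, sync_max_idle_time=600)
keep_sync_stale = min_workers_to_keep(["sync"], idle_seconds=900, sync_max_idle_time=600)
keep_plain = min_workers_to_keep([], idle_seconds=30, sync_max_idle_time=600)
```

Keeping one worker warm trades a small amount of idle capacity for lower latency on the next synchronous message.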
- Added a `token` Boolean attribute to the actor data model, indicating whether a token will be generated for an actor. When this attribute is False, Abaco will not generate an access token for the actor.
- Fixed an issue where the results socket was not writeable by non-root accounts.
- The Abaco API proxy (nginx) now returns properly formatted JSON messages for unhandled 400 and 500 level errors including bad gateway and timeout errors.
- Fixed various issues associated with Abaco resources not being shut down correctly on actor delete. First, actors now enter a `SHUTTING_DOWN` status immediately upon receiving a delete request, and this status is recognized by the autoscaler to prevent workers from being started. Second, workers now enter `SHUTDOWN_REQUESTED` when a shutdown has been requested and `SHUTTING_DOWN` when they have received the stop request. Spawners now check if a worker is in `REQUESTED` or `SHUTDOWN_REQUESTED` status before proceeding with starting the worker. Finally, the actor DELETE API now waits up to 20 seconds for all workers to be shut down; if they have not shut down by then, the delete still returns a 200 but the response message indicates that not all resources were shut down.
- Workers now force halt a running execution when an actor has been deleted; this allows resources to be cleaned up more efficiently.
- Fixed a rare edge case issue where a worker container would not exit cleanly due to the second worker_ch thread not checking the global keep_running boolean properly.
- The abaco.conf file now accepts configurations of the form `{tenant}_default_token` and `default_token` within the `[web]` stanza to provide a default value for the actor token attribute for a specific tenant and for the global Abaco instance, respectively. When both a tenant and a global configuration are set, actors in a given tenant will get the tenant's configuration.
- The abaco.conf file now accepts a `{tenant}_generate_clients` configuration within the `[workers]` stanza that dictates whether client generation is available for a specific tenant.
- Several log messages were cleaned up and improved.
- No change.
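The precedence between `{tenant}_default_token` and `default_token` in this release can be sketched with Python's standard `configparser`. The tenant name `mytenant`, the `true`/`false` value format, the `default_token_for` helper, and the fallback of `False` when neither key is present are all assumptions made for illustration.

```python
import configparser

# Hypothetical excerpt of an abaco.conf file.
SAMPLE_CONF = """
[web]
default_token: false
mytenant_default_token: true
"""


def default_token_for(conf: configparser.ConfigParser, tenant: str) -> bool:
    """Return the tenant's default if configured, else the global default."""
    web = conf["web"]
    tenant_key = f"{tenant}_default_token"
    if tenant_key in web:
        # The tenant-specific setting wins over the global one.
        return web.getboolean(tenant_key)
    # Falling back to False when nothing is configured is an assumption.
    return web.getboolean("default_token", fallback=False)


conf = configparser.ConfigParser()
conf.read_string(SAMPLE_CONF)

tenant_value = default_token_for(conf, "mytenant")  # tenant override applies
other_value = default_token_for(conf, "other")      # global default applies
```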
- No change
- Fixed an issue where in a certain edge case, workers were not exiting properly due to a bug trying to clean up a connection to RabbitMQ.
- No change.
- No change
- Fixed an issue where in a certain edge case, workers were getting shut down by the autoscaler before executions were getting processed.
- The abaco.conf now expects a `max_cmd_length` config within the `spawner` stanza which should be an integer and controls how many messages the autoscaler will send to the default command channel at a time.
- No change.
- Added actor events subsystem with events agent that reads from the events queue.
- Added support for actor links to send an actor's events to another actor.
- Added support for an actor webhook property for sending an actor's events as an HTTP POST to an endpoint.
- Added timing data to messages POST processing.
- Executions now change to status "RUNNING" as soon as a worker starts the corresponding actor container.
- Force halting an execution fails if the status is not RUNNING.
- Reading and managing nonces associated with aliases requires permissions on both the alias and the actor.
- Spawner now sets actor to READY state before setting worker to READY state to prevent autoscaler from stopping worker before actor is updated to READY.
- Updated ActorMsgQueue to use a new, simpler class, TaskQueue, removing dependency on channelpy.
- No change.
- Added support for sending synchronous messages to an actor.
- Added support for creating/managing nonces associated with aliases through a new API: `GET, POST /actors/aliases/{alias}/nonces`.
- Added support for halting a running execution through a new API endpoint: `DELETE /actors/{actor_id}/executions/{execution_id}`.
- Added support for streaming logs back to the logs service during a running execution so that the user does not have to wait for an execution to complete before seeing logs.
- The spawner management of workers has been greatly simplified with a significant reduction in messages between the two agents at start up. Worker status was updated to add additional worker states during start up. Worker state transitions are now validated at the model level.
- The `abacosamples/wc` word count image has been updated to now send a bytes result on the results channel.
- Improved worker and client cleanup code when actor goes into an ERROR state.
- Updates to health agent to add additional checks/clean up of clients store.
- Consolidated to a single docker-compose.yml file for local development and upgraded it to v3 docker-compose format.
- No change.
- Final updates to the Abaco Autoscaler in preparation for its release.
- Added "actor queues" feature to allow actors to be registered into a specific queue so that dedicated computing resources can be provided for specific groups of actors; updates to the controllers, spawner and health agents were made to support this feature.
- Added a "description" field on nonce objects to ease user management of nonces.
- Added a new "image classifier" sample, `abacosamples/binary_message_classifier`, that uses a pre-trained image classifier algorithm based on tensorflow to classify an image sent as a binary message.
- Aliases are now restricted to a whitelist of characters to prevent issues with the use of non-URL safe characters.
- Several modules were changed to improve handling of errors such as connection issues with RabbitMQ or socket level errors generated by the Docker daemon.
- No change.
- Add support for actor aliases allowing operators to refer to actors and associated endpoints via a self-defined identifier (alias) instead of the actor id.
- Add support for actor resource limits for cpu and memory. These can be globally configured and, with admin privileges, overridden on a per-actor basis at registration and update (`max_cpus`/`maxCpus` and `mem_limit`/`memLimit`).
- Add support for endpoint `DELETE /actors/{aid}/messages` to purge an actor's mailbox.
- The fields `actor_name`, `worker_id`, `container_repo` are now available in the context for an actor execution.
- Add support for atomic list mutations on the Redis store class.
- Grafana config added to Prometheus auto-scaler component.
- `len(clients_store)` is now a Prometheus gauge metric.
- Improved logging in spawner, worker, health, clientg and models modules.
- By default, actors are now registered as stateless. This means, by default, the state API will not be available but autoscaling will.
- Improve error handling when clientg process receives an error generating an OAuth client or token.
- Fix bug where workers API reported worker create time incorrectly.
- The locust load test suite application was expanded to allow additional types of actors to be registered and executed; additionally, bugs were corrected and configuration improved.
- The autoscaler now honors a `max_workers` field for each actor; it also only runs the scale up method if the command queue is less than a configurable max length.
- Fixed bug in scale-down method of autoscaler preventing scale down when actor had exactly 1 worker.
- Some aspects of the health process were changed to better integrate with the autoscaler.
- Fixed bug preventing health process from restarting crashed spawner correctly.
- Fixed bug in kill_worker causing database integrity issues when pull_image failed with an exception.
- Worker containers are now named by their actor and worker id for ease of identifying them.
- Fixed a bug where a results channel was not always closed properly, causing undue resource usage.
- No change
- A new sleep_loop sample was added for replicating actor executions with varying execution lengths.
- The channels module was refactored to give clients more control over acking/nacking messages, and whether to pre-fetch messages. This fixes a bug where messages could get lost when a worker crashes in certain ways.
- The core code was upgraded to Python 3.6.6 and the base images were updated to Alpine 3.8.
- The admin API now returns workers as a list, and a few other small bugs were fixed.
- Several updates and fixes were made to the Admin Dashboard application.
- No change
- New endpoints in the Admin API, `/actors/v2/admin/workers` and `/actors/v2/admin/executions`, for retrieving data about workers and executions, respectively.
- New `abacosamples/agave_submit_jobs` sample image for submitting a job from an actor.
- Fix issue where Spawner process would crash when receiving a Timeout error from the Docker daemon when a compute node was under heavy load.
- Hardening of various worker actions when compute node is under heavy load, including hardening of stats collection, results socket creation and teardown, and actor container stopping. Adds significant improvements to exception handling and retry logic in these failure cases, and puts actor in error states when unrecoverable errors are encountered. Among other things, these improvements should prevent multiple actor containers from running concurrently under the same worker.
- Numerous improvements to documentation.
- No change
- Extended support for tenant-specific identity configurations; specifically, enabling use/non-use of TAS integration at the tenant level as well as use of global UID and GID.
- Fixed a reliance on the existence of the Internal/everyone role in the JWT; now, if no roles are present in the JWT, Abaco inserts the "everyone" role enabling basic access and functionality.
- No change
- Added support for a tenant-specific global_mounts config.
- Changed RabbitMQ connection handling across all channel objects to greatly reduce cpu load on RabbitMQ server as well as on worker nodes in the cluster.
- Implemented a stop-no-delete command on the command channel to prevent a race condition when updating an actor's image that could cause the new worker to be killed.
- Fixed an issue where Docker fails to report container execution finish time when the compute server is under heavy load. In this case, we now return finish_time as computed from the start_time and the run_time (calculated by Abaco).
- Fixed issues with Actor update: 1) owner can no longer change in case a different user from the original owner updates the actor image, 2) last_update_time is always updated, and 3) ensure updater has permanent permissions for the actor.
- No change
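The finish-time fallback mentioned in this release amounts to simple date arithmetic. A minimal sketch, assuming a hypothetical `resolve_finish_time` helper and example timestamps; the real field handling in Abaco may differ:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def resolve_finish_time(
    reported_finish: Optional[datetime],
    start_time: datetime,
    run_time_seconds: float,
) -> datetime:
    """Prefer the finish time reported by Docker, else derive it."""
    if reported_finish is not None:
        return reported_finish
    # Docker did not report a finish time (e.g. host under heavy load):
    # derive it from the start time plus the measured run time.
    return start_time + timedelta(seconds=run_time_seconds)


start = datetime(2018, 1, 1, 12, 0, 0, tzinfo=timezone.utc)  # example value
finish = resolve_finish_time(None, start, 90.5)
```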
- Added support for setting max_workers_per_host to prevent overloading.
- Added support for retrieving the TAS GID on a per user basis from the extended profile within Metadata.
- Initial implementation of autoscaling via Prometheus added.
- Additional fields for each execution are now returned in the executions summary.
- The routines used when executing an actor container have been simplified to provide better performance and to prevent some issues such as stats collection generating a UnixHTTPConnectionPool Readtime when compute server is under load.
- Added several safety guards to the health checker code to prevent crashes of the health checker when working with unexpected data (e.g. when a worker's last_execution is not defined)
- Fixed bug due to message formatting issue in message returned from a POST to the /workers endpoint.
- The 'ids' collection has been removed from the executions endpoint response in favor of an 'executions' collection providing additional fields for each execution.
- Add support for binary messages through a FIFO mount to the actor.
- Add support for a "results" endpoint associated with each execution. Results are read from a Unix Domain socket mounted into the actor container and streamed to a Results queue specific to the execution.
- Read host id from the environment to support dynamic assignment such as when deploying with kubernetes.
- Add create_time attribute to workers and fix issue with health agents shutting down new workers too quickly if the worker had not processed an execution.
- An actor's state object can now be an arbitrary JSON-serializable object (not just a dictionary).
- Messages to add multiple new workers are now sent as multiple messages to the command queue, each adding 1 worker. This distributes commands across multiple spawners better.
- Default expiration time for Results channels has been increased from 100s to 20 minutes.
- Fixed a bug in the auth check that caused certain POST requests to fail with "not authorized" errors when the payload was not a JSON-dictionary.
- Fixed an issue preventing an actor's state object from being updated correctly.
- No change.
- Fixed issue where permissions errors were giving a confusing message about "unrecognized exception".
- Fixed bug causing a worker to be added to the workers_store with the wrong worker_id in a narrow case.
- Fixed an issue where the put_sync in the health check was causing messages to be left on the queue when the worker had already stopped.
- Fixed issue where requests to update an actor (i.e., PUT requests) were ignoring certain fields (e.g., default_environment)
- Fixed bug preventing the Agave OAuth client from being properly instantiated within the actor container when the actor was launched via a nonce.
- Add shutdown_all_workers convenience utility.
- Several tests added, specifically to validate behavior when invalid inputs were provided.
- No change.
- No change.
- Fixed issue (#24) where updating an actor caused mounts to disappear.
- Fixed issue (#25) where an actor's status message was not reset when it left an error state.
- Made the user role for "basic" level access configurable (#26).
- Turned off "check_workers_store" checks in health module until an optimal approach to data cleanup can be found.
- No change.
- Initial external release.
- No change.
- No change.