Merge pull request #8532 from romayalon/romy-online-upgrade-health
NC | Online Upgrade | Health CLI update config directory and upgrade checks
romayalon authored Dec 8, 2024
2 parents 4847584 + eaf443f commit dee7a5e
Showing 8 changed files with 365 additions and 182 deletions.
2 changes: 1 addition & 1 deletion docs/NooBaaNonContainerized/CI&Tests.md
@@ -88,7 +88,7 @@ Run `NC mocha tests` with root permissions -
#### NC mocha tests
The following is a list of `NC mocha test` files -
1. `test_nc_nsfs_cli.js` - Tests NooBaa CLI.
2. `test_nc_nsfs_health` - Tests NooBaa Health CLI.
2. `test_nc_health` - Tests NooBaa Health CLI.
3. `test_nsfs_glacier_backend.js` - Tests NooBaa Glacier Backend.
4. `test_nc_with_a_couple_of_forks.js` - Tests the `bucket_namespace_cache` when running with a couple of forks. Please notice that it uses `nc_coretest` with setup that includes a couple of forks.

38 changes: 36 additions & 2 deletions docs/NooBaaNonContainerized/Health.md
@@ -34,6 +34,14 @@ For more details about NooBaa RPM installation, see - [Getting Started](./Gettin
- Iterating buckets under the config directory.
- Confirming the existence of the bucket's configuration file and its validity as a JSON file.
- Verifying that the underlying storage path of a bucket exists.
- `Config directory health`
- Checks that the system config and the config directory data exist.
- Returns the config directory status.
- `Config directory upgrade health`
- Checks that the system config and the config directory data exist.
- Checks whether there is an ongoing upgrade.
- Returns an error if there is no ongoing upgrade but the config directory phase is locked.
- Returns a message if there is no ongoing upgrade and the config directory is unlocked.

* Health CLI requires root permissions.

@@ -148,6 +156,11 @@ The output of the Health CLI is a JSON object containing the following propertie
- Enum: 'PERSISTENT' | 'TEMPORARY'
- Description: For TEMPORARY error types, NooBaa attempts multiple retries before updating the status to reflect an error. Currently, TEMPORARY error types are only observed in checks for invalid NooBaa endpoints.

- `config_directory`
- Type: Object {"phase": "CONFIG_DIR_UNLOCKED" | "CONFIG_DIR_LOCKED","config_dir_version": String,
"upgrade_package_version": String, "upgrade_status": Object, "error": Object }.
- Description: An object that contains the config directory information, including the config directory upgrade information.
- Example: { "phase": "CONFIG_DIR_UNLOCKED", "config_dir_version": "1.0.0", "upgrade_package_version": "5.18.0", "upgrade_status": { "message": "there is no in-progress upgrade" }}
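As a minimal sketch of how a caller might consume this property, the helper below summarizes a parsed `config_directory` object. The name `summarize_config_directory` and the shape of its return value are assumptions made for illustration and are not part of NooBaa; only the documented fields (`phase`, `error`) are read.

```javascript
'use strict';

// Hypothetical helper, not part of NooBaa: summarizes the `config_directory`
// object taken from the Health CLI output, using only the documented fields.
function summarize_config_directory(config_directory) {
    if (!config_directory) {
        return { healthy: false, reason: 'config_directory is missing' };
    }
    if (config_directory.error) {
        // The error may be a string or an object, so normalize it to a string.
        const reason = typeof config_directory.error === 'string' ?
            config_directory.error :
            JSON.stringify(config_directory.error);
        return { healthy: false, reason };
    }
    if (config_directory.phase === 'CONFIG_DIR_LOCKED') {
        // Locked without an error: assumed here to mean an upgrade is in progress.
        return { healthy: true, reason: 'config directory is locked, an upgrade may be in progress' };
    }
    return { healthy: true, reason: 'config directory is unlocked' };
}

// Example input copied from the documented example above.
const example_config_directory = {
    phase: 'CONFIG_DIR_UNLOCKED',
    config_dir_version: '1.0.0',
    upgrade_package_version: '5.18.0',
    upgrade_status: { message: 'there is no in-progress upgrade' }
};

console.log(summarize_config_directory(example_config_directory));
```

Returning a small `{ healthy, reason }` object keeps the sketch independent of where the property sits in the full health report.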

## Example
```sh
@@ -225,6 +238,14 @@ Output:
}
],
"error_type": "PERSISTENT"
},
"config_directory": {
"phase": "CONFIG_DIR_UNLOCKED",
"config_dir_version": "1.0.0",
"upgrade_package_version": "5.18.0",
"upgrade_status": {
"message": "there is no in-progress upgrade"
}
}
}
}
@@ -243,7 +264,8 @@ Output:
- The config file of bucket1 is invalid. Therefore, NooBaa health reports INVALID_CONFIG.
- The underlying file system directory of bucket3 is missing. Therefore, NooBaa health reports STORAGE_NOT_EXIST.


- config_directory:
- The config directory phase is unlocked, the config directory version is "1.0.0", the matching source code/package version is "5.18.0", and there is no ongoing upgrade.


## Health Errors
@@ -365,4 +387,16 @@ The following error codes will be associated with a specific Bucket or Account s
- Reasons:
- Bucket missing owner account.
- Resolutions:
- Check for owner_account property in bucket config file.
- Check for owner_account property in bucket config file.
#### 8. Config Directory is invalid
- Error code: `INVALID_CONFIG_DIR`
- Error message: Config directory is invalid
- Reasons:
- System.json is missing - NooBaa was never started
- Config directory property is missing in system.json - the user didn't run config directory upgrade when upgrading from 5.17.z to 5.18.0
- Config directory upgrade error.
- Resolutions:
- Start NooBaa service
- Run `noobaa-cli upgrade`
- Check `in_progress_upgrade` for the exact reason for the failure.
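
The reasons and resolutions above can be sketched as a small triage function. This is a hypothetical illustration, not NooBaa code: `suggest_config_dir_resolution` is an assumed name, and `system_data` stands for the parsed content of `system.json` (or `undefined` when the file is missing).

```javascript
'use strict';

// Hypothetical triage helper, not part of NooBaa.
// Maps the state of system.json to the resolutions listed for INVALID_CONFIG_DIR.
function suggest_config_dir_resolution(system_data) {
    // system.json is missing: NooBaa was never started.
    if (!system_data) return 'Start NooBaa service';
    // The config directory property is missing: the config directory upgrade was not run.
    if (!system_data.config_directory) return 'Run `noobaa-cli upgrade`';
    // Otherwise assume a config directory upgrade error.
    return 'Check the upgrade details in system.json for the exact reason for the failure';
}

console.log(suggest_config_dir_resolution(undefined));
console.log(suggest_config_dir_resolution({}));
console.log(suggest_config_dir_resolution({ config_directory: { phase: 'CONFIG_DIR_LOCKED' } }));
```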
90 changes: 82 additions & 8 deletions src/manage_nsfs/health.js
@@ -14,6 +14,7 @@ const { TYPES } = require('./manage_nsfs_constants');
const { get_boolean_or_string_value, throw_cli_error, write_stdout_response, get_bucket_owner_account_by_id } = require('./manage_nsfs_cli_utils');
const { ManageCLIResponse } = require('./manage_nsfs_cli_responses');
const ManageCLIError = require('./manage_nsfs_cli_errors').ManageCLIError;
const { CONFIG_DIR_LOCKED, CONFIG_DIR_UNLOCKED } = require('../upgrade/nc_upgrade_manager');


const HOSTNAME = 'localhost';
@@ -60,6 +61,10 @@ const health_errors = {
error_code: 'MISSING_ACCOUNT_OWNER',
error_message: 'Bucket account owner not found',
},
INVALID_CONFIG_DIR: {
error_code: 'INVALID_CONFIG_DIR',
error_message: 'Config directory is invalid',
},
UNKNOWN_ERROR: {
error_code: 'UNKNOWN_ERROR',
error_message: 'An unknown error occurred',
@@ -117,12 +122,16 @@ class NSFSHealth {
endpoint_state = await this.get_endpoint_response();
memory = await this.get_service_memory_usage();
}
// TODO: add more health status based on system.json, e.g. RPM upgrade issues
const system_data = await this.config_fs.get_system_config_file({ silent_if_missing: true });
const config_directory_status = this._get_config_dir_status(system_data);

let bucket_details;
let account_details;
const response_code = endpoint_state ? endpoint_state.response.response_code : 'NOT_RUNNING';
const service_health = service_status !== 'active' || pid === '0' || response_code !== 'RUNNING' ? 'NOTOK' : 'OK';

const error_code = await this.get_error_code(service_status, pid, response_code);
const endpoint_response_code = (endpoint_state && endpoint_state.response?.response_code) || 'UNKNOWN_ERROR';
const health_check_params = { service_status, pid, endpoint_response_code, config_directory_status };
const service_health = this._calc_health_status(health_check_params);
const error_code = this.get_error_code(health_check_params);
if (this.all_bucket_details) bucket_details = await this.get_bucket_status();
if (this.all_account_details) account_details = await this.get_account_status();
const health = {
@@ -136,6 +145,7 @@ class NSFSHealth {
endpoint_state,
error_type: health_errors_tyes.TEMPORARY,
},
config_directory_status,
accounts_status: {
invalid_accounts: account_details === undefined ? undefined : account_details.invalid_storages,
valid_accounts: account_details === undefined ? undefined : account_details.valid_storages,
@@ -161,7 +171,7 @@ class NSFSHealth {
delay_ms: config.NC_HEALTH_ENDPOINT_RETRY_DELAY,
func: async () => {
endpoint_state = await this.get_endpoint_fork_response();
if (endpoint_state.response.response_code === fork_response_code.NOT_RUNNING.response_code) {
if (endpoint_state.response?.response_code === fork_response_code.NOT_RUNNING.response_code) {
throw new Error('Noobaa endpoint is not running, all the retries failed');
}
}
@@ -173,13 +183,23 @@ class NSFSHealth {
return endpoint_state;
}

async get_error_code(nsfs_status, pid, endpoint_response_code) {
if (nsfs_status !== 'active' || pid === '0') {
/**
* get_error_code returns the error code per the failed check
* @param {{service_status: String,
* pid: string,
* endpoint_response_code: string,
* config_directory_status: Object }} health_check_params
* @returns {Object}
*/
get_error_code({ service_status, pid, endpoint_response_code, config_directory_status }) {
if (service_status !== 'active' || pid === '0') {
return health_errors.NOOBAA_SERVICE_FAILED;
} else if (endpoint_response_code === 'NOT_RUNNING') {
return health_errors.NOOBAA_ENDPOINT_FAILED;
} else if (endpoint_response_code === 'MISSING_FORKS') {
return health_errors.NOOBAA_ENDPOINT_FORK_MISSING;
} else if (config_directory_status.error) {
return health_errors.INVALID_CONFIG_DIR;
}
}

@@ -239,7 +259,7 @@ class NSFSHealth {
const fork_count_response = await this.make_endpoint_health_request(url_path);
if (!fork_count_response) {
return {
response_code: fork_response_code.NOT_RUNNING,
response: fork_response_code.NOT_RUNNING,
total_fork_count: total_fork_count,
running_workers: worker_ids,
};
@@ -421,6 +441,60 @@ class NSFSHealth {
err_obj
};
}

/**
* _get_config_dir_status returns the config directory phase, version,
* matching package_version, upgrade_status and error if occurred.
* @param {Object} system_data
* @returns {Object}
*/
_get_config_dir_status(system_data) {
if (!system_data) return { error: 'system data is missing' };
const config_dir_data = system_data.config_directory;
if (!config_dir_data) return { error: 'config directory data is missing, must upgrade config directory' };
const config_dir_upgrade_status = this._get_config_dir_upgrade_status(config_dir_data);
return {
phase: config_dir_data.phase,
config_dir_version: config_dir_data.config_dir_version,
upgrade_package_version: config_dir_data.upgrade_package_version,
upgrade_status: config_dir_upgrade_status,
error: config_dir_upgrade_status.error || undefined
};
}

/**
* _get_config_dir_upgrade_status returns one of the following
* 1. the status of an ongoing upgrade, if valid it returns an object with upgrade details
* 2. if upgrade is not ongoing but config dir is locked, the error details of the upgrade's last_failure will return
* 3. if upgrade is not ongoing and config dir is unlocked, a corresponding message will return
* @param {Object} config_dir_data
* @returns {Object}
*/
_get_config_dir_upgrade_status(config_dir_data) {
if (config_dir_data.in_progress_upgrade) return { in_progress_upgrade: config_dir_data.in_progress_upgrade };
if (config_dir_data.phase === CONFIG_DIR_LOCKED) {
return { error: 'last_upgrade_failed', last_failure: config_dir_data.upgrade_history.last_failure };
}
if (config_dir_data.phase === CONFIG_DIR_UNLOCKED) {
return { message: 'there is no in-progress upgrade' };
}
}

/**
* _calc_health_status calculates the overall health status of NooBaa NC
* @param {{service_status: String,
* pid: string,
* endpoint_response_code: string,
* config_directory_status: Object }} health_check_params
* @returns {'OK' | 'NOTOK'}
*/
_calc_health_status({ service_status, pid, endpoint_response_code, config_directory_status }) {
const is_unhealthy = service_status !== 'active' ||
pid === '0' ||
endpoint_response_code !== 'RUNNING' ||
config_directory_status.error;
return is_unhealthy ? 'NOTOK' : 'OK';
}
}

async function get_health_status(argv, config_fs) {
2 changes: 1 addition & 1 deletion src/test/unit_tests/nc_index.js
@@ -11,7 +11,7 @@ require('./test_chunk_fs');
require('./test_namespace_fs_mpu');
require('./test_nb_native_fs');
require('./test_nc_nsfs_cli');
require('./test_nc_nsfs_health');
require('./test_nc_health');
require('./test_nsfs_access');
require('./test_nsfs_integration');
require('./test_bucketspace_fs');
2 changes: 1 addition & 1 deletion src/test/unit_tests/sudo_index.js
@@ -21,5 +21,5 @@ require('./test_bucketspace_versioning');
require('./test_bucketspace_fs');
require('./test_nsfs_versioning');
require('./test_nc_nsfs_cli');
require('./test_nc_nsfs_health');
require('./test_nc_health');
