chore: remove OpenSearch warnings #4750

Merged
merged 3 commits into from
Dec 12, 2024
@@ -11,20 +11,20 @@ This release introduces breaking changes, including the utilized URL.
For example, `curl 'http://localhost:8092/actuator/backups'` rather than the previously used `backup`.
:::

Optimize stores its data over multiple indices in Elasticsearch. To ensure data integrity across indices, a backup of Optimize data consists of two Elasticsearch snapshots, each containing a different set of Optimize indices. Each backup is identified by a positive integer backup ID. For example, a backup with ID `123456` consists of the following Elasticsearch snapshots:
Optimize stores its data across multiple indices in the database. To ensure data integrity across indices, a backup of Optimize data consists of two Elasticsearch/OpenSearch snapshots, each containing a different set of Optimize indices. Each backup is identified by a positive integer backup ID. For example, a backup with ID `123456` consists of the following snapshots:

```
camunda_optimize_123456_3.9.0_part_1_of_2
camunda_optimize_123456_3.9.0_part_2_of_2
```

Optimize provides an API to trigger a backup and retrieve information about a given backup's state. During backup creation Optimize can continue running. The backed up data can later be restored using the standard Elasticsearch snapshot restore API.
Optimize provides an API to trigger a backup and retrieve information about a given backup's state. During backup creation, Optimize can continue running. The backed-up data can later be restored using the standard Elasticsearch/OpenSearch snapshot restore API.
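
The naming scheme can be illustrated with a minimal Python sketch. The pattern is inferred only from the two example names above, and the helper names `snapshot_names` and `parse_snapshot_name` are hypothetical, not part of Optimize. Parsing the version tag out of a name may be useful when checking that a snapshot matches the running Optimize version before a restore.

```python
import re

# Assumed pattern, inferred from the example snapshot names in this document.
SNAPSHOT_NAME = re.compile(
    r"^camunda_optimize_(?P<backup_id>\d+)_(?P<version>.+)"
    r"_part_(?P<part>\d+)_of_(?P<total>\d+)$"
)


def snapshot_names(backup_id: int, version: str, parts: int = 2) -> list[str]:
    """Build the snapshot names that make up one backup (hypothetical helper)."""
    if backup_id <= 0:
        raise ValueError("backup ID must be a positive integer")
    return [
        f"camunda_optimize_{backup_id}_{version}_part_{i}_of_{parts}"
        for i in range(1, parts + 1)
    ]


def parse_snapshot_name(name: str) -> dict:
    """Split a snapshot name into backup ID, version tag, and part numbers."""
    match = SNAPSHOT_NAME.match(name)
    if match is None:
        raise ValueError(f"not an Optimize snapshot name: {name!r}")
    return {
        "backup_id": int(match["backup_id"]),
        "version": match["version"],
        "part": int(match["part"]),
        "total": int(match["total"]),
    }
```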

## Prerequisites

The following prerequisites must be set up before using the backup API:

1. A snapshot repository of your choice must be registered with Elasticsearch.
1. A snapshot repository of your choice must be registered with Elasticsearch/OpenSearch.
2. The repository name must be specified using the `CAMUNDA_OPTIMIZE_BACKUP_REPOSITORY_NAME` environment variable, or by adding it to your Optimize [`environment-config.yaml`]($optimize$/self-managed/optimize-deployment/configuration/system-configuration/):

```yaml
@@ -48,13 +48,13 @@ POST actuator/backups

### Response

| Code | Description |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| 202 Accepted | Backup process was successfully initiated. To determine whether backup process was completed refer to the GET API. |
| 400 Bad Request | Indicates issues with the request, for example when the `backupId` contains invalid characters. |
| 409 Conflict | Indicates that a backup with the same `backupId` already exists. |
| 500 Server Error | All other errors, e.g. issues communicating with Elasticsearch for snapshot creation. Refer to the returned error message for more details. |
| 502 Bad Gateway | Optimize has encountered issues while trying to connect to Elasticsearch. |
| Code | Description |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| 202 Accepted | Backup process was successfully initiated. To determine whether backup process was completed refer to the GET API. |
| 400 Bad Request | Indicates issues with the request, for example when the `backupId` contains invalid characters. |
| 409 Conflict | Indicates that a backup with the same `backupId` already exists. |
| 500 Server Error | All other errors, e.g. issues communicating with the database for snapshot creation. Refer to the returned error message for more details. |
| 502 Bad Gateway | Optimize has encountered issues while trying to connect to the database. |

### Example request

@@ -96,8 +96,8 @@ GET actuator/backup
| 200 OK | Backup state could be determined and is returned in the response body (see example below). |
| 400 Bad Request | There is an issue with the request, for example the repository name specified in the Optimize configuration does not exist. Refer to returned error message for details. |
| 404 Not Found | If a backup ID was specified, no backup with that ID exists. |
| 500 Server Error | All other errors, e.g. issues communicating with Elasticsearch for snapshot state retrieval. Refer to the returned error message for more details. |
| 502 Bad Gateway | Optimize has encountered issues while trying to connect to Elasticsearch. |
| 500 Server Error | All other errors, e.g. issues communicating with the database for snapshot state retrieval. Refer to the returned error message for more details. |
| 502 Bad Gateway | Optimize has encountered issues while trying to connect to the database. |

### Example request

@@ -135,8 +135,8 @@ Possible states of the backup:

- `COMPLETE`: The backup can be used for restoring data.
- `IN_PROGRESS`: The backup process for this backup ID is still in progress.
- `FAILED`: Something went wrong when creating this backup. To find out the exact problem, use the [Elasticsearch get snapshot status API](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/get-snapshot-status-api.html) for each of the snapshots included in the given backup.
- `INCOMPATIBLE`: The backup is incompatible with the current Elasticsearch version.
- `FAILED`: Something went wrong when creating this backup. To find out the exact problem, use the [Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/get-snapshot-status-api.html) / [OpenSearch](https://opensearch.org/docs/latest/api-reference/snapshots/get-snapshot-status/) get snapshot status API for each of the snapshots included in the given backup.
- `INCOMPATIBLE`: The backup is incompatible with the current Elasticsearch/OpenSearch version.
- `INCOMPLETE`: The backup is incomplete (this could occur when the backup process was interrupted or individual snapshots were deleted).

## Delete backup API
@@ -154,10 +154,10 @@ DELETE actuator/backups/{backupId}

| Code | Description |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| 204 No Content | The delete request for the associated snapshots was submitted to Elasticsearch successfully. |
| 204 No Content | The delete request for the associated snapshots was submitted to the database successfully. |
| 400 Bad Request | There is an issue with the request, for example the repository name specified in the Optimize configuration does not exist. Refer to returned error message for details. |
| 500 Server Error | An error occurred, for example the snapshot repository does not exist. Refer to the returned error message for details. |
| 502 Bad Gateway | Optimize has encountered issues while trying to connect to Elasticsearch. |
| 502 Bad Gateway | Optimize has encountered issues while trying to connect to Elasticsearch/OpenSearch. |

### Example request

@@ -167,22 +167,22 @@ curl --request DELETE 'http://localhost:8092/actuator/backups/123456'

## Restore backup

There is no Optimize API to perform the backup restore. Instead, the standard [Elasticsearch restore snapshot API](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/restore-snapshot-api.html) can be used. Note that the Optimize versions of your backup snapshots must match the currently running version of Optimize. You can identify the version at which the backup was taken by the version tag included in respective snapshot names; for example, a snapshot with the name `camunda_optimize_123456_3.9.0_part_1_of_2` was taken of Optimize version `3.9.0`.
There is no Optimize API to perform the backup restore. Instead, the standard [Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/restore-snapshot-api.html) / [OpenSearch](https://opensearch.org/docs/latest/api-reference/snapshots/restore-snapshot) restore snapshot API can be used. Note that the Optimize versions of your backup snapshots must match the currently running version of Optimize. You can identify the version at which the backup was taken by the version tag included in respective snapshot names; for example, a snapshot with the name `camunda_optimize_123456_3.9.0_part_1_of_2` was taken of Optimize version `3.9.0`.

:::note
Optimize must NOT be running while a backup is being restored.
:::

To restore an existing backup, all the snapshots this backup contains (as listed in the response of the [create backup API request](#example-response)) must be restored using the Elasticsearch API.
To restore an existing backup, all the snapshots this backup contains (as listed in the response of the [create backup API request](#example-response)) must be restored using the restore API.

To restore a given backup, the following steps must be performed:

1. Stop Optimize.
2. Ensure no Optimize indices are present in Elasticsearch (or the restore process will fail).
3. Iterate over all Elasticsearch snapshots included in the desired backup and restore them using the Elasticsearch restore snapshot API.
2. Ensure no Optimize indices are present in the database (or the restore process will fail).
3. Iterate over all Elasticsearch/OpenSearch snapshots included in the desired backup and restore them using the restore snapshot API mentioned above.
4. Start Optimize.

Example Elasticsearch request:
Example request:

```shell
curl --request POST 'http://localhost:9200/_snapshot/repository_name/camunda_optimize_123456_3.9.0_part_1_of_2/_restore?wait_for_completion=true'
@@ -14,17 +14,17 @@ In general, the import assumes the following setup:

- A Camunda engine from which Optimize imports the data.
- The Optimize backend, where the data is transformed into an appropriate format for efficient data analysis.
- [Elasticsearch](https://www.elastic.co/guide/index.html), which is the database Optimize persists all formatted data to.
- [Elasticsearch (ES)](https://www.elastic.co/guide/index.html) or [OpenSearch (OS)](https://opensearch.org/), which serves as the database that Optimize uses to persist all of its formatted data.

The following depicts the setup and how the components communicate with each other:

![Optimize Import Structure](img/Optimize-Structure.png)

Optimize queries the engine data using a dedicated Optimize REST-API within the engine, transforms the data, and stores it in its own Elasticsearch database such that it can be quickly and easily queried by Optimize when evaluating reports or performing analyses. The reason for having a dedicated REST endpoint for Optimize is performance: the default REST-API adds a lot of complexity to retrieve the data from the engine database, which can result in low performance for large data sets.
Optimize queries the engine data using a dedicated Optimize REST-API within the engine, transforms the data, and stores it in its own database such that it can be quickly and easily queried by Optimize when evaluating reports or performing analyses. The reason for having a dedicated REST endpoint for Optimize is performance: the default REST-API adds a lot of complexity to retrieve the data from the engine database, which can result in low performance for large data sets.

Note the following limitations regarding the data in Optimize's database:

- The data is only a near real-time representation of the engine database. This means Elasticsearch may not contain the data of the most recent time frame, e.g. the last two minutes, but all the previous data should be synchronized.
- The data is only a near real-time representation of the engine database. This means the database may not contain the data of the most recent time frame, e.g. the last two minutes, but all the previous data should be synchronized.
- Optimize only imports the data it needs for its analysis. The rest is omitted and won't be available for further investigation. Currently, Optimize imports:
- The history of the activity instances
- The history of the process instances
@@ -47,7 +47,7 @@ This section gives an overview of how fast Optimize imports certain data sets. T

It is very likely that these metrics change for different data sets because the speed of the import depends on how the data is distributed.

The import is also affected by how the involved components are set up. For instance, if you deploy the Camunda engine on a different machine than Optimize and Elasticsearch to provide both applications with more computation resources, the process is likely to speed up. If the Camunda engine and Optimize are physically far away from each other, the network latency might slow down the import.
The import is also affected by how the involved components are set up. For instance, if you deploy the Camunda engine on a different machine than Optimize and Elasticsearch/OpenSearch to provide both applications with more computation resources, the process is likely to speed up. If the Camunda engine and Optimize are physically far away from each other, the network latency might slow down the import.

### Setup

@@ -135,7 +135,7 @@ During execution, the following steps are performed:
2. Map entities and add an import job
3. [Execute the import](#execute-the-import).
1. Poll a job
2. Persist the new entities to Elasticsearch
2. Persist the new entities to the database

### Start an import round

@@ -175,33 +175,37 @@ First, the `ImportScheduler` retrieves the newest index, which identifies the la

#### Map entities and add an import job

All fetched entities are mapped to a representation that allows Optimize to query the data very quickly. Subsequently, an import job is created and added to the queue to persist the data in Elasticsearch.
All fetched entities are mapped to a representation that allows Optimize to query the data very quickly. Subsequently, an import job is created and added to the queue to persist the data in the database.

### Execute the import

Full aggregation of the data is performed by a dedicated `ImportJobExecutor` for each entity type, which waits for `ImportJob` instances to be added to the execution queue. As soon as a job is in the queue, the executor:

- Polls the job with the new Optimize entities
- Persists the new entities to Elasticsearch
- Persists the new entities to the database

The data from the engine and Optimize do not have a one-to-one relationship, i.e., one entity type in Optimize may consist of data aggregated from different data types of the engine. For example, the historic process instance is first mapped to an Optimize `ProcessInstance`. However, for the heatmap analysis it is also necessary for `ProcessInstance` to contain all activities that were executed in the process instance.

Therefore, the Optimize `ProcessInstance` is an aggregation of the engine's historic process instance and other related data: historic activity instance data, user task data, and variable data are all [nested documents](https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html) within Optimize's `ProcessInstance` representation.
Therefore, the Optimize `ProcessInstance` is an aggregation of the engine's historic process instance and other related data: historic activity instance data, user task data, and variable data are all nested documents ([ES](https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html) / [OS](https://opensearch.org/docs/latest/field-types/supported-field-types/nested/)) within Optimize's `ProcessInstance` representation.
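
The aggregation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `build_process_instance` and all field names (`processInstanceId`, `flowNodeInstances`, `userTasks`, `variables`) are hypothetical and are not taken from Optimize's actual index mapping.

```python
def build_process_instance(instance, activities, user_tasks, variables):
    """Fold related engine records into one document with nested lists.

    All field names here are hypothetical placeholders for illustration.
    """
    pid = instance["id"]

    def belonging_to_instance(records):
        # Keep only the records that reference this process instance.
        return [r for r in records if r["processInstanceId"] == pid]

    return {
        "processInstanceId": pid,
        "flowNodeInstances": belonging_to_instance(activities),
        "userTasks": belonging_to_instance(user_tasks),
        "variables": belonging_to_instance(variables),
    }
```

Each nested list grows with the number of related engine records, which is why a single process instance with many activities or variables can produce a large number of nested objects in one document.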

:::note
Optimize uses [nested documents](https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html), the above mentioned data is an example of documents that are nested within Optimize's `ProcessInstance` index.
Optimize uses nested documents ([ES](https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html) / [OS](https://opensearch.org/docs/latest/field-types/supported-field-types/nested/)). The data mentioned above is an example of documents that are nested within Optimize's `ProcessInstance` index.

Elasticsearch applies restrictions regarding how many objects can be nested within one document. If your data includes too many nested documents, you may experience import failures. To avoid this, you can temporarily increase the nested object limit in Optimize's [index configuration](./../configuration/system-configuration.md#index-settings). Note that this might cause memory errors.
Elasticsearch and OpenSearch apply restrictions regarding how many objects can be nested within one document. If your data includes too many nested documents, you may experience import failures. To avoid this, you can temporarily increase the nested object limit in Optimize's [index configuration](./../configuration/system-configuration.md#index-settings). Note that this might cause memory errors.
:::

Import executions per engine entity are independent of one another. Each follows a [producer-consumer pattern](https://dzone.com/articles/producer-consumer-pattern), where the type-specific `ImportService` is the single producer and a dedicated `ImportJobExecutor` is the consumer of its import jobs, decoupled by a queue. Both therefore run in different threads. To adjust the processing speed of the executor, the queue size and the number of threads that process the import jobs can be configured:

:::note
Although the parameters below include `elasticsearch` in their names, they apply to both Elasticsearch and OpenSearch installations. For backward compatibility, the parameters have not been renamed.
:::

```yaml
import:
# Number of threads being used to process the import jobs per data type that are writing
# data to elasticsearch.
# data to the database.
elasticsearchJobExecutorThreadCount: 1
# Adjust the queue size of the import jobs per data type that store data to elasticsearch.
# Adjust the queue size of the import jobs per data type that store data to the database.
# A too large value might cause memory problems.
elasticsearchJobExecutorQueueSize: 5
```
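
As a rough illustration of how these two settings interact, the following Python sketch models a single producer feeding a bounded queue that is drained by a configurable number of executor threads. It is a generic producer-consumer sketch, not Optimize's implementation: the function `run_import`, its parameters, and the choice to block the producer while the queue is full are all assumptions made for this example.

```python
import queue
import threading


def run_import(batches, persist, thread_count=1, queue_size=5):
    """Push each batch through a bounded queue to `thread_count` executors.

    Hypothetical sketch: `persist` stands in for writing to the database.
    It is assumed not to raise; a real executor would need error handling.
    """
    jobs = queue.Queue(maxsize=queue_size)

    def executor():
        while True:
            job = jobs.get()
            if job is None:  # sentinel: no more jobs for this thread
                return
            persist(job)

    workers = [threading.Thread(target=executor) for _ in range(thread_count)]
    for worker in workers:
        worker.start()

    # Producer side: put() blocks while the queue already holds
    # `queue_size` jobs, which bounds memory use.
    for batch in batches:
        jobs.put(batch)

    # One sentinel per executor thread so that every thread terminates.
    for _ in workers:
        jobs.put(None)
    for worker in workers:
        worker.join()
```

In this model, a larger queue lets the producer run further ahead of the executors at the cost of holding more pending jobs in memory, while more threads drain the queue faster as long as the database can keep up.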