[DPE-4115] Performance Profile Support #466

Merged · 79 commits · Oct 21, 2024

Conversation

phvalguima
Contributor

@phvalguima phvalguima commented Oct 1, 2024

This PR extends the current charm to support performance profiles, following spec DA031, and adds support for the following profiles:

  • testing:
    • focused on the integration tests and our CI -> 1G of RAM dedicated to the heap and no automation
  • staging:
    • HA capabilities must be available: for that, we will enforce an index template that encompasses all indices and sets replica: 1-all
    • Extends the heap to: max(1G, 10% of RAM)
    • indices.memory.index_buffer_size is extended to 25%
    • Adds three component templates, described later
  • production:
    • Same features as staging, but the heap is instead set to: max(1G, 50% of RAM)

The options above are set based on the following documents:
https://opensearch.org/docs/latest/tuning-your-cluster/performance/
https://opensearch.org/docs/latest/search-plugins/knn/performance-tuning/

The user can switch between the three profiles above; depending on the selected value, the templates are created or destroyed.
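For illustration, the heap rules of the three profiles above can be sketched as follows (hypothetical helper, not the charm's actual code; values in kB):

```python
def heap_size_kb(profile: str, total_ram_kb: int) -> int:
    """Sketch of the per-profile heap rules described above, in kB.

    testing:    fixed 1G
    staging:    max(1G, 10% of RAM)
    production: max(1G, 50% of RAM)
    """
    one_gb_kb = 1024 * 1024
    if profile == "testing":
        return one_gb_kb
    if profile == "staging":
        return max(one_gb_kb, int(0.10 * total_ram_kb))
    if profile == "production":
        return max(one_gb_kb, int(0.50 * total_ram_kb))
    raise ValueError(f"unknown profile: {profile}")
```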


UPDATE

One important question about this PR is whether index templates with '*' will apply to system indices. The first part of the answer is: index templates are only applied at index creation, as shown here. We can delete index templates after indices have been created from them.

Manual configuration (e.g. setting 0-all) will always take precedence.

There is an exception for hidden (not necessarily system) indices: "catch-all" templates are not applied to hidden indices.

@phvalguima phvalguima requested review from Mehdi-Bendriss and skourta and removed request for skourta October 1, 2024 17:23
Contributor

@Mehdi-Bendriss Mehdi-Bendriss left a comment


Thanks @phvalguima.

I have a few points:

1. Replication factor:

We should never set the replication factor to 1-all - this is unnatural for non-system indices and may cause the disks of all units to overflow quickly.

Replication factors are set by the user.

If we want to be safer, we should look for indices whose number_of_replicas < 1 and set it to 1. Not more.
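The suggested safer behavior could be sketched as follows (hypothetical helper; the input shape assumes settings as returned per index, e.g. from GET /_all/_settings):

```python
def indices_needing_replicas(settings_by_index: dict) -> list:
    """Return indices whose number_of_replicas is below 1.

    Sketch of the suggestion above: only raise the replication
    factor to 1 when it is currently 0 - never more - so that
    user-chosen replication factors are left untouched.
    """
    to_fix = []
    for name, settings in settings_by_index.items():
        replicas = int(settings.get("number_of_replicas", 1))
        if replicas < 1:
            to_fix.append(name)
    return to_fix
```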

2. Codec used:

Why did you choose zstd_no_dict instead of zstd, or instead of qat_lz4 (on supported systems), or others? Which one should we choose? How?

It may also be needed to set a compression algorithm with each - which one to choose? How?

3. Heap size:

The heap size should, in production, be set to 50% of RAM but lower than 32 GB, as per the official OpenSearch recommendations: here and here.

4. Units conversion:

The main complexity of this PR revolves around unit conversions; two things to note:

  1. /proc/meminfo always returns values in kB.
    1. Any reason not to use psutil? This should save you the parsing of /proc/meminfo.
  2. The JVM Xms and Xmx properties accept g|G, m|M, or k|K, meaning the smallest unit is k.

With this in mind, it makes sense to normalize the values to the smallest unit supported by both, which is the kB, and work with it exclusively (when reading, calculating, or writing to file).
Something along the lines of:

import re

def jvm_size_in_kb(size: str) -> int:
    """Normalize the size values set in jvm.options from supported units to kB."""
    match = re.match(r"(\d+)([gmk])$", size.lower())
    if not match:
        raise ValueError(f"unsupported JVM size value: {size!r}")

    value = int(match.group(1))
    jvm_formatted_unit = match.group(2)

    factor = 1
    if jvm_formatted_unit == "m":
        factor = 1024
    elif jvm_formatted_unit == "g":
        factor = 1024 * 1024

    return value * factor

With this, I believe we do not need the ByteUnit and JavaByteSize classes.

Similar for the percentage method, which can simply be calculated as int(0.25 * val) ==> again, the loss from rounding down is minimal because we are dealing with kilobytes.
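Taken together, the normalization and the percentage step can be combined as in this self-contained sketch (the helper mirrors the one proposed in this review, not the charm's final code):

```python
import re

def jvm_size_in_kb(size: str) -> int:
    """Normalize a jvm.options size (e.g. '1g', '512m', '4096k') to kB."""
    match = re.match(r"(\d+)([gmk])$", size.lower())
    if not match:
        raise ValueError(f"unsupported JVM size value: {size!r}")
    factor = {"k": 1, "m": 1024, "g": 1024 * 1024}[match.group(2)]
    return int(match.group(1)) * factor

# e.g. a 25% index buffer relative to a 4G heap, rounded down to whole kB
buffer_kb = int(0.25 * jvm_size_in_kb("4g"))
```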

@phvalguima
Contributor Author

phvalguima commented Oct 6, 2024

Hi @Mehdi-Bendriss, I will go over your points one by one.


  1. Replication factor:
    We should never set the replication factor to 1-all - this is unnatural for non system indices and may cause the disk of all units to overflow quickly.
    ...
    If we want to be safer, we should look for indices whose number_of_replicas < 1 and set it to 1. Not more.

Indeed, I corrected that right after our conversation earlier this week, in this commit.

Although I agree with removing -all, thinking of real-world deployments, we always go with at least 3 nodes. Other operators, such as the SQL DBs, are always deployed in the field with 1 main + 2 in-sync replicas. We remove the -all, but we could offer a similar experience and set it to 2: since this is the HA type of deployment, at least 3 nodes will be present.

Replication factors are set by the user.

True, but our specification, DA031 - Profile config option, states that for both staging and production "we will be providing a highly-available and scalable service to be used in production". Now, that is a bit open to interpretation in my view: "highly-available" in terms of OpenSearch's own services (i.e. only the service indices should be set up for HA by our charm) OR in terms of using OpenSearch as well (i.e. even indices created by the user).

That is why I am trying to put together a minimal "index template" that at least assures a user that indices will be replicated, unless the user explicitly states otherwise.
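For illustration, such a minimal catch-all template body could look like the sketch below (the priority value and exact settings are assumptions for this example, not the charm's actual template):

```python
# Hypothetical catch-all index template: guarantees at least one replica
# while leaving users free to override it, either per index or with a
# higher-priority template of their own.
CATCH_ALL_TEMPLATE = {
    "index_patterns": ["*"],
    "priority": 0,  # lowest priority, so any user-defined template wins
    "template": {
        "settings": {
            "number_of_replicas": 1,
        }
    },
}
```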


  2. Codec used:

TBH, I am not entirely set on any of the values we will discuss below. That is one of the reasons I added them as component templates, so a user can build their own index templates on top.

I'd rather benchmark these values for comparison first. Maybe getting that landed in this PR is too much. TL;DR I think we can play it safer as follows:

  1. break the "component template" setup out into a separate branch, for now
  2. run performance tests until we are okay with which values to suggest
  3. if we are okay with the results, report them and first update our documentation (a user can then follow it and create their own indices / index templates)
  4. eventually, come back to the branch from step (1) and merge it.

WDYT?

I will start with a later question:

which one to choose?

The codec selection came from these results.

Why did you choose zstd_no_dict instead of zstd

In the case of zstd_no_dict, from the results above, we would sacrifice 5% of compression for +7% (net) p90 latency and +7% (net) throughput when compared with zstd.

or instead qat_lz4 (in supported systems) or others?

Happy to set QAT once we have the logic to detect and enable it. We are not there yet.

How?

First, we should keep in mind that some types of indices cannot work with all the codecs (e.g. the vector indices).

As you also noticed, I am not fully onboard with these results, as there are other parameters to be set. In this case, the "How?" will have to be answered the same way we did the AVX testing: by running performance tests and comparing results.

It may also be needed to set a compression algorithm with each? which one to choose? how?

Yes, I share this concern. What I think we should do here is run these component templates against benchmarks and document the results.


  3. Heap size:
    The heap size should, in production, be set to 50% of RAM but lower than 32 GB, as per the official OpenSearch recommendations: here and here.

Thanks, that is a really good point. I will set a hard limit of 32G. Indeed, having the GC go over hundreds of gigs does not sound good :)
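With that cap in place, the production heap rule could be sketched as (hypothetical helper name; values in kB):

```python
def production_heap_kb(total_ram_kb: int) -> int:
    """max(1G, 50% of RAM), hard-capped at 32G, all in kB (sketch)."""
    one_gb_kb = 1024 * 1024
    return min(max(one_gb_kb, int(0.5 * total_ram_kb)), 32 * one_gb_kb)
```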


  4. Units conversion:

So, some thoughts here: (1) indeed, sticking with kB, as long as the JVM can accept the "32G" value in kB format, is a good idea and would simplify a lot; and (2) I was discussing this with the Big Data team, who could benefit from this logic as well. The idea is to eventually move it to data_platform_helpers. From what I gathered, they currently set the JVM heap to a hard-coded number.

The main complexity of this PR revolves around unit conversions; two things to note:

  1. /proc/meminfo always returns values in kB.

That is true as the kernel code shows.

  1. Any reason not to use psutil? This should save you the parsing of /proc/meminfo.

Yes, two reasons: (1) the parsing goes from L589 to L598... I prefer 10 LoC processing a file whose format is set in stone by the kernel over adding a new dependency (which is not shipped with stock Ubuntu anyway); and (2) it gives access to hugepages info as well, which we can potentially explore later, and it would be "just there".
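The parsing being defended here is indeed short; a minimal self-contained sketch (hypothetical helper, the charm's actual parser may differ):

```python
def mem_total_kb(meminfo: str) -> int:
    """Extract MemTotal from /proc/meminfo content.

    The kernel always reports these values in kB, so no unit
    handling is needed, e.g. "MemTotal:       16384000 kB".
    """
    for line in meminfo.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found in meminfo")

# usage sketch:
# with open("/proc/meminfo") as f:
#     total_kb = mem_total_kb(f.read())
```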

The JVM Xms and Xmx properties accept g|G, m|M, or k|K, meaning the smallest unit is k.
With this in mind, it makes sense to normalize the values to the smallest unit supported by both, which is the kB, and work with it exclusively (when reading, calculating, or writing to file).

Yes, I was converting down to bytes and back, but it indeed makes sense to stop at kB instead and handle everything in kB.

Contributor

@Mehdi-Bendriss Mehdi-Bendriss left a comment


Thanks Pedro!

Comment on lines +71 to +75
if not self.peers_data.get(Scope.UNIT, PERFORMANCE_PROFILE):
    return None
return OpenSearchPerfProfile.from_dict(
    {"typ": self.peers_data.get(Scope.UNIT, PERFORMANCE_PROFILE)}
)

nit for conciseness:

Suggested change:

if not (profile := self.peers_data.get(Scope.UNIT, PERFORMANCE_PROFILE)):
    return None
return OpenSearchPerfProfile.from_dict({"typ": profile})

Contributor

@reneradoi reneradoi left a comment


Looks good! Only one minor thing that could potentially be changed, depending on your preference.

return

perf_profile_needs_restart = False
Contributor


This can be removed; it will be overwritten with True or False in line 738.

@phvalguima phvalguima merged commit cd1c034 into 2/edge Oct 21, 2024
35 of 40 checks passed
@phvalguima phvalguima deleted the DPE-4115-performance-profiles branch October 21, 2024 16:23