Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MBUILDCACHE-104] Allow multiple cache entries per checksum #175

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

reda-alaoui
Copy link
Member

@reda-alaoui reda-alaoui commented Aug 20, 2024

This PR adds the concept of cache zone. There can be multiple zones per checksum.
This allows to store multiple final cache states and to use some zones as input and others as output.

The first input zone with a cache hit "wins". When the cache entry is saved, it is written to all configured output zones. By default, there is one input/output zone named default-zone.


Following this checklist to help us incorporate your
contribution quickly and easily:

  • Make sure there is a MBUILDCACHE JIRA issue filed
    for the change (usually before you start working on it). Trivial changes like typos do not
    require a JIRA issue. Your pull request should address just this issue, without
    pulling in other changes.
  • Each commit in the pull request should have a meaningful subject line and body.
  • Format the pull request title like [MBUILDCACHE-XXX] - Fixes bug in ApproximateQuantiles,
    where you replace MBUILDCACHE-XXX with the appropriate JIRA issue. Best practice
    is to use the JIRA issue title in the pull request title and in the first line of the
    commit message.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will
    be performed on your pull request automatically.
  • You have run the Core IT successfully.

If your pull request is about ~20 lines of code you don't need to sign an
Individual Contributor License Agreement if you are unsure
please ask on the developers list.

To make clear that you license your contribution under
the Apache License Version 2.0, January 2004
you have to acknowledge this by using the following check-box.

@reda-alaoui reda-alaoui marked this pull request as draft August 20, 2024 16:30
@reda-alaoui reda-alaoui marked this pull request as ready for review August 21, 2024 08:14

import static java.util.Objects.requireNonNull;

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having a descriptive class annotation will be helpful

Copy link
Contributor

@AlexanderAshitkin AlexanderAshitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please provide more details on the reasons behind this change? What specific situations or scenarios require introduction of these zones? Personally, I would prefer to initiate a discussion via email with members of the community first in order to gain a better understanding of the problem and to establish a broad consensus on the solution before proceeding with implementation.

@reda-alaoui
Copy link
Member Author

@AlexanderAshitkin the use case was already in https://issues.apache.org/jira/browse/MBUILDCACHE-104 .

I paste it here:

I have the following use case: a CI pipeline with a compile step then multiple parallelized test steps. Tests are split in 3 groups. Each test group execution specify the group via maven-surefire-plugin:test property 'groups'.

Maven phase for compilation: process-test-classes
Maven phase for tests: verify

The compilation should use remote cache as much as possible and end up populating it.
Each test execution should either use the populated cache ending with process-test-classes phase or a cache representing the result of the execution of the current test group (value for 'groups').

The issue I have is that this extension can only store a single cache entry per input checksum. But I need to persist the build result of each test group.

The easiest solution I can think of is having 4 cache zones. The first for the compile phase (the default zone), the 3 remaining for each test group.

I need the compile phase to read/write from/to the default cache zone.
I need each test group to read from its own cache zone then the default cache zone, in this order. At the end, it should write to its own cache zone.

Of course, I can still open a discussion on the mailing list for such a large change.

@AlexanderAshitkin
Copy link
Contributor

Of course, I can still open a discussion on the mailing list for such a large change.

I'm trying to understand that is causing issues and what type of issues it is, in this case with different <groups> plugin parameters.

The problem I'm facing is that the extension can only store one cache entry per input checksum, but I need to save the build result of each test group.

Different plugin parameters result in different hash sums, and these different checksums are stored separately. If there are 3 different build configurations (though profiles or command line parameters) with 3 different groups, it yields 3 different cache records under different checksums. This makes different checksums equivalent to different zones.

Another point that's unclear to me is the statement:

But I need to persist the build result of each test group.

I don't see any problem using classifiers or some other method to differentiate artifacts stored in the build. If we split build in logical parts, we must consider the case of the next plugins, which depend on the "zoned" cache records. And it must run consistently with the non-cached build.

From what I understand now, this scenario is already supported. There are two options:

  1. Having 3 different runs with 3 different groups resulting in 3 different cache records under different checksums.
  2. Within a single build, using the same checksum but storing the results using group-specific classifiers.

We also should consider, that if the plugins in questions are not leaf plugins. Overall, I would like to better understand the problem, starting with the build configuration and the use case.

@reda-alaoui
Copy link
Member Author

reda-alaoui commented Aug 28, 2024

Different plugin parameters result in different hash sums, and these different checksums are stored separately.

This is only true if the plugin parameter is tracked by the effective pom. By default, it is not the case (I use maven flags to pass those parameters). Plus I track those parameters via the reconciliation tag.

If there are 3 different build configurations (though profiles or command line parameters) with 3 different groups, it yields 3 different cache records under different checksums. This makes different checksums equivalent to different zones.

I want a test group step to resume where the compile step ended. If the checksums are different between the 2, that's not possible.

My CI flow is compile -> parallel[test group 1, test group 2, test group 3]. Each test group step should skip any plugin execution already performed by compile step.

I don't see any problem using classifiers or some other method to differentiate artifacts stored in the build. If we split build in logical parts, we must consider the case of the next plugins, which depend on the "zoned" cache records. And it must run consistently with the non-cached build.

When I said persist the build result of each test group, I was meaning that running one of the test group again, without any code change, should skip all plugin executions.

@AlexanderAshitkin
Copy link
Contributor

AlexanderAshitkin commented Aug 28, 2024

My CI flow is compile -> parallel[test group 1, test group 2, test group 3]

was meaning that running one of the test group again, without any code change, should skip all plugin executions.

Please clarify:

  • Does statement 2 imply that the rerun is a different reactor compile -> test group 1 and the expectation is that the results of the initial run are reused?
  • Or does it mean you rerun as compile -> parallel[test group 1, test group 2, test group 3] and expect compile, test group 2, test group 3 to be reused from the cache?
  • Does parallel[test group 1, test group 2, test group 3] refer to three different reactor modules running tests?
  • Are compile and any of test group 1, test group 2, test group 3 parts of the same module?

This is only true if the plugin parameter is tracked by the effective pom.

You can move the parameter to the effective pom. One way to do this is by using a profile that defines a literal property value and referencing that value in a plugin. By doing so, the value will be interpolated to the plugin parameters and become part of the effective pom.

@reda-alaoui
Copy link
Member Author

reda-alaoui commented Aug 28, 2024

Let me try to be more concrete by exposing a simplified example.

I have a maven project with a single module. In this module, I have many prod classes and many slow tests.

My initial naive CI (Jenkins) pipeline is a single step: mvn clean verify. Running this takes 5 hours (30 minutes of compilation + 4h30m of test execution).

My CI can execute jobs among multiple machines. So I decide to distribute the pipelines of my tests among my CI machines.
I split the tests in 3 groups. To assign each test a group, I annotate each test class with a JUnit 5 @Tag annotation with a value of group1, group2 or group3.

I ajust my build cache config as follow:

<?xml version="1.0" encoding="UTF-8" ?>
<cache xmlns="http://maven.apache.org/BUILD-CACHE-CONFIG/1.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://maven.apache.org/BUILD-CACHE-CONFIG/1.0.0 https://maven.apache.org/xsd/build-cache-config-1.0.0.xsd">
  <configuration>
    <remote>
      <url>XXX</url>
    </remote>
    <attachedOutputs>
      <dirNames>
        <dirName>classes</dirName>
        <dirName>test-classes</dirName>
      </dirNames>
    </attachedOutputs>
  </configuration>
  <executionControl>
    <reconcile>
      <plugins>
        <plugin artifactId="maven-surefire-plugin" goal="test">
          <reconciles>
            <reconcile propertyName="skipTests" skipValue="true"/>
            <reconcile propertyName="groups"/>
          </reconciles>
        </plugin>
      </plugins>
    </reconcile>
  </executionControl>
</cache>

My new pipeline is:

  1. mvn clean verify -DskipTests (Expected duration: ~30 minutes)
  2. In parallel:
    1. mvn clean verify -Dgroups=group1 (Expected duration: 4h30m/3)
    2. mvn clean verify -Dgroups=group2 (Expected duration: 4h30m/3)
    3. mvn clean verify -Dgroups=group3 (Expected duration: 4h30m/3)

I want:

Does statement 2 imply that the rerun is a different reactor compile -> test group 1 and the expectation is that the results of the initial run are reused?

Yes

Does parallel[test group 1, test group 2, test group 3] refer to three different reactor modules running tests?

(was Yes) Sorry, no.

Are compile and any of test group 1, test group 2, test group 3 parts of the same module?

Not necessarily. You could have a test group spanning across multiple modules.

You can move the parameter to the effective pom. One way to do this is by using a profile that defines a literal property value and referencing that value in a plugin. By doing so, the value will be interpolated to the plugin parameters and become part of the effective pom.

Yes I know, and as I said I don't want a different checksum between all those steps.

@AlexanderAshitkin
Copy link
Contributor

AlexanderAshitkin commented Aug 30, 2024

@reda-alaoui I am currently trying to better understand this request, and I have some conceptual concerns. To enhance the build time, traditional approach would be to divide the project into smaller modules and allow tests to run in parallel. The propozed zoned approach looks out-of-line with the Maven core design, which is per-project granularity. Cache is not a tool in itself, it is a Maven extension, and the request must be evaluated with the the regular Maven in mind - how to achieve the same goal without using the cache, using regular Maven instead.

As i see, this change aims to address the following Maven limitations:

  1. Lack of plugin level resumption (not consistent with Maven's core - it does not have a concept of Gradle-like tasks)
  2. Issues with artifacts staging and promotion (not consistent with Maven's core - it lacks built-in staging concepts. It is more typical to use different cache endpoints).
  3. Reactor distribution and scaling (not consistent with Maven's core - single reactor runs as a JVM, and results are shared through the artifacts of sub-projects)

In summary, this approach appears unconventional and is chosen because reworking the existing mono-project is more expensive. I see value in this idea because such "mono-projects" are not uncommon, and zones by itself look like a good generalization of "local" cache and it could be handy to split builds by quality levels. But I need some time to digest it. I will take my time to understand why this is necessary and how to address it consistently.

Cheers
Alex

@reda-alaoui
Copy link
Member Author

To enhance the speed of project optimization, a traditional approach would be to divide the project into smaller modules

is chosen because reworking the existing mono-project is more expensive

I do not agree with these assumptions. I have another project having more than 10 modules, where each module run its own integration tests, and the constraints are exactly the same.

and allow tests to run in parallel

I guess you mean inside the same reactor? This doesn't allow to distribute the load across multiple machines. This need is maybe rare in the Maven ecosystem, but this extension is probably heavily used by big projects needing exactly that.

Please note that I applied all of my PRs to a forked version with the distributed CI pipeline. It has been working correctly for almost a week, still counting. Of course, that doesn't mean we can't do better using better concepts or implementations.

@AlexanderAshitkin
Copy link
Contributor

AlexanderAshitkin commented Aug 30, 2024

The distribution is a good example. However, what's unusual here is an attempt to distribute a single-module build. The zones are intended to divide the build into smaller, shareable parts, but there is already a mechanism for that - the projects themselves. The projects (modules) exist to make the build more manageable and optimized by splitting the project into smaller parts. Smaller projects allow more efficient results sharing as well. Sharing intermediate builds is commonly achieved through staging repositories or by sharing results at the file system level.

I can see the value in this change, but it will take some time to evaluate the idea, tradeoffs, and options.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants