Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use 'foreground' delete policy to cancel k8s job #290

Merged
merged 15 commits into from
Dec 4, 2023

Conversation

qwtsc
Copy link
Contributor

@qwtsc qwtsc commented Nov 23, 2023

Purpose of the PR

  • bypass the test case issue, and add more info when error occurs.

Main Changes

Verifying these changes

  • Trivial rework / code cleanup without any test coverage. (No Need)
  • Already covered by existing tests, such as (please modify tests here).
  • Need tests and can be verified as follows.

Does this PR potentially affect the following parts?

  • Nope
  • Dependencies (add/update license info)
  • Modify configurations
  • The public API
  • Other affects (typed here)

Documentation Status

  • Doc - TODO
  • Doc - Done
  • Doc - No Need

coderzc
coderzc previously approved these changes Nov 23, 2023
@coderzc
Copy link
Member

coderzc commented Nov 23, 2023

image

It seems Enum already overrides write toString

@javeme
Copy link
Contributor

javeme commented Nov 23, 2023

maybe we need to implement DefaultJobState.toString()?

@qwtsc
Copy link
Contributor Author

qwtsc commented Nov 24, 2023

maybe we need to implement DefaultJobState.toString()?

Done, and I have found the issue that blocked the minikube test. please review the latest changed files.

Copy link

codecov bot commented Nov 24, 2023

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (ff85e34) 85.03% compared to head (1133673) 85.07%.
Report is 3 commits behind head on master.

Files Patch % Lines
...egraph/computer/algorithm/sampling/RandomWalk.java 92.06% 2 Missing and 3 partials ⚠️
...thm/centrality/betweenness/BetweennessMessage.java 0.00% 2 Missing ⚠️
.../algorithm/community/cc/ClusteringCoefficient.java 0.00% 0 Missing and 1 partial ⚠️
...computer/algorithm/sampling/RandomWalkMessage.java 94.11% 1 Missing ⚠️
...che/hugegraph/computer/driver/DefaultJobState.java 75.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master     #290      +/-   ##
============================================
+ Coverage     85.03%   85.07%   +0.03%     
- Complexity     3246     3278      +32     
============================================
  Files           345      349       +4     
  Lines         12298    12411     +113     
  Branches       1102     1113      +11     
============================================
+ Hits          10458    10559     +101     
- Misses         1315     1320       +5     
- Partials        525      532       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

imbajin
imbajin previously approved these changes Nov 24, 2023
Copy link
Member

@imbajin imbajin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks and could give more context about the issue/bug?

@javeme
Copy link
Contributor

javeme commented Nov 24, 2023

thanks and could give more context about the issue/bug?

it may be this issue: #287 (comment)

@qwtsc
Copy link
Contributor Author

qwtsc commented Nov 27, 2023

I still can't find the root cause. Let's keep this PR open for another two or three days. If it's still mysterious to me, we can append it to PR #287.

@qwtsc
Copy link
Contributor Author

qwtsc commented Nov 27, 2023

@coderzc @javeme @imbajin I finally find the actual reason.

  • the default delete policy in k8s has changed to background since v1.20.
  • Plus the way you added finalizer is replaceCR, so you will sometimes miss the delete event generated by your client.
    private boolean finalizer(HugeGraphComputerJob computerJob) {
        if (computerJob.addFinalizer(FINALIZER_NAME)) {
   --->   // this is partial reason, can u use patch CR instead?
            this.replaceCR(computerJob);
            return true;
        }

        ComputerJobStatus status = computerJob.getStatus();
        if (computerJob.isMarkedForDeletion()) {
            if (!JobStatus.finished(status.getJobStatus())) {
                status.setJobStatus(JobStatus.CANCELLED.name());
                this.updateStatus(computerJob);
            } else {
                if (computerJob.removeFinalizer(FINALIZER_NAME)) {
                    this.replaceCR(computerJob);
                }
            }
            return true;
        } else {
            if (JobStatus.finished(status.getJobStatus())) {
                if (this.autoDestroyPod) {
                    this.deleteCR(computerJob);
                }
                return true;
            }
        }

        return false;
    }
  • k8s will deleted its owned resource in the background whether the CR is deleted or not, so the calculated job status will always remain INITIALIZING, then the operator won't deploy again.
        ComputerJobComponent observed = this.observeComponent(computerJob);
----> //calculated status remains INITIALIZING forever, so won't deploy it again.
        if (!this.updateStatus(observed) && request.retryTimes() == 0) {
            LOG.debug("Wait status to be stable before taking further actions");
            return OperatorResult.NO_REQUEUE;
        }

        if (Objects.equals(computerJob.getStatus().getJobStatus(),
                           JobStatus.RUNNING.name())) {
            String crName = computerJob.getMetadata().getName();
            LOG.info("ComputerJob {} already running, no action", crName);
            return OperatorResult.NO_REQUEUE;
        }

        ComputerJobDeployer deployer = new ComputerJobDeployer(this.kubeClient,
                                                               this.config);
        deployer.deploy(observed);

        return OperatorResult.NO_REQUEUE;
  • Then as human, we observe the unit-test case failed again and again.

solution

My solution is to adjust the delete policy back to foreground. Everyone is happy now.

Copy link
Contributor

@javeme javeme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your detailed analysis~

@imbajin
Copy link
Member

imbajin commented Nov 28, 2023

CI error https://github.com/apache/incubator-hugegraph-computer/actions/runs/7013868991/job/19080685026?pr=290

image

retry

image

BTW, the u could import the hugegraph-style.xml to avoid most of the align/import problems

@qwtsc
Copy link
Contributor Author

qwtsc commented Nov 28, 2023

BTW, the u could import the hugegraph-style.xml to avoid most of the align/import problems

Got it.

image
I cannot repeat your failure, but it failed with another reason. I am concerned with the quality of test cases.

@coderzc
Copy link
Member

coderzc commented Dec 1, 2023

@qwtsc Please correct the title of PR.

@qwtsc qwtsc changed the title chore: add toString in jobstatus feat: use foreground delete policy rather than default background option to pass the test case of cancel job Dec 4, 2023
@qwtsc
Copy link
Contributor Author

qwtsc commented Dec 4, 2023

@qwtsc Please correct the title of PR.

done

@imbajin imbajin requested a review from coderzc December 4, 2023 11:39
@imbajin imbajin changed the title feat: use foreground delete policy rather than default background option to pass the test case of cancel job feat: use 'foreground' delete policy to pass the cancel job test Dec 4, 2023
@imbajin imbajin merged commit 9d4d276 into apache:master Dec 4, 2023
8 checks passed
@javeme javeme changed the title feat: use 'foreground' delete policy to pass the cancel job test feat: use 'foreground' delete policy to pass the cancel job Dec 5, 2023
@javeme javeme changed the title feat: use 'foreground' delete policy to pass the cancel job feat: use 'foreground' delete policy to cancel k8s job Dec 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants