TEZ-4559: Fix Retry logic in case of Recovery #353

Merged
merged 1 commit into from
May 7, 2024

Contributor

@abstractdog abstractdog commented May 6, 2024

Some unit tests were broken by TEZ-4543, which made the client simply return a failed DAG whenever the requested DAG status could not be found. This completely breaks recovery scenarios, where the dagClient may keep asking for the failed DAG's status while the AM restarts after a failure.

Since recovery itself works, the client should simply check whether recovery is enabled and behave accordingly. This patch restores the pre-TEZ-4543 behavior when recovery is enabled. When recovery is disabled, the TEZ-4543 assumption is fair and still lets the client return much faster on the specialized exception that implies the DAG is already lost.

Unit tests have been run manually with this patch: TestDAGRecovery, TestAMRecovery, TestRecovery
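
For readers skimming the thread, the recovery-aware decision described above can be sketched roughly as follows. This is a minimal illustration and not the code in the patch: `DagStatusRetrySketch`, `StatusSource`, `DagNotFoundException`, `getStatus`, and the `maxAttempts` bound are all hypothetical names chosen for the example, and returning `FAILED` once the attempts run out is an assumption rather than something stated in this PR.

```java
public class DagStatusRetrySketch {

    /** Simplified DAG states for illustration; the real client tracks more. */
    public enum State { RUNNING, SUCCEEDED, FAILED }

    /** Hypothetical stand-in for the specialized "DAG is already lost" exception. */
    public static class DagNotFoundException extends Exception {
        public DagNotFoundException(String message) {
            super(message);
        }
    }

    /** Hypothetical source of DAG status, for example an RPC to the AM. */
    public interface StatusSource {
        State fetch() throws DagNotFoundException;
    }

    /**
     * Without recovery, a "DAG not found" answer is treated as a failed DAG
     * right away. With recovery, the client keeps asking, because a restarted
     * AM may still bring the DAG back.
     */
    public static State getStatus(StatusSource source, boolean recoveryEnabled, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return source.fetch();
            } catch (DagNotFoundException e) {
                if (!recoveryEnabled) {
                    // Fast path: nothing can recover the DAG, so report failure now.
                    return State.FAILED;
                }
                // Recovery enabled: fall through and ask again.
            }
        }
        // Assumption for this sketch: give up and report failure after maxAttempts.
        return State.FAILED;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulates an AM that is unavailable for the first two requests.
        StatusSource flaky = () -> {
            if (++calls[0] <= 2) {
                throw new DagNotFoundException("DAG not found");
            }
            return State.RUNNING;
        };
        System.out.println(getStatus(flaky, true, 5) + " after " + calls[0] + " call(s)");
        calls[0] = 0;
        System.out.println(getStatus(flaky, false, 5) + " after " + calls[0] + " call(s)");
    }
}
```

The point of the sketch is only the branch on `recoveryEnabled`: the same exception leads to an immediate failed status in one mode and to another attempt in the other.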

@tez-yetus

🎊 +1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|--------:|:--------|
| +0 🆗 | reexec | 1m 5s | Docker mode activated. |
| | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | No case conflicting files found. |
| +1 💚 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 💚 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| | | | _ master Compile Tests _ |
| +1 💚 | mvninstall | 18m 28s | master passed |
| +1 💚 | compile | 0m 37s | master passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu222.04.1 |
| +1 💚 | compile | 0m 34s | master passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~22.04-b06 |
| +1 💚 | checkstyle | 1m 19s | master passed |
| +1 💚 | javadoc | 0m 50s | master passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu222.04.1 |
| +1 💚 | javadoc | 0m 38s | master passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~22.04-b06 |
| +0 🆗 | spotbugs | 1m 36s | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 💚 | findbugs | 1m 34s | master passed |
| | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 0m 21s | the patch passed |
| +1 💚 | compile | 0m 24s | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu222.04.1 |
| +1 💚 | javac | 0m 24s | the patch passed |
| +1 💚 | compile | 0m 20s | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~22.04-b06 |
| +1 💚 | javac | 0m 20s | the patch passed |
| +1 💚 | checkstyle | 0m 13s | the patch passed |
| +1 💚 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 💚 | javadoc | 0m 24s | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu222.04.1 |
| +1 💚 | javadoc | 0m 26s | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~22.04-b06 |
| +1 💚 | findbugs | 1m 3s | the patch passed |
| | | | _ Other Tests _ |
| +1 💚 | unit | 2m 15s | tez-api in the patch passed. |
| +1 💚 | asflicense | 0m 17s | The patch does not generate ASF License warnings. |
| | | 31m 54s | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-353/2/artifact/out/Dockerfile |
| GITHUB PR | #353 |
| JIRA Issue | TEZ-4559 |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile |
| uname | Linux 76698a6a2b35 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/tez.sh |
| git revision | master / 66a6ca6 |
| Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~22.04-b06 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu222.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~22.04-b06 |
| Test Results | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-353/2/testReport/ |
| Max. process+thread count | 404 (vs. ulimit of 5500) |
| modules | C: tez-api U: tez-api |
| Console output | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-353/2/console |
| versions | git=2.34.1 maven=3.6.3 findbugs=3.0.1 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.

@abstractdog abstractdog requested a review from ayushtkn May 7, 2024 08:00
Member

@ayushtkn ayushtkn left a comment

Changes LGTM.
Tried the Recovery tests locally; all of them pass:

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.tez.test.TestAMRecovery
[INFO] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 186.222 s - in org.apache.tez.test.TestAMRecovery
[INFO] Running org.apache.tez.test.TestDAGRecovery
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 63.794 s - in org.apache.tez.test.TestDAGRecovery
[INFO] Running org.apache.tez.test.TestRecovery
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 489.212 s - in org.apache.tez.test.TestRecovery
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------

@ayushtkn ayushtkn merged commit 7a9211e into apache:master May 7, 2024
4 checks passed
3 participants