TEZ-4589: Counter for the overall duration of succeeded/failed/killed task attempts #382

Merged: 2 commits, Nov 23, 2024

Contributor

@abstractdog abstractdog commented Nov 21, 2024

Tested on a cluster with Hive LLAP, in different scenarios:

  1. two queries running in parallel, leading to task kills; I also killed random nodes during the test
INFO  : Run DAG                               149.16s

INFO  :    NUM_FAILED_TASKS: 18
INFO  :    NUM_KILLED_TASKS: 11078
INFO  :    NUM_SUCCEEDED_TASKS: 774
INFO  :    TOTAL_LAUNCHED_TASKS: 11870
INFO  :    DURATION_FAILED_TASKS_MILLIS: 125559
INFO  :    DURATION_KILLED_TASKS_MILLIS: 2474797
INFO  :    DURATION_SUCCEEDED_TASKS_MILLIS: 15740900

ratio of failed to succeeded durations: 0.0080 = 0.80%
ratio of killed to succeeded durations: 0.1572 = 15.72%

The low killed/succeeded ratio reflects the Hive LLAP behavior: tasks are preempted quite fast, so the 11078 task kills didn't contribute that much.
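
As a sanity check on this interpretation, the ratios and the average time per attempt can be recomputed from the counters of this run. This is a minimal Python sketch: the `ratio` and `mean_millis` helpers are hypothetical names for illustration, not part of Tez or this patch, and it assumes the NUM_* counters count the same attempts that the DURATION_* counters sum over.

```python
# Counters copied from the scenario 1 output above.
counters = {
    "NUM_FAILED_TASKS": 18,
    "NUM_KILLED_TASKS": 11078,
    "NUM_SUCCEEDED_TASKS": 774,
    "DURATION_FAILED_TASKS_MILLIS": 125559,
    "DURATION_KILLED_TASKS_MILLIS": 2474797,
    "DURATION_SUCCEEDED_TASKS_MILLIS": 15740900,
}


def ratio(counters, outcome):
    """Duration of attempts with the given outcome relative to succeeded ones."""
    return (counters[f"DURATION_{outcome}_TASKS_MILLIS"]
            / counters["DURATION_SUCCEEDED_TASKS_MILLIS"])


def mean_millis(counters, outcome):
    """Average duration of one attempt with the given outcome, in milliseconds.

    Assumption: NUM_<outcome>_TASKS counts the attempts behind the duration sum.
    """
    return (counters[f"DURATION_{outcome}_TASKS_MILLIS"]
            / counters[f"NUM_{outcome}_TASKS"])


for outcome in ("FAILED", "KILLED"):
    print(f"{outcome.lower()}/succeeded duration ratio: {ratio(counters, outcome):.4f}")

for outcome in ("FAILED", "KILLED", "SUCCEEDED"):
    print(f"mean {outcome.lower()} attempt: {mean_millis(counters, outcome):.0f} ms")
```

With these inputs a killed attempt averages a few hundred milliseconds while a succeeded one averages around 20 seconds, which is consistent with fast preemption.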

  2. single query, but more aggressive node killing
INFO  : Run DAG                               392.71s

INFO  :    NUM_FAILED_TASKS: 210
INFO  :    NUM_KILLED_TASKS: 334
INFO  :    NUM_SUCCEEDED_TASKS: 880
INFO  :    TOTAL_LAUNCHED_TASKS: 1318
INFO  :    DURATION_FAILED_TASKS_MILLIS: 14433036
INFO  :    DURATION_KILLED_TASKS_MILLIS: 12815980
INFO  :    DURATION_SUCCEEDED_TASKS_MILLIS: 31076258

ratio of failed to succeeded durations: 0.4644 = 46.44%
ratio of killed to succeeded durations: 0.4124 = 41.24%

By the way, this time the task kills were in the early-started Reducer 5 tasks, as many Map 4 source tasks had failed (so the reducer was most probably considered unhealthy):

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 2 ..........      llap     SUCCEEDED      1          1        0        0       0       0
Map 1 ..........      llap     SUCCEEDED      1          1        0        0       0       0
Map 3 ..........      llap     SUCCEEDED      5          5        0        0       1       0
Map 4 ..........      llap     SUCCEEDED    675        675        0        0     209       0
Reducer 5 ......      llap     SUCCEEDED     92         92        0        0       0     334 
----------------------------------------------------------------------------------------------
VERTICES: 05/05  [==========================>>] 100%  ELAPSED TIME: 394.27 s
----------------------------------------------------------------------------------------------

  3. normal query, no contention, no node failures
INFO  :    NUM_SUCCEEDED_TASKS: 774
INFO  :    TOTAL_LAUNCHED_TASKS: 774
INFO  :    DURATION_SUCCEEDED_TASKS_MILLIS: 15253902
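
The failed and killed counters do not appear in this output, so any script that post-processes these lines has to tolerate missing counters. A minimal Python sketch, assuming the `INFO  :    NAME: value` line shape shown above and treating a missing counter as 0 (the `parse_counters` helper is a hypothetical name, not a Tez or Hive API):

```python
import re

# Matches lines such as "INFO  :    NUM_SUCCEEDED_TASKS: 774".
COUNTER_LINE = re.compile(r"^INFO\s*:\s+([A-Z_]+):\s+(\d+)\s*$")


def parse_counters(log_text):
    """Return {counter_name: int_value} for every counter line in log_text."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


log_text = """\
INFO  :    NUM_SUCCEEDED_TASKS: 774
INFO  :    TOTAL_LAUNCHED_TASKS: 774
INFO  :    DURATION_SUCCEEDED_TASKS_MILLIS: 15253902
"""

counters = parse_counters(log_text)
succeeded = counters["DURATION_SUCCEEDED_TASKS_MILLIS"]
# Assumption: a counter absent from the output means no such attempts occurred.
failed = counters.get("DURATION_FAILED_TASKS_MILLIS", 0)
killed = counters.get("DURATION_KILLED_TASKS_MILLIS", 0)

print(f"failed/succeeded: {failed / succeeded:.4f}")
print(f"killed/succeeded: {killed / succeeded:.4f}")
```

Lines that are not counters, such as the `Run DAG` timing line, do not match the pattern and are skipped.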

@abstractdog abstractdog force-pushed the TEZ-4589 branch 3 times, most recently from 2f0af7c to 411e095, November 21, 2024 09:07
@apache apache deleted a comment from tez-yetus Nov 21, 2024
@tez-yetus

This comment was marked as outdated.

@abstractdog abstractdog changed the title from "TEZ-4589: Counter for the overall duration of failed/killed task attempts" to "TEZ-4589: Counter for the overall duration of succeeded/failed/killed task attempts" Nov 21, 2024
@abstractdog abstractdog requested a review from ayushtkn November 21, 2024 15:24

Member

@ayushtkn ayushtkn left a comment

LGTM

@tez-yetus

🎊 +1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|--------:|:--------|
| +0 🆗 | reexec | 0m 9s | Docker mode activated. |
| | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | No case conflicting files found. |
| +1 💚 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 💚 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| | | | _ master Compile Tests _ |
| +0 🆗 | mvndep | 6m 3s | Maven dependency ordering for branch |
| +1 💚 | mvninstall | 11m 6s | master passed |
| +1 💚 | compile | 0m 48s | master passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04 |
| +1 💚 | compile | 0m 47s | master passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga |
| +1 💚 | checkstyle | 1m 10s | master passed |
| +1 💚 | javadoc | 0m 56s | master passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04 |
| +1 💚 | javadoc | 0m 43s | master passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga |
| +0 🆗 | spotbugs | 0m 45s | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 💚 | findbugs | 2m 0s | master passed |
| | | | _ Patch Compile Tests _ |
| +0 🆗 | mvndep | 0m 7s | Maven dependency ordering for patch |
| +1 💚 | mvninstall | 0m 32s | the patch passed |
| +1 💚 | compile | 0m 31s | the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04 |
| +1 💚 | javac | 0m 31s | the patch passed |
| +1 💚 | compile | 0m 28s | the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga |
| +1 💚 | javac | 0m 28s | the patch passed |
| +1 💚 | checkstyle | 0m 18s | the patch passed |
| +1 💚 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 💚 | javadoc | 0m 19s | the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04 |
| +1 💚 | javadoc | 0m 24s | the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga |
| +1 💚 | findbugs | 1m 23s | the patch passed |
| | | | _ Other Tests _ |
| +1 💚 | unit | 1m 57s | tez-api in the patch passed. |
| +1 💚 | unit | 4m 14s | tez-dag in the patch passed. |
| +1 💚 | asflicense | 0m 21s | The patch does not generate ASF License warnings. |
| | | 35m 14s | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-382/9/artifact/out/Dockerfile |
| GITHUB PR | #382 |
| JIRA Issue | TEZ-4589 |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile |
| uname | Linux 2b0776be3846 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/tez.sh |
| git revision | master / 53309ea |
| Default Java | Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga |
| Test Results | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-382/9/testReport/ |
| Max. process+thread count | 645 (vs. ulimit of 5500) |
| modules | C: tez-api tez-dag U: . |
| Console output | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-382/9/console |
| versions | git=2.34.1 maven=3.6.3 findbugs=3.0.1 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.

@abstractdog abstractdog merged commit b5bf8dc into apache:master Nov 23, 2024
4 checks passed