Feedback on test reproduction quirks for the test team #4548

Closed · fjeremic opened this issue Feb 1, 2019 · 12 comments

@fjeremic (Contributor) commented Feb 1, 2019

While attempting to launch Grinders and reproduce #4526 locally I kept notes of some of the quirks I encountered or issues I stumbled upon. Some of these have been detailed in the various documentation, others are unspecified. I hope this feedback can be used to improve our documentation and/or processes for debugging:


Pain Points:

  • It is not clear from a failed test what JVM command line options were used
  • For a particular test failure it is non-obvious which test bucket the test belongs to
    • Is it functional? Is it systemtest? Is it some Adopt test?
    • Other than going to the ~4 different repos and searching for the test name is there a better way to know?
    • Need to know this so one can fill in the BUILD_LIST in the Grinder
  • From a test failure looking at the java -version output it is not clear where to download the JDK
    • For example, I know the SHAs and the build date and number, "20190130_208 (JIT enabled, AOT enabled)", but where do I download this build?
    • I can see the curl command in the Grinder output so I can find the JDK from there, but why shouldn't we be able to extract the build ID from java -version somehow? If someone just pasted me the java -version output I would have no idea how to grab that same build, and that is a problem.
  • STF tests don't show crash information in the console output
    • You always have to dig through the artifacts to find it, which is time consuming
  • Why do we need to export JDK_VERSION when running tests? Should we be able to determine that from $JAVA_BIN/java -version output?
  • When attempting to reproduce [1] locally, the make compile command from the instructions in [2] seems to compile all tests (sometimes compiling ~6000 Java source files), however a Grinder launched for the same test seems to only compile and run the one specific test [3]. Why is that? How can I locally do the same thing as the Grinder? i.e. I only want to compile and run the one test I care about.

[1] #4526
[2] https://github.com/eclipse/openj9/wiki/Reproducing-Test-Failures-Locally#run-sanity-system-tests-on-jdk10_x86-64_linux_openj9-sdk
[3] https://hyc-runtimes-jenkins.swg-devops.com/view/Test_grinder/job/Grinder/1363/consoleText


Issues:

  • https://github.com/eclipse/openj9/wiki/Reproducing-Test-Failures-Locally#general-steps
    • ./get.sh cannot be run without setting JAVA_BIN first, which appears to be step 6, so the instructions need to be updated
      • System tests also seem to export JAVA_HOME as ../../ relative to JAVA_BIN, which means JAVA_BIN has to be the "jre/bin" directory, not the "bin" directory. This is non-obvious.
    • The last step, make _sanity.system, does not work due to a ClassNotFoundException for net.adoptopenjdk.stf.runner.StfClassLoader being thrown. Exporting both JAVA_BIN and JAVA_HOME, as suggested by others, does not seem to work on s390 Linux.
GEN stderr Exception in thread "main" java/lang/Error: java.lang.ClassNotFoundException: net.adoptopenjdk.stf.runner.StfClassLoader
GEN stderr      at java/lang/ClassLoader.getSystemClassLoader (ClassLoader.java:781)
GEN stderr      at java/lang/Thread.completeInitialization (Thread.java:166)
GEN stderr      at java/lang/J9VMInternals.completeInitialization (J9VMInternals.java:72)
Generation failed

The issue seems to be that system tests define JAVA_HOME themselves by exporting $JAVA_BIN/../../; however, the instructions in [1] specify that JAVA_BIN should be /someLocation/bin, which appears to be incorrect, as the instructions state to download/unpack the SDK to /someLocation.

System tests seem to expect JAVA_BIN to be /someLocation/jre/bin, not /someLocation/bin. After changing this and rerunning make -f run_configure.mk and make compile, things now work.
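
A minimal sketch of the sequence that ended up working for me locally (paths are placeholders, and the get.sh arguments are as documented in the wiki [2]):

export JAVA_BIN=/someLocation/jre/bin   # not /someLocation/bin; system tests derive JAVA_HOME as $JAVA_BIN/../..
cd openjdk-tests
./get.sh ...                            # requires JAVA_BIN to be set first
make -f run_configure.mk
make compile
make _sanity.system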


General Feedback:

  • Tests seem to run in huge buckets per one Jenkins job as opposed to much smaller buckets per Jenkins job. This makes re-running a test tedious and involves a lot of manual work as opposed to VMFarm-esque clicking a "Re-run on Grinder" button and launching a reproduction batch.
  • Grinder tests are sequential which is very time consuming when it comes to reproducing issues which are intermittent (1/50 failures take several hours as opposed to a few minutes to reproduce)
    • Sometimes a failure occurs in the middle of a grinder, say the 5th job out of 50. Is there a way to "kill" the grinder and just get the data for the failure at that point without running through the other 45 iterations?
  • Test material seems scattered all over the place and it is not easy to find tests
  • Getting machine access is non-trivial (impossible?) which makes reproducing issues which only appear to happen on farm machines very difficult

@fjeremic (Contributor Author) commented Feb 1, 2019

@smlambert @llxia FYI. I'm more than happy to help with any of the above.

@pshipton (Member) commented Feb 1, 2019

For a particular test failure it is non-obvious which test bucket the test belongs to

It is in the link i.e. https://ci.eclipse.org/openj9/job/Test-extended.system-JDK8-linux_390-64_cmprssptrs/181/
extended.system indicates system testing.

but where do I download this build?

You need to look at the parent job(s) of the test failure, find the build job for the platform, and the jvm is an artifact of the job. It is only available for a short time, depending on how many other builds are run, as we have limited space.

@llxia (Contributor) commented Feb 1, 2019

I will try to clarify some of the questions.

It is not clear from a failed test what JVM command line options were used
Example: https://ci.eclipse.org/openj9/job/Test-extended.system-JDK8-linux_390-64_cmprssptrs/181/

TKG does print this information at the beginning of each test, including JVM_OPTIONS:

===============================================
Running test SharedClassesAPI_0 ...
===============================================
SharedClassesAPI_0 Start Time: Thu Jan 31 03:14:37 2019 Epoch Time (ms): 1548922477290
variation: NoOptions
JVM_OPTIONS:  -Xcompressedrefs 

As a result it is not clear how to add EXTRA_OPTIONS or JVM_OPTIONS to a Grinder

It is documented in:
https://github.com/AdoptOpenJDK/openjdk-tests/wiki/How-to-Run-a-Grinder-Build-on-Jenkins
https://github.com/eclipse/openj9/blob/master/test/docs/OpenJ9TestUserGuide.md
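
A hedged example from my reading of those docs (the test target below is a placeholder, not a real test): extra JVM options can be passed on the make command line when running locally, and the same value can be entered as a Grinder build parameter:

make _MyTestTarget_0 EXTRA_OPTIONS="-Xint"   # per the user guide, EXTRA_OPTIONS is appended to the default options for the variation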

For a particular test failure it is non-obvious which test bucket the test belongs to
Is it functional? Is it systemtest? Is it some Adopt test?
Other than going to the ~4 different repos and searching for the test name is there a better way to know?
Need to know this so one can fill in the BUILD_LIST in the Grinder

The information is in the job name. For example, Test-extended.functional-JDK11-linux_x86-64_cmprssptrs means running extended functional test using JDK11 on linux_x86-64_cmprssptrs. For system test, you should see system in the job name.

From a test failure looking at the java -version output it is not clear where to download the JDK
For example, I know the SHAs and the build date and number, "20190130_208 (JIT enabled, AOT enabled)", but where do I download this build?

In Openj9 Jenkins, we can get parameters from test build https://ci.eclipse.org/openj9/view/Test/job/Test-extended.functional-JDK11-linux_x86-64_cmprssptrs/169/parameters/

It shows UPSTREAM_JOB_NAME and UPSTREAM_JOB_NUMBER. We should be able to find the build in Jenkins Build tab. Once we find the exact JDK build, we can Copy Link Address of the archived JDK.

Or another way to do this #3697 (comment)

I can see the curl command in the Grinder output so I can find the JDK from there, but why shouldn't we be able to extract the build ID from java -version somehow? If someone just pasted me the java -version output I would have no idea how to grab that same build, and that is a problem.

The Grinder can take a JDK from any public URL (e.g., AdoptOpenJDK, Artifactory, etc.). We may not have enough information to determine the OpenJ9 build ID.

STF tests don't show crash information in the console output
You always have to dig through the artifacts to find it, which is time consuming

@Mesbah-Alam we may need to update STF to handle this.

Why do we need to export JDK_VERSION when running tests? Should we be able to determine that from $JAVA_BIN/java -version output?

We are working on this in #442. The idea is that we should not need to provide JDK_VERSION, JDK_IMPL, and SPEC; all of that information can be auto-detected when JAVA_BIN is provided.
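
Until that lands, a sketch of what still needs to be exported today (the values below are illustrative only):

export JAVA_BIN=/someLocation/jre/bin
export JDK_VERSION=8
export JDK_IMPL=openj9
export SPEC=linux_390-64_cmprssptrs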

When attempting to reproduce [1] locally, the make compile command from the instructions in [2] seems to compile all tests (sometimes compiling ~6000 Java source files), however a Grinder launched for the same test seems to only compile and run the one specific test [3]. Why is that? How can I locally do the same thing as the Grinder? i.e. I only want to compile and run the one test I care about.

We can use BUILD_LIST to narrow down to the folder that we care about. This is documented in FAQ
Maybe we should add a link to FAQ in https://github.com/eclipse/openj9/wiki/Reproducing-Test-Failures-Locally
Note: this feature only works for subdirs in functional atm. Support for systemtest is on the way.
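
A hedged sketch of what that looks like locally (both the directory and the test target below are placeholders):

export BUILD_LIST=functional/cmdLineTests   # compile only this folder instead of all tests
make compile
make _MyTest_0                              # run just the target of interest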

Tests seem to run in huge buckets per one Jenkins job as opposed to much smaller buckets per Jenkins job. This makes re-running a test tedious and involves a lot of manual work as opposed to VMFarm-esque clicking a "Re-run on Grinder" button and launching a reproduction batch.

The test job does not have all parameters defined in its config, so rebuild may not work. One item on our to-do list is to auto-generate test jobs so that we can avoid this issue.

Grinder tests are sequential which is very time consuming when it comes to reproducing issues which are intermittent (1/50 failures take several hours as opposed to a few minutes to reproduce)
Sometimes a failure occurs in the middle of a grinder, say the 5th job out of 50. Is there a way to "kill" the grinder and just get the data for the failure at that point without running through the other 45 iterations?

An issue has been created: adoptium/aqa-tests#836
Once parallel execution is enabled, an iteration count of 50 means starting 50 separate jobs, and we can kill any of them in the middle of the Grinder run.

Getting machine access is non-trivial (impossible?) which makes reproducing issues which only appear to happen on farm machines very difficult

Unfortunately, the test team does not have control of machine access. FYI @jdekonin

@fjeremic (Contributor Author) commented Feb 1, 2019

It is in the link i.e. ci.eclipse.org/openj9/job/Test-extended.system-JDK8-linux_390-64_cmprssptrs/181
extended.system indicates system testing.

Right, this is also obvious from the test name. So where do I find this test? Is "extended.system" == "systemtest"? That part is confusing, at least to me.

You need to look at the parent job(s) of the test failure, find the build job for the platform, and the jvm is an artifact of the job. It is only available for a short time, depending on how many other builds are run, as we have limited space.

Using your example:
https://ci.eclipse.org/openj9/job/Test-extended.system-JDK8-linux_390-64_cmprssptrs/181/

I navigate to "build number 850", then to "build number 383", then I seem to be at the top level for this nightly build:
https://ci.eclipse.org/openj9/job/Pipeline-Build-Test-All/383/

I fail to see how to navigate to the build artifact you describe. Can you describe the steps from here?

@fjeremic (Contributor Author) commented Feb 1, 2019

It is documented in:
AdoptOpenJDK/openjdk-tests/wiki/How-to-Run-a-Grinder-Build-on-Jenkins
/test/docs/OpenJ9TestUserGuide.md@master

There are quirks. For example, it is non-obvious how to input the following option:

-Xjit:{java/lang/SomeClass.foo()I}(tracefull,log=foo.trace)

Through experimentation and help from others, it seems you have to double-quote the full option and escape the quotes, so the actual thing you have to input is:

\"-Xjit:{java/lang/SomeClass.foo()I}(tracefull,log=foo.trace)\"

In Openj9 Jenkins, we can get parameters from test build ci.eclipse.org/openj9/view/Test/job/Test-extended.functional-JDK11-linux_x86-64_cmprssptrs/169/parameters

It shows UPSTREAM_JOB_NAME and UPSTREAM_JOB_NUMBER. We should be able to find the build in Jenkins Build tab. Once we find the exact JDK build, we can Copy Link Address of the archived JDK.

Or another way to do this #3697 (comment)

Neither of these seems to work for the test failure example at hand from #4526.

Knowing how to navigate to the build artifact from a test failure would be good; however, my original question was whether there is a way to locate the build artifact using only the java -version output, which I can always find inside a test failure console log.


We can use BUILD_LIST to narrow down to the folder that we care about. This is documented in FAQ
Maybe we should add a link to FAQ in eclipse/openj9/wiki/Reproducing-Test-Failures-Locally
Note: this feature only works for subdirs in functional atm. Support for systemtest is on the way.

Ah I see, I think I encountered the systemtest limitation here then.

Thanks for all the answers!

@llxia (Contributor) commented Feb 1, 2019

I fail to see how to navigate to the build artifact you describe. Can you describe the steps from here?

You do not need to get to build number 383. The information is in console output of build number 850

Hopefully, this comment lists the steps clearly #3697 (comment)

@llxia (Contributor) commented Feb 1, 2019

JDK build 1178 passed but does not have the JDK archived. I do see the tar command in the console.
https://ci.eclipse.org/openj9/view/Build/job/Build-JDK8-linux_390-64_cmprssptrs/1178/console

And the next nightly build has the JDK archived: https://ci.eclipse.org/openj9/view/Build/job/Build-JDK8-linux_390-64_cmprssptrs/1184/

@AdamBrousseau Is there a limitation on how long the artifacts are kept?

@fjeremic (Contributor Author) commented Feb 1, 2019

You do not need to get to build number 383. The information is in console output of build number 850

Hopefully, this comment lists the steps clearly #3697 (comment)

Right but there is no archive link anywhere.

JDK build 1178 passed but does not have the JDK archived. I do see the tar command in the console.
ci.eclipse.org/openj9/view/Build/job/Build-JDK8-linux_390-64_cmprssptrs/1178/console

Yeah it says:

23:51:16 ARTIFACTORY server is not set saving artifacts on jenkins.

Not sure why it worked on the very next build. Does that mean I can't get hold of the exact JDK binary package used by that build (without having to rebuild the entire JDK from the SHAs)?

@AdamBrousseau (Contributor) commented
We only have space to keep 10 artifacts per build at the moment

@pshipton (Member) commented Feb 1, 2019

We should be able to pass in the build number and have it appear in the -version output; I am fairly certain there is a configure parameter to allow this. In the part below, I think we could change the +0 to be the build number.
(build 11.0.2-internal+0-adhoc.jenkins.Build-JDK11-linuxx86-64cmprssptrs)
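
If the standard OpenJDK configure option applies here (an assumption on my part), that might look roughly like:

bash configure --with-version-build=208   # assumption: would yield ...+208 instead of ...+0 in java -version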

@smlambert (Contributor) commented
A lot of these items have now been addressed through several major updates and added features, including but not limited to:

  • variation (from playlist) and JVM_OPTIONS used are printed at the start of each test run
    Example console output:
    15:27:52 variation: NoOptions
    15:27:52 JVM_OPTIONS: -Xcompressedrefs

  • Re-run link for easier prepopulation of Grinder parameters

  • AUTODETECT, so if you use customized/upstream SDK_RESOURCE, you no longer need to tell TKG what JDK_VERSION/JDK_IMPL it is

  • removed the "make -f run_configure.mk" step, to simplify test runs even further

  • better doc to ensure developers know how to utilize BUILD_LIST to control which directories get compiled

  • new logical target called _testList (to allow a custom list of test targets to be passed to TKG, therefore to a Grinder/test job/workflow)

  • rename all directories in openjdk-tests repo to match the test group names (system == system, external == external, etc)

  • simplification of using get.sh (can now simply clone openjdk-tests, export TEST_JDK_HOME=/whereEverYouPutYourJDK and then run get.sh with no arguments; see the sketch after this list)

  • centralization of test doc can be tracked via EPIC: Centralize and update test documentation adoptium/aqa-tests#1558

  • smart parallelization work can be tracked via Support for smart parallelization in TKG / TRSS adoptium/aqa-tests#1563
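
A quick sketch of the simplified get.sh flow mentioned above (the JDK path is a placeholder):

git clone https://github.com/AdoptOpenJDK/openjdk-tests.git
cd openjdk-tests
export TEST_JDK_HOME=/whereEverYouPutYourJDK
./get.sh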

Most other items have been addressed in the comments above. The suggested enhancement to STF output should be raised against the STF repo, though I do not believe it will get any priority (no resources to spare), and STF output is already too verbose (we would want to reduce noise before adding new 'content' to the output stream).

Given all of that, I believe we can/should close this issue, @fjeremic ?

@fjeremic (Contributor Author) commented
Agreed. Many thanks to the test team, who invested resources into fixing most of these issues. I certainly have observed the improvements and am very grateful for the investment in this area. Thank you!

mpirvu added a commit to mpirvu/openj9 that referenced this issue May 7, 2020
GCR (guarded counting recompilation) is a mechanism that is supposed
to upgrade cold or AOT compilations to warm opt level. In theory
it should work with any opt level, but due to a bug it seems that
GCR is incompatible with hot compilations (see eclipse-openj9#4548 for details).
To allow stress testing with hot compilations this commit disables
GCR for hot compilations. This should have no bearing on behavior
in production because the natural way of upgrading hot compilations
to scorching is through sampling.

Fixes: eclipse-openj9#4445 eclipse-openj9#8064

Signed-off-by: Marius Pirvu <[email protected]>