
Fix test failures with mpich 4+. #789

Conversation

@vehre (Collaborator) commented Oct 9, 2024

Summary of changes

Work around a possible bug in mpich 4+ that requires pending messages to be cleared from a communicator before it is freed.

Rationale for changes

With mpich 4.1.2, several tests failed in the finalization phase because mpich reported that messages were still pending on a communicator that was being freed. This patch reads all pending messages before freeing the communicator, which fixes the issue.

A second issue was that the openmpi flag was not propagated into the top-level CMakeLists.txt via set(... PARENT_SCOPE). This led to test errors because the generated hostsfile was not referenced and other openmpi-specific flags were not set. The flag is now a global property defined in the top-level CMakeLists.txt, which fixes the issue.

Tested with mpich 4.1.2, openmpi 4.1.5 and intel mpi 2021.13

Fixes issues #781, #783, #769 and #761.

Additional info and certifications

This pull request (PR) is a:

  • Bug fix
  • Feature addition
  • Other, Please describe:

  • I certify that:
    • I have reviewed and followed the contributing guidelines
    • I will wait at least 24 hours before self-approving the PR to give another
      OpenCoarrays developer a chance to review my proposed code
    • I have not introduced errant white space (no trailing white space or white space errors may
      be introduced)
    • I have added an explanation of what these changes do and why they should be included
    • I have checked to ensure there aren't other open Pull Requests for the same change
    • I have written new tests for these changes
    • I have successfully tested these changes locally
    • I have commented any non-trivial, non-obvious code changes
    • The commits are logically atomic, self-consistent and coherent
    • The commit messages follow best practices
    • Test coverage is maintained or increased after this is merged

Code coverage data

coverage on master

From mpich 4.0 on, there appears to be a bug when a message is still
pending while the communicator is being freed.
Using just a variable to indicate openmpi did not work reliably with
all generators: at least with Ninja, the variable was not set in the
top-level CMakeLists.txt, so additional options were not set, which
led to test failures.

@zbeekman (Collaborator) left a comment

LGTM! Thanks @vehre

@rouson merged commit 6636fc5 into vehre/issue-759-fix-parallel-build-issues Oct 11, 2024
4 checks passed
@rouson deleted the vehre/issue-781-clear-communicator-4-mpich branch October 11, 2024 15:07