Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix segfault in ~Region #3108

Merged
merged 3 commits into from
Jan 5, 2024
Merged

Conversation

njzjz
Copy link
Member

@njzjz njzjz commented Jan 5, 2024

When debuging #3103, I notice a segfault in ~Region, preventing the actual error message thrown. _norm_copy_coord_gpu replaces boxt and rec_boxt of Region with GPU pointers, runs deepmd::copy_coord_gpu, and finnally recover the original pointer that can be deleted. However, deepmd::copy_coord_gpu might throw CUDA errors for any reason, and the pointers are not recovered. ~Region tries to delete a pointer that it doesn't own, causing the segfault. The CUDA error message is not visible due to segfault. The segfault in #2895 may be also caused by it.
This PR adds a new constructor to Region to accept the external pointers. ~Region will delete boxt andrec_boxt only when the pointer is not external.
We still need to figure out the reason for the error of copy_coord_gpu behind the segfault.

When debuging deepmodeling#3103, I notice a segfault in `~Region`, preventing the actual error message thrown. `_norm_copy_coord_gpu` replaces `boxt` and `rec_boxt` of `Region` with GPU pointers, runs `deepmd::copy_coord_gpu`, and finnally recover the original pointer that can be deleted. However, `deepmd::copy_coord_gpu` might throw CUDA errors for any reason, and the pointers are not recovered. `~Region` tries to delete a pointer that it doesn't own, causing the segfault. The CUDA error message is not visible due to segfault. The segfault in deepmodeling#2895 may be also caused by it.
This PR adds a new constructor to `Region` to accept the external pointers. `~Region` will delete `boxt` and`rec_boxt` only when the pointer is not external.
We still need to figure out the reason for the error of `_norm_copy_coord_gpu` behind the segfault.

Signed-off-by: Jinzhe Zeng <[email protected]>
@njzjz njzjz added the bug label Jan 5, 2024
@njzjz njzjz requested a review from wanghan-iapcm January 5, 2024 02:56
@njzjz njzjz added the Test CUDA Trigger test CUDA workflow label Jan 5, 2024
@github-actions github-actions bot added Core OP and removed Test CUDA Trigger test CUDA workflow labels Jan 5, 2024
Signed-off-by: Jinzhe Zeng <[email protected]>
Copy link

codecov bot commented Jan 5, 2024

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (a905817) 75.86% compared to head (2296573) 75.85%.
Report is 1 commits behind head on devel.

Files Patch % Lines
source/lib/src/region.cc 37.50% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #3108      +/-   ##
==========================================
- Coverage   75.86%   75.85%   -0.02%     
==========================================
  Files         247      247              
  Lines       25075    25081       +6     
  Branches     1597     1598       +1     
==========================================
+ Hits        19023    19024       +1     
- Misses       5114     5118       +4     
- Partials      938      939       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@njzjz njzjz added the Test CUDA Trigger test CUDA workflow label Jan 5, 2024
@github-actions github-actions bot removed the Test CUDA Trigger test CUDA workflow label Jan 5, 2024
Signed-off-by: Jinzhe Zeng <[email protected]>
@github-actions github-actions bot added the Python label Jan 5, 2024
@njzjz njzjz added the Test CUDA Trigger test CUDA workflow label Jan 5, 2024
@github-actions github-actions bot removed the Test CUDA Trigger test CUDA workflow label Jan 5, 2024
@wanghan-iapcm wanghan-iapcm merged commit 61ee4f2 into deepmodeling:devel Jan 5, 2024
48 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants