OpenMP NUMA first touch replay does not exactly reproduce the original behavior #190
Thanks for the report! Do you know why the OpenMP region that touches the pages is not always exact? I don't quite see why.
I think that the issue is how/when/over which pages we currently call strncpy to perform the first touch. Using libnuma just provides more information (where a page is) and can actually move a page even if it was already touched before. So I agree with you: if a page was not touched before, there is no difference between using strncpy and libnuma. I tested libnuma by mapping the pages on the second codelet iteration replay and got the same execution time as when CERE_FIRST_TOUCH was unset. In the current replay version, a thread touches its own pages with the call:
Where (char *)(address + read_bytes) is the address of the page that the function call touches. At first, I thought that this call was wrong and that the function should instead be called as: To summarize, I think that fixing this issue with strncpy alone is the best solution, but it requires understanding why pages were touched differently. On the other hand, libnuma allows us to faithfully assign pages to cores but introduces both an overhead (due to page migration) and a library dependency. So libnuma is the quick fix.
Thanks for the feedback :-) I agree with your analysis; if you want to contribute a PR, you are more than welcome!
Changed the way pages are mapped: instead of calling a parallel region, we use move_pages. Also replaced strncpy with memcpy; strncpy was not a good choice because we are moving binary data, and data was not properly initialized during the first codelet iteration replay with NUMA.

Tested the updated replay with BT zsolve on a 4-NUMA-node system. Times in cycles.

Execution time with the application:
- original: 266238260
- all pages of the region mapped to node 0 within the original run: 681839968

Execution time with codelets:
- current commit, reproduce original: 249310636
- current commit, reproduce all pages on node 0: 660664332
- older commit, reproduce original: 201484164
- older commit, reproduce all pages on node 0: 202323544
- single thread loads the data: 661585616

The replay now reproduces the NUMA behavior across the two extreme configurations. Two points remain to fix/discuss: 1) missing pages during capture, 2) getting at replay the number of threads used during capture.
Here is an update on this bug. Currently, CERE's NUMA behavior has two issues:
Missing pages: to address this issue, simply do a FT dump at the end of the capture process (or at capture invocation 2, as long as the same pages are touched across different invocations).

Wrong first-touch page information: the capture performs two full memory locks, one at the start of the application and a second right before the parallel region. The first lock helps to identify when a page is accessed for the first time by a thread, while the second is used for the dumping process. However, heap-allocated pages are not yet allocated at the start of the application, so the first lock misses them; accesses to these pages before the parallel region can therefore go undetected. To address this issue, CERE must override all the memory-allocating functions to lock heap-allocated pages; in particular, mtrace (in tracee.c) must be activated. Not all pages allocated by malloc should be locked: use /proc/pid/maps to identify which address ranges must be avoided. A single read of /proc/pid/maps at the beginning of the application can detect these ranges; tracer_lock_range must then not be called on pages within them.
Pages are allocated on NUMA systems with the lazy first-touch policy: a page is mapped to the NUMA domain of the thread that first touches it. To ensure faithful codelet replay on NUMA systems, CERE must map the pages as they were in the original run.
At replay, CERE uses an OpenMP region to touch the previously recorded pages with strncpy. While this method is more faithful to the original run than touching all the pages from a serial region of code, it does not always reproduce the original mapping.
We ran the following test to show that the current NUMA mapping is not correct. We focus on the parallel region rhs from SP OMP on 4 NUMA nodes and consider two versions. First, we use a first-touch file in which all the pages are touched by the master thread. Second, we unset CERE_FIRST_TOUCH so that all the pages are touched within a serial region. These two versions should have the same performance, yet the first is 25% faster.
One way to address this issue is to use libnuma. In particular, the function numa_move_pages moves pages to specific NUMA domains; it can also be used to check the actual placement of a page.