
OpenMP NUMA first touch replay does not exactly reproduce the original behavior #190

Open
mihailpopov opened this issue Feb 13, 2018 · 4 comments

mihailpopov commented Feb 13, 2018

Pages are allocated in NUMA systems with the lazy first touch policy: a page is mapped to the NUMA domain of the thread that first touches it. To ensure faithful codelet replay on NUMA systems, CERE must map the pages as they were in the original run.
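For illustration, here is a minimal OpenMP example of first-touch placement (not CERE code, compile with -fopenmp): under the default Linux policy, each page of the array ends up on the node of the thread that writes it first.

    /* Sketch: parallel first touch distributes the pages of "a" across
     * the NUMA nodes running the threads; a serial initialization loop
     * would instead map every page on the master thread's node. */
    #include <stdlib.h>

    #define N (1 << 24)

    int main(void) {
        double *a = malloc(N * sizeof(double));

        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 0.0;   /* first write: page placed on this thread's node */

        free(a);
        return 0;
    }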

At replay, CERE uses an OpenMP region to touch the previously recorded pages with strncpy. While this method is more faithful to the original run than touching all the pages from a serial region of code, it does not always reproduce the original mapping exactly.
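The warmup is shaped roughly like the following sketch (hypothetical names, not the actual CERE source): each thread re-touches the pages that the first touch log attributes to it.

    /* Sketch of the replay warmup: thread t read-touches the pages
     * recorded as first touched by t. Names are hypothetical. */
    #include <string.h>
    #include <omp.h>

    #define PAGESIZE 4096

    void warmup(char **page_addr, int *first_touch_owner, int npages) {
        #pragma omp parallel
        {
            int t = omp_get_thread_num();
            char buff[PAGESIZE];
            for (int p = 0; p < npages; p++)
                if (first_touch_owner[p] == t)
                    strncpy(buff, page_addr[p], PAGESIZE);
        }
    }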

We ran the following test to show that the current NUMA mapping is not correct. We focus on the parallel region rhs from SP OMP over 4 NUMA nodes and consider two versions. First, we use a first touch file where all the pages are touched by the same master thread. Second, we unset CERE_FIRST_TOUCH to touch all the pages within a serial region. These two versions should have the same performance; yet the first one is 25% faster.

A solution to address this issue is to use libnuma. In particular, the function numa_move_pages moves a page to a specific NUMA domain; it can also be used to check the actual location of a page.
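For reference, numa_move_pages comes from libnuma (declared in numaif.h, link with -lnuma). A minimal sketch of both uses, assuming a page-aligned address and a hypothetical target node 2:

    /* Sketch: move one page to node 2, then query where it actually is. */
    #include <numaif.h>   /* numa_move_pages, MPOL_MF_MOVE */
    #include <stdio.h>

    void place_and_check(void *page_aligned_addr) {
        void *pages[1] = { page_aligned_addr };
        int nodes[1]   = { 2 };   /* hypothetical target NUMA node */
        int status[1];

        /* pid 0 = current process; works even if the page was already
         * touched, unlike a first-touch warmup. */
        if (numa_move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) < 0)
            perror("numa_move_pages");

        /* With nodes == NULL, status receives the page's current node. */
        numa_move_pages(0, 1, pages, NULL, status, 0);
        printf("page is on node %d\n", status[0]);
    }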

@mihailpopov mihailpopov self-assigned this Feb 13, 2018
@pablooliveira

Thanks for the report! Do you know why the OpenMP region that touches pages is not always exact?

I am not sure I see why numa_move_pages would be more accurate?

@mihailpopov

I think that the issue is how, when, and over which pages we currently call strncpy to perform the first touch.

Using libnuma just provides more information (where a page currently is) and can actually move a page even if it was already touched. So I agree with you: if a page was not touched before, there is no difference between using strncpy and libnuma. I tested libnuma by mapping the pages on the second codelet iteration replay and got the same execution time as when CERE_FIRST_TOUCH was unset.

In the current replay version, a thread touches its own pages with the call:

    // strncpy(dest, src, len);
    strncpy(buff, (char *)(address + read_bytes), PAGESIZE);

where (char *)(address + read_bytes) is the address of the page that the call touches.

At first, I thought that this call was wrong and that the function should instead be called as:

    strncpy((char *)(address + read_bytes), buff, PAGESIZE);

since buff already contains the data from the memdump. However, I wrote a quick checksum program and tested the two versions: the current replay returned the correct value, but the new one did not.

To summarize, I think that fixing this issue with strncpy alone is the best solution, but it requires understanding why the pages were touched differently. On the other hand, libnuma can faithfully assign pages to nodes, but it introduces both an overhead (due to page migration) and a library dependency. So libnuma is the quick fix.

@pablooliveira

Thanks for the feedback :-) I agree with your analysis; if you want to contribute a PR, you are more than welcome!

mihailpopov added a commit that referenced this issue May 9, 2018
Changed the way pages are mapped.
Instead of calling a parallel region, we use move_pages.

Also replaced strncpy with memcpy.
strncpy was not a good choice because we are moving binary data.
Data was not properly initialized during the first codelet iteration replay with NUMA.

Tested the updated replay with BT zsolve
on a system with 4 NUMA nodes.
Times are in cycles.

Execution time with application:
original: 266238260
map all the pages of the region within the original run to node 0: 681839968

Execution time with codelets:
current commit - reproduce original: 249310636
current commit - reproduce all pages on node 0: 660664332

older commit - reproduce original: 201484164
older commit - reproduce all pages on node 0: 202323544

use a single thread to load the data: 661585616

The replay now reproduces the NUMA behavior across the two extreme configurations.

There are two points to fix/discuss:
1) missing pages during capture
2) retrieving at replay the number of threads used during capture
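For the record, a minimal demonstration (not CERE code) of why strncpy corrupts binary data: it stops reading the source at its first NUL byte and zero-fills the rest of the destination, whereas memcpy copies the raw bytes.

    /* strncpy treats the source as a C string; memcpy copies raw bytes. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char src[8] = { 'A', 'B', 0, 'C', 'D', 0, 'E', 'F' };
        char d1[8], d2[8];

        strncpy(d1, src, 8);  /* d1 = "AB" then six NUL bytes: CD is lost */
        memcpy(d2, src, 8);   /* d2 = exact byte-for-byte copy of src */

        printf("strncpy kept CD: %s\n", memcmp(d1 + 3, "CD", 2) ? "no" : "yes");
        printf("memcpy kept CD:  %s\n", memcmp(d2 + 3, "CD", 2) ? "no" : "yes");
        return 0;
    }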
@mihailpopov

Here is an update on this bug.

Currently, CERE's NUMA support has two issues:

  1. Some pages are missing in the First Touch (FT) file
  2. The FT thread reported by CERE is wrong for heap-allocated pages

Missing Pages
First touch pages are currently dumped right before the parallel region that we capture.
So, if a page is touched for the first time inside the parallel region by thread 1, CERE will not consider it in its FT warmup process. This is an issue because CERE touches all the pages at replay: it will map such a page to thread 0 instead of thread 1.

To address this issue, simply do a FT dump at the end of the capture process (or capture invocation 2, as long as the same pages are touched across different invocations).

Wrong first touch page information
Data are allocated in different ways: malloc, stack, fixed addresses in the binary (static)...
For heap-allocated pages, the FT thread reported by CERE is wrong.
Here are the details:

The capture performs two full memory locks: one at the start of the application and a second right before the parallel region. The first lock helps to identify when a page is accessed for the first time by a thread, while the second is used for the dumping process. However, heap-allocated pages are not yet allocated at the start of the application, so the first lock misses them. Therefore, there can be undetected accesses to these pages before the parallel region.
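For context, this lock mechanism can be pictured roughly as follows (a simplified sketch, not the actual tracee.c code): protect the pages with mprotect, then record in the SIGSEGV handler which thread first touches each faulting page before unprotecting it.

    /* Sketch of first touch detection through page protection faults. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define PAGESIZE 4096

    /* Stub for the sketch: the real tracer would append (page, tid) to
     * the FT log, using only async-signal-safe calls. */
    static void record_first_touch(uintptr_t page, pid_t tid) {
        (void)page; (void)tid;
    }

    static void segv_handler(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        uintptr_t page = (uintptr_t)si->si_addr & ~(uintptr_t)(PAGESIZE - 1);
        record_first_touch(page, (pid_t)syscall(SYS_gettid));
        /* Unprotect the page so the faulting access can proceed. */
        mprotect((void *)page, PAGESIZE, PROT_READ | PROT_WRITE);
    }

    /* Lock a range: any later access faults and reveals the toucher. */
    static void lock_range(void *start, size_t len) {
        struct sigaction sa;
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
        mprotect(start, len, PROT_NONE);
    }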

To address this issue, CERE must override all the memory-allocating functions to lock heap-allocated pages. In particular, mtrace (in tracee.c) must be activated. Not all pages allocated by malloc should be locked: /proc/pid/maps identifies which address ranges must be avoided. A single read of /proc/pid/maps at the beginning of the application is enough to detect these ranges. Then, tracer_lock_range must not be called on pages within these ranges.
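One possible shape for that filtering, as a sketch (scan_maps and in_excluded_range are hypothetical helpers, and excluding named mappings such as shared libraries and [stack] is only an example policy):

    /* Sketch: read /proc/self/maps once at startup and remember the
     * address ranges that tracer_lock_range must skip. */
    #include <stdio.h>
    #include <stdint.h>

    #define MAX_EXCLUDED 256

    static struct { uintptr_t start, end; } excluded[MAX_EXCLUDED];
    static int n_excluded;

    static void scan_maps(void) {
        FILE *f = fopen("/proc/self/maps", "r");
        char line[512];
        if (!f)
            return;
        while (fgets(line, sizeof line, f) && n_excluded < MAX_EXCLUDED) {
            unsigned long start, end;
            char path[256] = "";
            /* Line format: start-end perms offset dev inode [pathname] */
            if (sscanf(line, "%lx-%lx %*s %*s %*s %*s %255s",
                       &start, &end, path) < 2)
                continue;
            if (path[0] != '\0') {   /* named mapping: exclude it */
                excluded[n_excluded].start = start;
                excluded[n_excluded].end = end;
                n_excluded++;
            }
        }
        fclose(f);
    }

    /* tracer_lock_range would consult this before locking a page. */
    static int in_excluded_range(uintptr_t addr) {
        for (int i = 0; i < n_excluded; i++)
            if (addr >= excluded[i].start && addr < excluded[i].end)
                return 1;
        return 0;
    }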
