
OpenMP NUMA first touch replay does not exactly reproduce the original behavior #190

Open
mihailpopov opened this issue Feb 13, 2018 · 4 comments

mihailpopov commented Feb 13, 2018

Pages are allocated in NUMA systems with the lazy first touch policy: a page is mapped to the NUMA domain of the thread that first touches it. To ensure faithful codelet replay on NUMA systems, CERE must map the pages as they were in the original run.
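For illustration, here is a minimal OpenMP example of first-touch placement (not CERE code, compile with -fopenmp): under the default Linux policy, each page of the array ends up on the node of the thread that writes it first.

    /* Sketch: parallel first touch distributes the pages of "a" across
     * the NUMA nodes running the threads; a serial initialization loop
     * would instead map every page on the master thread's node. */
    #include <stdlib.h>

    #define N (1 << 24)

    int main(void) {
        double *a = malloc(N * sizeof(double));

        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 0.0;   /* first write: page placed on this thread's node */

        free(a);
        return 0;
    }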

At replay, CERE uses an OpenMP region to touch the previously recorded pages with strncpy. While this method is more faithful to the original run than touching all the pages from a serial region of code, it does not always reproduce the original mapping exactly.
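The warmup is shaped roughly like the following sketch (hypothetical names, not the actual CERE source): each thread re-touches the pages that the first touch log attributes to it.

    /* Sketch of the replay warmup: thread t read-touches the pages
     * recorded as first touched by t. Names are hypothetical. */
    #include <string.h>
    #include <omp.h>

    #define PAGESIZE 4096

    void warmup(char **page_addr, int *first_touch_owner, int npages) {
        #pragma omp parallel
        {
            int t = omp_get_thread_num();
            char buff[PAGESIZE];
            for (int p = 0; p < npages; p++)
                if (first_touch_owner[p] == t)
                    strncpy(buff, page_addr[p], PAGESIZE);
        }
    }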

We ran the following test to show that the current NUMA mapping is not correct. We focus on the parallel region rhs from SP OMP over 4 NUMA nodes and consider two versions. First, we use a first touch file where all the pages are touched by the same master thread. Second, we unset CERE_FIRST_TOUCH to touch all the pages within a serial region. These two versions should have the same performance; yet the first one is 25% faster.

A solution to address this issue is to use libnuma. In particular, the function numa_move_pages moves a page to a specific NUMA domain; it can also be used to check the actual location of a page.
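For reference, numa_move_pages comes from libnuma (declared in numaif.h, link with -lnuma). A minimal sketch of both uses, assuming a page-aligned address and a hypothetical target node 2:

    /* Sketch: move one page to node 2, then query where it actually is. */
    #include <numaif.h>   /* numa_move_pages, MPOL_MF_MOVE */
    #include <stdio.h>

    void place_and_check(void *page_aligned_addr) {
        void *pages[1] = { page_aligned_addr };
        int nodes[1]   = { 2 };   /* hypothetical target NUMA node */
        int status[1];

        /* pid 0 = current process; works even if the page was already
         * touched, unlike a first-touch warmup. */
        if (numa_move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) < 0)
            perror("numa_move_pages");

        /* With nodes == NULL, status receives the page's current node. */
        numa_move_pages(0, 1, pages, NULL, status, 0);
        printf("page is on node %d\n", status[0]);
    }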

@mihailpopov mihailpopov self-assigned this Feb 13, 2018
@pablooliveira

Thanks for the report! Do you know why the OpenMP region that touches pages is not always exact?

I am not sure I see why numa_move_pages would be more accurate?

@mihailpopov

I think that the issue is how, when, and over which pages we currently call strncpy to perform the first touch.

Using libnuma just provides more information (where a page currently is) and can actually move a page even if it was already touched. So I agree with you: if a page was not touched before, there is no difference between using strncpy and libnuma. I tested libnuma by mapping the pages on the second codelet iteration replay and got the same execution time as when CERE_FIRST_TOUCH was unset.

In the current replay version, a thread touches its own pages with the call:

    // strncpy(dest, src, len);
    strncpy(buff, (char *)(address + read_bytes), PAGESIZE);

where (char *)(address + read_bytes) is the address of the page that the call touches.

At first, I thought that this call was wrong and that the function should instead be called as:

    strncpy((char *)(address + read_bytes), buff, PAGESIZE);

since buff already contains the data from the memdump. However, I wrote a quick checksum program and tested the two versions: the current replay returned the correct value, but the new one did not.

To summarize, I think that fixing this issue with strncpy alone is the best solution, but it requires understanding why the pages were touched differently. On the other hand, libnuma can faithfully assign pages to nodes, but it introduces both an overhead (due to page migration) and a library dependency. So libnuma is the quick fix.

@pablooliveira

Thanks for the feedback :-) I agree with your analysis; if you want to contribute a PR, you are more than welcome!

mihailpopov added a commit that referenced this issue May 9, 2018
Changed the way pages are mapped.
Instead of calling a parallel region, we use move_pages.

Also replaced strncpy with memcpy.
strncpy was not a good choice because we are moving binary data.
Data was not properly initialized during the first codelet iteration replay with NUMA.

Tested the updated replay with BT zsolve
on a system with 4 NUMA nodes.
Times are in cycles.

Execution time with application:
original: 266238260
map all the pages of the region within the original run to node 0: 681839968

Execution time with codelets:
current commit - reproduce original: 249310636
current commit - reproduce all pages on node 0: 660664332

older commit - reproduce original: 201484164
older commit - reproduce all pages on node 0: 202323544

use a single thread to load the data: 661585616

The replay now reproduces the NUMA behavior across the two extreme configurations.

There are two points to fix/discuss:
1) missing pages during capture
2) retrieving at replay the number of threads used during capture
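For the record, a minimal demonstration (not CERE code) of why strncpy corrupts binary data: it stops reading the source at its first NUL byte and zero-fills the rest of the destination, whereas memcpy copies the raw bytes.

    /* strncpy treats the source as a C string; memcpy copies raw bytes. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char src[8] = { 'A', 'B', 0, 'C', 'D', 0, 'E', 'F' };
        char d1[8], d2[8];

        strncpy(d1, src, 8);  /* d1 = "AB" then six NUL bytes: CD is lost */
        memcpy(d2, src, 8);   /* d2 = exact byte-for-byte copy of src */

        printf("strncpy kept CD: %s\n", memcmp(d1 + 3, "CD", 2) ? "no" : "yes");
        printf("memcpy kept CD:  %s\n", memcmp(d2 + 3, "CD", 2) ? "no" : "yes");
        return 0;
    }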
@mihailpopov

Here is an update on this bug.

Currently, CERE's NUMA support has two issues:

  1. Some pages are missing in the First Touch (FT) file
  2. The FT thread reported by CERE is wrong for heap-allocated pages

Missing Pages
First touch pages are currently dumped right before the parallel region that we capture.
So, if a page is touched for the first time inside the parallel region by thread 1, CERE will not consider it in its FT warmup process. This is an issue because CERE touches all the pages at replay: it will map such a page to thread 0 instead of thread 1.

To address this issue, simply do a FT dump at the end of the capture process (or capture invocation 2, as long as the same pages are touched across different invocations).

Wrong first touch page information
Data are allocated in different ways: malloc, stack, fixed addresses in the binary (static)...
For heap-allocated pages, the FT thread reported by CERE is wrong.
Here are the details:

The capture performs two full memory locks: one at the start of the application and a second right before the parallel region. The first lock helps to identify when a page is accessed for the first time by a thread, while the second is used for the dumping process. However, heap-allocated pages are not yet allocated at the start of the application, so the first lock misses them. Therefore, there can be undetected accesses to these pages before the parallel region.
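For context, this lock mechanism can be pictured roughly as follows (a simplified sketch, not the actual tracee.c code): protect the pages with mprotect, then record in the SIGSEGV handler which thread first touches each faulting page before unprotecting it.

    /* Sketch of first touch detection through page protection faults. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define PAGESIZE 4096

    /* Stub for the sketch: the real tracer would append (page, tid) to
     * the FT log, using only async-signal-safe calls. */
    static void record_first_touch(uintptr_t page, pid_t tid) {
        (void)page; (void)tid;
    }

    static void segv_handler(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        uintptr_t page = (uintptr_t)si->si_addr & ~(uintptr_t)(PAGESIZE - 1);
        record_first_touch(page, (pid_t)syscall(SYS_gettid));
        /* Unprotect the page so the faulting access can proceed. */
        mprotect((void *)page, PAGESIZE, PROT_READ | PROT_WRITE);
    }

    /* Lock a range: any later access faults and reveals the toucher. */
    static void lock_range(void *start, size_t len) {
        struct sigaction sa;
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
        mprotect(start, len, PROT_NONE);
    }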

To address this issue, CERE must override all the memory-allocating functions to lock heap-allocated pages. In particular, mtrace (in tracee.c) must be activated. Not all pages allocated by malloc should be locked: /proc/pid/maps identifies which address ranges must be avoided. A single read of /proc/pid/maps at the beginning of the application is enough to detect these ranges. Then, tracer_lock_range must not be called on pages within these ranges.
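One possible shape for that filtering, as a sketch (scan_maps and in_excluded_range are hypothetical helpers, and excluding named mappings such as shared libraries and [stack] is only an example policy):

    /* Sketch: read /proc/self/maps once at startup and remember the
     * address ranges that tracer_lock_range must skip. */
    #include <stdio.h>
    #include <stdint.h>

    #define MAX_EXCLUDED 256

    static struct { uintptr_t start, end; } excluded[MAX_EXCLUDED];
    static int n_excluded;

    static void scan_maps(void) {
        FILE *f = fopen("/proc/self/maps", "r");
        char line[512];
        if (!f)
            return;
        while (fgets(line, sizeof line, f) && n_excluded < MAX_EXCLUDED) {
            unsigned long start, end;
            char path[256] = "";
            /* Line format: start-end perms offset dev inode [pathname] */
            if (sscanf(line, "%lx-%lx %*s %*s %*s %*s %255s",
                       &start, &end, path) < 2)
                continue;
            if (path[0] != '\0') {   /* named mapping: exclude it */
                excluded[n_excluded].start = start;
                excluded[n_excluded].end = end;
                n_excluded++;
            }
        }
        fclose(f);
    }

    /* tracer_lock_range would consult this before locking a page. */
    static int in_excluded_range(uintptr_t addr) {
        for (int i = 0; i < n_excluded; i++)
            if (addr >= excluded[i].start && addr < excluded[i].end)
                return 1;
        return 0;
    }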
