This repository has been archived by the owner on Oct 23, 2020. It is now read-only.

Fix thread race condition corrupting output of mpas native timer #1462

Open
wants to merge 1 commit into develop

Conversation

Larofeticus
Contributor

This is about the second issue mentioned by @mark-petersen in #1289

I noticed around 6 months ago that when running with 2+ threads, certain timer phases in the timer report have obviously incorrect results: max/total/average times orders of magnitude larger than the actual run time, or often overflowing the format and printing just ***********'s in the output. I took a quick look, didn't see an obvious problem, and moved on because the timers are just a convenience.

I returned to this problem recently while waiting for my more important PRs to worm through the sausage factory.

It's not a problem with the reduction; the corruption happens when a very large time value is added, and only the first call to start/stop('noun') has the corrupt value. The master thread is always correct; other threads may have the problem. Running 30to10 with 8k ranks, each rank randomly shows the corruption with a probability of ~0.02.

It's fixed by adding a thread barrier before the master-only block in the start-timer code that searches the timer tree for a node and creates a new one if it isn't found. I can't explicitly explain why. It's related to the master thread falling behind the others, for example when two timer calls are separated by only a block of master-only code, which explains why only certain report lines can have the problem.

Non-master threads call wtime to fill start_time(:), and somehow the master thread's setup_timer conditional then overwrites it to zero. Because times are relative to an epoch, 1.5 billion - 0 = 1.5 billion, and that's where the absurd value comes from. I was able to confirm that setup_timer(noun) correctly triggers exactly once for each noun.
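
For what it's worth, here is a small self-contained stand-in for the clobbering pattern. It is illustrative only, not the actual mpas_timer control flow; the program, array, and variable names and the use of omp_get_wtime are my stand-ins. The comment marks the ordering that the added barrier restores.

! A stand-in for the suspected race: master-only setup can overwrite a
! start time already recorded by another thread.
program timer_race_sketch
   use omp_lib
   implicit none
   integer, parameter :: max_threads = 64
   real(kind=8) :: start_time(max_threads)
   real(kind=8) :: now
   integer :: tid, nthreads

   nthreads = 1
   start_time(:) = -1.0d0

   !$omp parallel private(tid)
   tid = omp_get_thread_num() + 1

   !$omp master
   ! Stand-in for the master-only setup_timer path: creating the timer
   ! node zeroes the per-thread start times.
   nthreads = omp_get_num_threads()
   start_time(:) = 0.0d0
   !$omp end master

   ! A master construct has no implied barrier, so without an explicit one
   ! here a non-master thread can stamp its start time below before the
   ! master has run the zeroing above, and the stamp gets clobbered to zero.
   !$omp barrier

   ! Every thread records its start time relative to the clock's origin.
   start_time(tid) = omp_get_wtime()
   !$omp end parallel

   now = omp_get_wtime()
   do tid = 1, nthreads
      ! A clobbered (zero) start time makes the elapsed time look like the
      ! clock's full value instead of a fraction of a second; with an
      ! epoch-based clock that is the ~1.5 billion seen in the report.
      print '(a,i0,a,f20.6)', 'thread ', tid, ' elapsed: ', now - start_time(tid)
   end do
end program timer_race_sketch

With the explicit barrier removed, the clobbering can occur intermittently, which is consistent with the low per-rank probability seen at scale.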

This problem itself is still only a matter of convenience, but it suggests there could be similar threading problems lurking in other systems that are just as difficult to analyze but come without the obvious ******* warning flags. A cell field incorrectly being initialized to 0 when it should be 0.048 at the first time step is not going to be obvious when looking at the outcome.

Another concern: just as surrounding code affects this race in the timers, the thread barriers inside timer calls could be necessary for correct behavior in the surrounding code. Running with timers turned off could expose new race conditions.

This is only a bug fix. It will be a tiny bit slower because there is one more thread barrier per start_timer call.

@mark-petersen
Contributor

@Larofeticus Thanks! It is really great to nab these pesky problems, and this one has been long-standing. You have two commits, but only need the second one. Please do this on your local branch:

git fetch mpas
git reset --hard mpas/develop
git cherry-pick ca205c9
git push Larofeticus timer_race_fix --force

Here, mpas is the alias for the MPAS-Dev remote, and Larofeticus is the alias for your remote fork. You probably have different alias names.

@Larofeticus
Contributor Author

I did something...? Maybe that was it?

I've entered a weird new stage of Git ignorance where I know the nouns, I know what my state is now, and I know exactly what I want to change it to...

but I have no idea which black-magic commands actually make exactly and only those changes, so I end up with something that's close but with side effects smeared all over.

@mark-petersen
Contributor

I did a reset and cherry-pick to clean up the commit, as described above at #1462 (comment)

@mark-petersen
Contributor

@Larofeticus could you verify that this still occurs on edison? I just tested on cori-knl and cori-haswell with two threads, and the timer output all appears OK. You can test the head of ocean/develop (commit fe6f219) with this file:

/global/cscratch1/sd/mpeterse/runs/tar_files/c71j_171212_QU240_performance_MPASO_hash_fe6f219e8.tar
then:
cd c71j/ocean/global_ocean/QU240/performance_test/forward/

link your executable, and run.

@Larofeticus
Contributor Author

I just ran a 2-thread job on edison with fe6f219 using the input from that tar, and the corrupted timer data is present.

@mark-petersen
Contributor

Perfect. Can you add this thread barrier to that last test and retest, just to be pedantic?

@Larofeticus
Contributor Author

Added the barrier; the problem is not present.

Don't forget this was a randomized race condition, so a single run doesn't guarantee correctness, but yeah, I agree it's probably fine considering all the tests from the other platforms too.

@mark-petersen
Contributor

I did some before/after tests on two threads on my local cluster and did not see any effect on overall time. @Larofeticus how careful do you think we need to be about the possibility of this slowing down the code?

There are timers in essentially every subroutine, but they are required to be outside loops over cells or edges, so probably no harm done. Could you test for slowdown with this EC60to30 file on cori-knl, using multiple threads and multiple nodes?

/global/cscratch1/sd/mpeterse/runs/tar_files/c71k_171212_EC60to30_MPASO_hash_fe6f219e8.tar

Thanks. I'm just hesitant to put in an explicit thread barrier that is called in every subroutine without performance testing.

@Larofeticus
Contributor Author

The timers need this barrier to be correct, though. Logically, if these extra calls were a performance priority, there would already be a mechanism to disable them.

This is probably not measurable above the noise without scaling well beyond the absurd, something like 20 cells per rank.
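
If a number would help, the barrier cost can be bounded outside of MPAS with a few lines. This is a sketch (assuming OpenMP and omp_get_wtime; not part of this PR, and the program name and loop count are arbitrary) that times a loop of bare barriers:

! Rough micro-benchmark for the cost of one extra OpenMP barrier per
! start_timer call; the result will vary by machine and thread count.
program barrier_cost
   use omp_lib
   implicit none
   integer, parameter :: niter = 100000
   integer :: i
   real(kind=8) :: t0, t1

   t0 = omp_get_wtime()
   !$omp parallel private(i)
   do i = 1, niter
      ! Every thread executes the same number of barriers, which is what
      ! the added barrier amounts to per start_timer call.
      !$omp barrier
   end do
   !$omp end parallel
   t1 = omp_get_wtime()

   print '(a,es12.4,a)', 'cost per barrier: ', (t1 - t0) / niter, ' s'
end program barrier_cost

Comparing that per-call cost against the cost of a single time step, with one such barrier per timed subroutine call, bounds the worst-case slowdown.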
