This repository has been archived by the owner on Oct 23, 2020. It is now read-only.

Fix thread race condition corrupting output of mpas native timer #1462

Open
wants to merge 1 commit into develop

Conversation

Larofeticus
Contributor

This is about the second issue mentioned by @mark-petersen in #1289

I noticed around 6 months ago that when running with 2+ threads, certain timer phases in the timer report have obviously incorrect results: max/total/average times orders of magnitude larger than the actual run time, or often overflowing the format and printing just ***********'s in the output. I took a quick look, didn't see an obvious problem, and moved on because the timers are just a convenience.

I returned to this problem recently while waiting for my more important PRs to worm through the sausage factory.

It's not a problem with the reduction; the corruption happens when a very large time value is added, and only the first call to start/stop('noun') has the corrupt value. The master thread is always correct; other threads may have the problem. Running 30to10 with 8k ranks, each rank randomly shows the corruption with a probability of ~0.02.

It's fixed by adding a thread barrier before the master-only block in the start-timer code that searches the timer tree for a node and creates a new one if it isn't found. I can't explicitly explain why. It's related to the master thread falling behind the others, for example when two timer calls are separated by only a block of master-only code, which explains why only certain report lines can have the problem.

Non-master threads call wtime to fill start_time(:), and somehow the master thread's setup_timer conditional then overwrites it to zero. Because times are relative to an epoch, 1.5 billion - 0 = 1.5 billion, and that's where the absurd value comes from. I was able to confirm that setup_timer(noun) correctly triggers exactly once for each noun.
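
For what it's worth, here is a small self-contained stand-in for the clobbering pattern. It is illustrative only, not the actual mpas_timer control flow; the program, array, and variable names and the use of omp_get_wtime are my stand-ins. The comment marks the ordering that the added barrier restores.

! A stand-in for the suspected race: master-only setup can overwrite a
! start time already recorded by another thread.
program timer_race_sketch
   use omp_lib
   implicit none
   integer, parameter :: max_threads = 64
   real(kind=8) :: start_time(max_threads)
   real(kind=8) :: now
   integer :: tid, nthreads

   nthreads = 1
   start_time(:) = -1.0d0

   !$omp parallel private(tid)
   tid = omp_get_thread_num() + 1

   !$omp master
   ! Stand-in for the master-only setup_timer path: creating the timer
   ! node zeroes the per-thread start times.
   nthreads = omp_get_num_threads()
   start_time(:) = 0.0d0
   !$omp end master

   ! A master construct has no implied barrier, so without an explicit one
   ! here a non-master thread can stamp its start time below before the
   ! master has run the zeroing above, and the stamp gets clobbered to zero.
   !$omp barrier

   ! Every thread records its start time relative to the clock's origin.
   start_time(tid) = omp_get_wtime()
   !$omp end parallel

   now = omp_get_wtime()
   do tid = 1, nthreads
      ! A clobbered (zero) start time makes the elapsed time look like the
      ! clock's full value instead of a fraction of a second; with an
      ! epoch-based clock that is the ~1.5 billion seen in the report.
      print '(a,i0,a,f20.6)', 'thread ', tid, ' elapsed: ', now - start_time(tid)
   end do
end program timer_race_sketch

With the explicit barrier removed, the clobbering can occur intermittently, which is consistent with the low per-rank probability seen at scale.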

This problem itself is still only a matter of convenience, but it suggests there could be similar threading problems lurking in other systems that are just as difficult to analyze but come without the obvious ******* warning flags. A cell field incorrectly being initialized to 0 when it should be 0.048 at the first time step is not going to be obvious when looking at the outcome.

Another concern: just as surrounding code affects this race in the timers, the thread barriers inside timer calls could be necessary for correct behavior in the surrounding code. Running with timers turned off could expose new race conditions.

This is only a bug fix. It will be a tiny bit slower because there is one more thread barrier per start_timer call.

@mark-petersen
Contributor

@Larofeticus Thanks! It is really great to nab these pesky problems, and this one has been long-standing. You have two commits, but only need the second one. Please do this on your local branch:

git fetch mpas
git reset --hard mpas/develop
git cherry-pick ca205c9
git push Larofeticus timer_race_fix --force

Here, mpas is the alias for the MPAS-Dev remote, and Larofeticus is the alias for your remote fork. You probably have different alias names.

@Larofeticus
Contributor Author

I did something...? Maybe that was it?

I've entered a weird new stage of Git ignorance where I know the nouns, I know what my state is now, and I know exactly what I want to change it to...

but I have no idea which black-magic commands actually make exactly and only those changes, so I end up with something that's close but with side effects smeared all over.

@mark-petersen
Contributor

I did a reset and cherry-pick to clean up the commit, as described above at #1462 (comment)

@mark-petersen
Contributor

@Larofeticus could you verify that this still occurs on edison? I just tested on cori-knl and cori-haswell with two threads, and the timer output all appears OK. You can test the head of ocean/develop (commit fe6f219) with this file:

/global/cscratch1/sd/mpeterse/runs/tar_files/c71j_171212_QU240_performance_MPASO_hash_fe6f219e8.tar
then:
cd c71j/ocean/global_ocean/QU240/performance_test/forward/

link your executable, and run.

@Larofeticus
Contributor Author

I just ran a 2-thread job on edison with fe6f219 using the input from that tar, and the corrupted timer data is present.

@mark-petersen
Contributor

Perfect. Can you add this thread barrier to that last test and retest, just to be pedantic?

@Larofeticus
Contributor Author

Added the barrier; the problem is not present.

Don't forget this was a randomized race condition, so a single run doesn't guarantee correctness, but yeah, I agree it's probably fine considering all the tests from the other platforms too.

@mark-petersen
Contributor

I did some before/after tests on two threads on my local cluster and did not see any effect on overall time. @Larofeticus how careful do you think we need to be about the possibility of this slowing down the code?

There are timers in essentially every subroutine, but they are required to be outside loops over cells or edges, so probably no harm done. Could you test for slowdown with this EC60to30 file on cori-knl, using multiple threads and multiple nodes?

/global/cscratch1/sd/mpeterse/runs/tar_files/c71k_171212_EC60to30_MPASO_hash_fe6f219e8.tar

Thanks. I'm just hesitant to put in an explicit thread barrier that is called in every subroutine without performance testing.

@Larofeticus
Contributor Author

The timers need this barrier to be correct, though. Logically, if these extra calls were a performance priority, there would already be a mechanism to disable them.

This is probably not measurable above the noise without scaling well beyond the absurd, something like 20 cells per rank.
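
If a number would help, the barrier cost can be bounded outside of MPAS with a few lines. This is a sketch (assuming OpenMP and omp_get_wtime; not part of this PR, and the program name and loop count are arbitrary) that times a loop of bare barriers:

! Rough micro-benchmark for the cost of one extra OpenMP barrier per
! start_timer call; the result will vary by machine and thread count.
program barrier_cost
   use omp_lib
   implicit none
   integer, parameter :: niter = 100000
   integer :: i
   real(kind=8) :: t0, t1

   t0 = omp_get_wtime()
   !$omp parallel private(i)
   do i = 1, niter
      ! Every thread executes the same number of barriers, which is what
      ! the added barrier amounts to per start_timer call.
      !$omp barrier
   end do
   !$omp end parallel
   t1 = omp_get_wtime()

   print '(a,es12.4,a)', 'cost per barrier: ', (t1 - t0) / niter, ' s'
end program barrier_cost

Comparing that per-call cost against the cost of a single time step, with one such barrier per timed subroutine call, bounds the worst-case slowdown.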
