coarray "get": I got a SIGSEGV, but I do not understand why... #6

szaghi · 2016-11-30T17:44:17Z

Dear All,

I am sorry for bothering you...

In the just-uploaded feature/add-coarray-buckets branch I obtain a SIGSEGV that I am not able to debug... any hints are much more than welcome. In the following there is a full report.

The test

The test is minimal

program hasty_test_caf_get_clone
use, intrinsic :: iso_fortran_env, only : int32, int64, error_unit
use hasty

type(hash_table)      :: a_table       !< A table.
class(*), allocatable :: a_new_content !< A content.

call a_table%initialize(buckets_number=4, use_prime=.true.)

#ifdef CAF

call a_table%add_clone(key=3_int32, content=int(this_image(), int32))

critical
call a_table%get_clone(key=3_int32, content=a_new_content)
end critical
sync all

#endif
endprogram hasty_test_caf_get_clone

The get_clone method is

  subroutine get_clone(self, key, content)
  class(hash_table),     intent(in)  :: self    !< The hash table.
  class(*),              intent(in)  :: key     !< The key.
  class(*), allocatable, intent(out) :: content !< Content of the queried node.
  integer(I4P)                       :: b       !< Bucket index.
  integer(I4P)                       :: i       !< Image index.
  
  if (self%is_initialized_) then
    call self%get_bucket_image_indexes(key=key, bucket=b, image=i)
    if (b>0) then
#ifdef CAF
      call dictionary_get_clone(self%bucket(b)[i], key=key, content=content)
#else
      call self%bucket(b)%get_clone(key=key, content=content)
#endif
    endif
  endif
  endsubroutine get_clone

The statement call dictionary_get_clone(self%bucket(b)[i], key=key, content=content) is where all evil starts.

Results using OpenCoarrays/GNU gfortran

The call to get_clone raises a SIGSEGV if the number of images is greater than 1

stefano@zaghi(06:24 PM Wed Nov 30) on feature/add-coarray-buckets
~/fortran/HASTY 14 files, 356Kb
→ cafrun -np 2 ./exe/hasty_test_caf_get_clone

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f6966ba10af in ???
#1  0x40eae9 in __hasty_dictionary_node_MOD_has_key
        at src/lib/hasty_dictionary_node.f90:97
#2  0x41095f in key_iterator_search
        at src/lib/hasty_dictionary.f90:328
#3  0x40fa7e in __hasty_dictionary_MOD_traverse_iterator
        at src/lib/hasty_dictionary.f90:521
#4  0x40fe5f in __hasty_dictionary_MOD_node
        at src/lib/hasty_dictionary.f90:315
#5  0x410772 in __hasty_dictionary_MOD_get_pointer
        at src/lib/hasty_dictionary.f90:223
#6  0x40fc98 in __hasty_dictionary_MOD_get_clone
        at src/lib/hasty_dictionary.f90:206
#7  0x41362f in __hasty_hash_table_MOD_get_clone
        at src/lib/hasty_hash_table.f90:236
#8  0x414ca7 in hasty_test_caf_get_clone
        at src/tests/hasty_test_caf_get_clone.F90:23
#9  0x414d4d in main
        at src/tests/hasty_test_caf_get_clone.F90:7

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 9044 RUNNING AT zaghi
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Valgrind inspection

stefano@zaghi(06:24 PM Wed Nov 30) on feature/add-coarray-buckets
~/fortran/HASTY 14 files, 356Kb
→ valgrind --leak-check=yes cafrun -np 2 ./exe/hasty_test_caf_get_clone
==9448== Memcheck, a memory error detector
==9448== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==9448== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==9448== Command: /opt/arch/opencoarrays/build/bin/cafrun -np 2 ./exe/hasty_test_caf_get_clone
==9448==

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f7e933dd0af in ???
#1  0x40eae9 in __hasty_dictionary_node_MOD_has_key
        at src/lib/hasty_dictionary_node.f90:97
#2  0x41095f in key_iterator_search
        at src/lib/hasty_dictionary.f90:328
#3  0x40fa7e in __hasty_dictionary_MOD_traverse_iterator
        at src/lib/hasty_dictionary.f90:521
#4  0x40fe5f in __hasty_dictionary_MOD_node
        at src/lib/hasty_dictionary.f90:315
#5  0x410772 in __hasty_dictionary_MOD_get_pointer
        at src/lib/hasty_dictionary.f90:223
#6  0x40fc98 in __hasty_dictionary_MOD_get_clone
        at src/lib/hasty_dictionary.f90:206
#7  0x41362f in __hasty_hash_table_MOD_get_clone
        at src/lib/hasty_hash_table.f90:236
#8  0x414ca7 in hasty_test_caf_get_clone
        at src/tests/hasty_test_caf_get_clone.F90:23
#9  0x414d4d in main
        at src/tests/hasty_test_caf_get_clone.F90:7

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 9452 RUNNING AT zaghi
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
==9448==
==9448== HEAP SUMMARY:
==9448==     in use at exit: 101,606 bytes in 1,426 blocks
==9448==   total heap usage: 4,560 allocs, 3,134 frees, 257,768 bytes allocated
==9448==
==9448== 12 bytes in 1 blocks are definitely lost in loss record 94 of 409
==9448==    at 0x4C2AB8D: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==9448==    by 0x47219D: xmalloc (in /usr/bin/bash)
==9448==    by 0x46BDD8: set_default_locale (in /usr/bin/bash)
==9448==    by 0x41A048: main (in /usr/bin/bash)
==9448==
==9448== LEAK SUMMARY:
==9448==    definitely lost: 12 bytes in 1 blocks
==9448==    indirectly lost: 0 bytes in 0 blocks
==9448==      possibly lost: 0 bytes in 0 blocks
==9448==    still reachable: 101,594 bytes in 1,425 blocks
==9448==         suppressed: 0 bytes in 0 blocks
==9448== Reachable blocks (those to which a pointer was found) are not shown.
==9448== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==9448==
==9448== For counts of detected and suppressed errors, rerun with: -v
==9448== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

There are memory leaks, but I cannot understand why

Digging deeper

I think that the final memory leak happens when I try to check if a node has a key here

  elemental logical function has_key(self)
  !< Return .true. if the node has a key (or id) set-up.
  class(dictionary_node), intent(in) :: self !< The node.

  has_key = allocated(self%key)
  endfunction has_key

Note that self is not defined as pointer, but when I invoke has_key as method it is likely a pointer into a list. Moreover, before calling has_key on a pointer-node I check if the node is associated, see here

  subroutine traverse_iterator(self, iterator)
  !< Traverse dictionary from head to tail calling the iterator procedure.
  class(dictionary), intent(in)  :: self     !< The dictionary.
  procedure(iterator_interface)  :: iterator !< The iterator procedure to call for each node.
  type(dictionary_node), pointer :: p        !< Pointer to scan the dictionary.
  logical                        :: done     !< Flag to set to true to stop traversing.

  done = .false.
  p => self%head
  do
    if (associated(p)) then
      call iterator(node=p, done=done)
      if (done) exit
      p => p%next
    else
      exit
    endif
  enddo
  endsubroutine traverse_iterator

The call iterator... statement is where I actually pass the has_key iterator check on pointer-node `p'.

@LadaF @jeffhammond @MichaelSiehl @zbeekman @rouson do you have any suggestions? (I do not mean to force you to read all of this; I am just asking what you do in such situations.)

In such situations I generally try other compilers, but as you know for this project I have to stick with GNU gfortran (OpenCoarrays).

O.T. @rouson @MichaelSiehl @zbeekman I am failing to force a sync all for debugging output: even when echoing on the standard error unit and disabling all IO buffering in my shell, the write(error_unit...) statements of my tests seem to be unaffected by sync all. Is sync all really like an MPI barrier, or am I misunderstanding (a lot)?

zbeekman · 2016-12-01T04:48:40Z

O.T. @rouson @MichaelSiehl @zbeekman I am failing to force a sync all for debugging output: even when echoing on the standard error unit and disabling all IO buffering in my shell, the write(error_unit...) statements of my tests seem to be unaffected by sync all. Is sync all really like an MPI barrier, or am I misunderstanding (a lot)?

Can you provide an example? I'm not sure I understand what is happening. I know that the compiler can do some IO buffering and there was potentially a problem shutting down the Fortran runtime library gracefully, so it's possible that you may need a sync all before and after your debugging print/write statements to ensure that they make it out onto the screen. If this is the case, and you have a chance to make a small reproducer, that would be great. We'll open an OpenCoarrays bug report and include the reproducer in future regression/unit tests.

As for the SIGSEGV error, how are you installing OpenCoarrays? I'm guessing that 0x7f7e933dd0af in ??? is either in the Fortran runtime library or libcaf_mpi and we can reinstall OpenCoarrays with backtraces and debug symbols turned on to get more info.

I haven't looked through the code in detail yet, but if it's not a bug in OpenCoarrays, then it's possible that it's a parallel programming bug. For example, if you have a put, you must ensure that there is an image control statement separating any get of that memory location so that the execution statements are ordered w.r.t. each other.
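To make the ordering rule above concrete, here is a minimal sketch, not taken from HASTY. The program name and the variables `payload`, `neighbour` and `fetched` are made up for illustration. Each image puts its index on a neighbouring image and reads it back only after a `sync all`, so the put and the get fall in ordered segments.

```fortran
program put_then_get_sketch
  implicit none
  integer :: payload[*]   ! one integer per image (hypothetical name)
  integer :: neighbour    ! image this one writes to
  integer :: fetched      ! value read back from the neighbour

  payload = -1
  sync all                ! every image has initialised its own payload

  ! Put: write this image's index into the next image (wrapping around).
  neighbour = merge(1, this_image() + 1, this_image() == num_images())
  payload[neighbour] = this_image()

  sync all                ! image control statement: all puts precede all gets

  ! Get: read back the value that was just written on the neighbour.
  fetched = payload[neighbour]
  if (fetched /= this_image()) error stop 'get saw stale data'
  print *, 'image ', this_image(), ' fetched ', fetched, ' from image ', neighbour
end program put_then_get_sketch
```

With gfortran this should build with `-fcoarray=single` for a one-image check, or with `-fcoarray=lib` plus OpenCoarrays for several images. Removing the second `sync all` would leave the put and the get unordered, which is the kind of bug described above.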

szaghi · 2016-12-01T05:13:49Z

@zbeekman Thank you very much for your help, it is very appreciated!

Can you provide an example? I'm not sure I understand what is happening. I know that the compiler can do some IO buffering and there was potentially a problem shutting down the Fortran runtime library gracefully, so it's possible that you may need a sync all before and after your debugging print/write statements to ensure that they make it out onto the screen. If this is the case, and you have a chance to make a small reproducer, that would be great. We'll open an OpenCoarrays bug report and include the reproducer in future regression/unit tests.

Sure, there is already a test, but it is not small/clean; today I'll clean it up for you. Essentially, I was surprised that

use iso_fortran_env

sync all
write(error_unit, *) ' I am image: ', this_image()
sync all

! some stuff

sync all
write(error_unit, *) ' Hello world from image: ', this_image()
sync all

will generate an output like

stdbuf -i0 -o0 -e0 caf_hello
I am image: 3
I am image: 2
Hello world from image: 2
I am image: 1
...

namely mixing Hello world... with I am image.... From what I have gathered, the order of execution within each phase is unpredictable, but the two kinds of printing should be kept separate by the sync all. I used the standard error unit and tried to unbuffer my shell to be sure that buffering is not happening, but I am not sure I succeeded. Note that this happens for a test that does not fail, thus there should be no error exit. I am almost sure that this is not a bug in OpenCoarrays but rather a false expectation of mine. Is it possible that buffering also happens for the standard error unit?

As for the SIGSEGV error, how are you installing OpenCoarrays? I'm guessing that 0x7f7e933dd0af in ??? is either in the Fortran runtime library or libcaf_mpi and we can reinstall OpenCoarrays with backtraces and debug symbols turned on to get more info.

Argggggh, Gandalf always has the right answer! I have to check, but it is quite probable that I have installed the release version without the debugging symbols... As soon as I arrive at the office I will try to install OpenCoarrays with debugging symbols (alongside the release one 😄 )

I haven't looked through the code in detail yet, but if it's not a bug in OpenCoarrays, then it's possible that it's a parallel programming bug.

This is likely the case: this is the first CAF program (of some minimal complexity) that I am trying to write. It is probably a parallel programming bug.

For example, if you have a put, you must ensure that there is an image control statement separating any get of that memory location so that the execution statements are ordered w.r.t. each other.

The current test is littered with sync all statements, so the get should happen after the put/sync all. Anyhow, I'll check this. I have lost the meaning of w.r.t.; what does it mean?

Thank you very very much!

zbeekman · 2016-12-01T05:26:36Z

Argggggh, Gandalf always has the right answer! I have to check, but it is quite probable that I have installed the release version without the debugging symbols... As soon as I arrive at the office I will try to install OpenCoarrays with debugging symbols (alongside the release one 😄 )

I'm about to go to bed, but it may not be obvious how to get traceback and debugging symbols activated. I thought traceback was on by default, but I need to double-check. I can guide you with CMake if you get stuck.

I used the standard error unit and tried to unbuffer my shell to be sure that buffering is not happening, but I am not sure I succeeded. Note that this happens for a test that does not fail, thus there should be no error exit. I am almost sure that this is not a bug in OpenCoarrays but rather a false expectation of mine. Is it possible that buffering also happens for the standard error unit?

This may in fact be a gfortran/OpenCoarrays bug... I need to take a look at the standard.... I'm guessing GFortran is buffering IO to stderr and stdout, so sync all may not be flushing these... I'll try a small reproducer on my systems and see what happens.
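Until that is clarified, one way to get debugging output in a predictable order without relying on how each image's stderr is buffered or merged by the launcher is to let a single image do all the writing. The following is only a sketch under that assumption. The names `message`, `line` and `print_in_image_order` are hypothetical, and it does not explain the interleaving observed above.

```fortran
program ordered_debug_output_sketch
  use, intrinsic :: iso_fortran_env, only : error_unit
  implicit none
  character(len=64) :: message[*]  ! per-image message buffer (hypothetical name)
  character(len=64) :: line        ! local copy of a remote message
  integer :: i

  write(message, '(a,i0)') 'I am image: ', this_image()
  sync all                         ! all messages are defined before image 1 reads them
  call print_in_image_order()

  ! ... some stuff ...

  write(message, '(a,i0)') 'Hello world from image: ', this_image()
  sync all
  call print_in_image_order()

contains

  subroutine print_in_image_order()
    ! Image 1 fetches every message and writes them itself, in image order.
    if (this_image() == 1) then
      do i = 1, num_images()
        line = message[i]
        write(error_unit, '(a)') trim(line)
      end do
      flush(error_unit)
    end if
    sync all                       ! nobody overwrites message before it has been printed
  end subroutine print_in_image_order
end program ordered_debug_output_sketch
```

Since only image 1 writes, the two groups of lines cannot be mixed by the launcher; the price is one extra get per image and message.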

This is likely the case: this is the first CAF program (of some minimal complexity) that I am trying to write. It is probably a parallel programming bug.

I wouldn't be so sure... OpenCoarrays needs more users to torture test it.... I'll give it even odds that it's an OpenCoarrays/GFortran bug.

I lost the meaning of w.r.t., what does it mean?

Sorry... it means With Respect To.

rouson · 2016-12-01T05:39:28Z

On Nov 30, 2016, at 9:26 PM, Izaak Beekman ***@***.***> wrote: I wouldn't be so sure... OpenCoarrays needs more users to torture test it.... I'll give it even odds that it's a OpenCoarrays/GFortran bug.

Oh ye of little faith! I haven’t read the details of this thread but the above quote jumped out at me. :) I’ve been teaching a class for 10 weeks in which we use CAF in nearly every example in lecture and nearly every problem in homework assignments, and I only rarely encounter bugs. I will say, however, that CRITICAL is not a feature I’ve used because of its negative performance implications and I’ve only used unlimited polymorphism rarely because it just feels too limited to be of great value so this code is mixing two features that I’d imagine are quite rarely mixed. Stefano, as you know, if you’d like to book some time to explain to me what you’re doing and especially the motivations for using CRITICAL, I’d be glad to offer any insights that come to mind. I can’t help but wonder if there’s a better design that would eliminate the need for CRITICAL. I think of that as a last resort only to be used when absolutely necessary — kind of like locks and atomics, which have better alternatives in Fortran 2015 (events). Damian

jeffhammond · 2016-12-01T08:38:46Z

Can you make a MCVE? I cannot reproduce because use hasty fails.

In particular, I want to be able to git clone $OMETHING && make && cafrun $PROGRAM.

szaghi · 2016-12-01T08:42:14Z

@rouson

Dear Damian,

Oh ye of little faith! I haven’t read the details of this thread but the above quote jumped out at me. :) I’ve been teaching a class for 10 weeks in which we use CAF in nearly every example in lecture and nearly every problem in homework assignments, and I only rarely encounter bugs.

This is also my feeling: the issue is more related to my poor-fortranish than to a possible (improbable) OC bug.

I will say, however, that CRITICAL is not a feature I’ve used because of its negative performance implications and I’ve only used unlimited polymorphism rarely because it just feels too limited to be of great value so this code is mixing two features that I’d imagine are quite rarely mixed.

Arggggghhhhh, please do not consider exactly that test... the one uploaded is just the last meaningless modification of the baseline test; the addition of critical was only a desperate attempt, and it is not necessary in any sense. Today, I'll try to prepare a real (clean) test for you. The litter of critical and sync all is not necessary; I added them only to see if they made any difference. However, I think you cannot really help me debug the test, because the test is strictly tied to the design of the hash table: making it minimal is difficult, thus I cannot ask for your help on the real code. Rather, my request for help is more about how to debug such an issue, and our Gandalf has already given me a good answer... I am a donkey that is trying to debug a code without activating the debugging symbols, my bad!

Stefano, as you know, if you’d like to book some time to explain to me what you’re doing and especially the motivations for using CRITICAL, I’d be glad to offer any insights that come to mind. I can’t help but wonder if there’s a better design that would eliminate the need for CRITICAL. I think of that as a last resort only to be used when absolutely necessary — kind of like locks and atomics, which have better alternatives in Fortran 2015 (events).

You are very very kind, but as we experienced last time, my spoken English is very bad, and I do not like to waste your time on a not-so-important talk. Anyhow, soon I'll probably bother you for some CAF teaching (as an aside, I invited @afanfa to my Institute for a lecture on CAF, hoping that he will have the patience to talk with oompa loompa 😄 )

Cheers.

szaghi · 2016-12-01T08:47:00Z

@jeffhammond

I'll try today to dump a Minimal, Complete and Verifiable Example, but I feel that the problem is really in use hasty; I mean I am almost sure that, as @zbeekman said, I have probably created a parallel-programming monster in designing the hash table. Moreover, the unlimited polymorphic pointer/allocatable adds more uncertainty. I am not sure I'll be able to create an MCVE, but I'll try.

Thank you very very much for your help!

P.S. do you know if the standard says anything about buffering on the standard error unit?

szaghi · 2016-12-01T08:48:31Z

@jeffhammond

In particular, I want to be able to git clone $OMETHING && make && cafrun $PROGRAM.

Yep, I'll add such bias soon, sorry for the bothering.

jeffhammond · 2016-12-01T09:01:41Z

@szaghi Thanks. I was hoping to try Cray and Intel implementations to see if this issue is GCC-specific or not.

szaghi · 2016-12-01T12:50:20Z

@zbeekman @rouson @jeffhammond

Dear All,

I have made a small step forward. I failed to build OpenCoarrays in debug mode: all the debug options I found in the build/download scripts seem to refer to debugging the build/download scripts themselves and not to enabling the debug flags for building OpenCoarrays. However, this is not so important for the moment (although in the near future I would like to have OC with all debug symbols activated), because I think I have found my error.
First, a preliminary: some days ago I found that the following call generates a GNU ICE:

call self%bucket(b)[i]%get_clone(key=key, content=content)

Note that the bucket index b and the image index i are correctly computed. Compiling this statement with GNU/OpenCoarrays, I obtain the following ICE

src/lib/hasty_hash_table.f90:237:0:

       call self%bucket(b)[i]%get_clone(key=key, content=content)

internal compiler error: in gfc_get_tree_for_caf_expr, at fortran/trans-expr.c:1818
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://bugs.archlinux.org/> for instructions.

Before asking for help about this ICE, I tried to circumvent it with the following workaround

call dictionary_get_clone(self=self%bucket(b)[i], key=key, content=content)

where dictionary_get_clone is the (publicly exposed) get_clone method of the dictionary type. This non-TBP version compiles correctly. However, I think this is (probably) where my error lies.

What does it mean to pass the remote copy bucket(b)[i] as an actual argument to a local procedure? I really do not know if this is allowed and what it implies. Anyhow, if it is allowed, I think that a (temporary) copy of the remote data must be made and, if so, the copy could be a mess... because it is a derived type containing pointers! Consider the following snippet

type :: mess
  integer :: val=0
  type(mess), pointer :: next=>null()
end type mess
...
type(mess), pointer :: foo=>null()
type(mess), pointer :: foo1=>null()
type(mess) :: foo2

allocate(foo)
foo%val = 0

allocate(foo1)
foo1%val = 1
foo1%next=> foo

foo2 = foo1

After the last copy foo2 = foo1 I think that foo2%next is not pointing to foo, is this right? I will check this now while trying to put together an MCVE for Jeff...
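For the purely local case (one image, no coarrays), the question can be checked with a small stand-alone program. The sketch below assumes `next` is meant to be a pointer component; the program name and the `shares_target` variable are made up. Intrinsic assignment of a derived type copies non-pointer components and pointer-assigns pointer components, so on a single image the copy is shallow and `foo2%next` should still be associated with `foo`. This says nothing about what happens when the object is copied from another image, where the pointer component refers to memory of a different process.

```fortran
program shallow_copy_check
  implicit none

  type :: mess
    integer :: val = 0
    type(mess), pointer :: next => null()
  end type mess

  type(mess), pointer :: foo  => null()
  type(mess), pointer :: foo1 => null()
  type(mess)          :: foo2
  logical             :: shares_target

  allocate(foo)
  foo%val = 0

  allocate(foo1)
  foo1%val = 1
  foo1%next => foo

  ! Intrinsic assignment: val is copied, next is pointer-assigned (shallow copy).
  foo2 = foo1

  ! Expected to be .true. locally: both pointers refer to the same target.
  shares_target = associated(foo2%next, foo)
  print *, 'foo2%val = ', foo2%val
  print *, 'foo2%next associated with foo: ', shares_target

  deallocate(foo1)
  deallocate(foo)   ! foo2%next is dangling from here on and must not be used
end program shallow_copy_check
```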

szaghi · 2016-12-02T16:55:51Z

@jeffhammond

Dear Jeff, I added a makefile to build the test I am now playing with (only in the branch). However, do not waste your time with it now: there is a major problem in the algorithm I am using to distribute the workload among the CAF images that must be resolved before trying to fix this SIGSEGV bug... I simply want to say that I have not forgotten to create the MCVE for you, but I need more time.

Cheers.

rouson · 2016-12-02T22:27:20Z

On Dec 1, 2016, at 12:42 AM, Stefano Zaghi ***@***.***> wrote: You are very very kind, but as we experienced that last time my spoken English is very bad, I do not like to waste your time for a not-so-important talk.

I have no recollection of difficulty understanding your English so please don’t let that stop you if I can be of any assistance.

(aside I invited @afanfa <https://github.com/afanfa> at my Institute for a lecture on CAF hoping that he will have the patience to talk with oompa loompa 😄 )

That’s great news! I hope it works out for him to visit. Damian

rouson · 2016-12-02T22:48:20Z

On Dec 1, 2016, at 4:50 AM, Stefano Zaghi ***@***.***> wrote: call self%bucket(b)[i]%get_clone(key=key, content=content)

@szaghi As I’m sure you know an internal compiler error is always a compiler bug so you might submit this via the GCC Bugzilla <https://gcc.gnu.org/bugzilla/> site. If you do, then you should also email the bug report to [email protected] <mailto:[email protected]>. I’m pretty certain the above code is invalid, which doesn’t change the fact that an ICE is a compiler bug because the compiler should inform you of the invalid code. I haven’t checked the standard for the exact language, but the one image is not allowed to execute code in another image, which could be one interpretation of the above code if it were standard-conforming code. Alternatively, if the intention is to get an object from another image and then invoke a TBP on that object, then I’m pretty certain you have to copy that object into a local data structure in a prior statement and then invoke the TBP on the local data structure or do something like what you do below.
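A minimal sketch of the "copy first, then call the TBP on the local object" alternative described above might look as follows. The type, the module name and the variable `local_copy` are illustrative only and are not part of HASTY; the type has no pointer or allocatable components, so the remote copy is unproblematic here.

```fortran
module core_sketch_module
  implicit none

  type :: core
    integer :: core_value = 0
  contains
    procedure :: core_value_print
  end type core

contains

  subroutine core_value_print(self)
    class(core), intent(in) :: self
    print *, 'image: ', this_image(), ' core value: ', self%core_value
  end subroutine core_value_print
end module core_sketch_module

program local_copy_then_tbp
  use core_sketch_module
  implicit none
  type(core), allocatable :: core_caf[:]
  type(core)              :: local_copy  ! plain local object, not a coarray

  allocate(core_caf[*])
  core_caf%core_value = merge(2, 1, mod(this_image(), 2) == 0)
  sync all  ! every image has defined its value before anyone reads it

  if (this_image() == num_images()) then
    local_copy = core_caf[1]             ! get the remote object in its own statement
    call local_copy%core_value_print()   ! then invoke the TBP on the local copy
  end if
end program local_copy_then_tbp
```

The type-bound call is made on an ordinary local variable, so no procedure is ever invoked through a coindexed object.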

Before asking for help about this ICE, I tried to circumvent it with the following workaround call dictionary_get_clone(self=self%bucket(b)[i], key=key, content=content) where dictionary_get_clone is the (publically exposed) get_clone method of the dictionary type. This non-TBP version compile correctly. However, I think here is (probably) my error. What means passing the remote copy bucket(b)[i] as a dummy argument to a local procedure?

I see nothing wrong with your second version. It simply means to get the remote data and pass it as the actual argument to the keyword argument named “self”. It might be nice to see a short, compilable example so we can see the full procedure interface.

I really do not know if this is allowed and what implies. Anyhow, in the case this is allowed, I think that a (temporary) copy of the remote data must be done and, if so, the copy could be a mess... because it is a derived type containing pointers! Consider the following snippet type :: mess integer :: val=0 type(mess) :: next=>null() end type mess ... type(mess), pointer :: foo=>null() type(mess), pointer :: foo1=>null() type(mess) :: foo2 allocate(foo) foo%val = 0 allocate(foo1) foo1%val = 1 foo1%next=> foo foo2 = foo1 After the last copy foo2 = foo1 I think that foo2%next is not pointing to foo, is this right? I check this now trying to dump a MCVE for Jeff...

Oh boy… you probably know how much I try to avoid pointers. They are especially dangerous with coarrays. If you communicate an object between images and that object contains a pointer, I’m pretty sure the pointer becomes undefined. I don’t see clearly that you’re doing this above, but it sounds like it might be happening based on the context. (On a possibly related note, I don’t think the standard allows associating a pointer with a coarray.) And I'm guessing from the “next” name above that you’re doing this for purposes of constructing a linked list. Linked lists are next on my avoid list right after pointers, but at least in the rare case in which I found it useful to construct a linked list, we constructed it using arrays and indirect addressing rather than using pointers. It makes life much easier. Did you see the video <https://www.youtube.com/watch?v=YQs6IC-vgmo> I posted to c.l.f a while back in which C++ language inventor Bjarne Stroustrup argues that there is almost always a better choice of data structures than a linked list? Damian
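As an illustration of what a linked list built on arrays and indirect addressing can look like, here is a minimal sketch. It is not the implementation referred to above and is not part of HASTY; all names (`index_list`, `list_init`, `list_push_front`, `list_remove`, `nil`) are hypothetical. Links are integer indices into arrays, unused slots are chained into a free list, and both insertion at the head and removal of a known slot take constant time.

```fortran
module index_list_sketch
  implicit none
  integer, parameter :: nil = 0      ! index playing the role of a null pointer

  type :: index_list
    integer, allocatable :: val(:)   ! payload stored in each slot
    integer, allocatable :: next(:)  ! index of the following node, or nil
    integer, allocatable :: prev(:)  ! index of the preceding node, or nil
    integer :: head = nil            ! first node in use
    integer :: free = nil            ! first unused slot (chained through next)
  end type index_list

contains

  subroutine list_init(list, capacity)
    type(index_list), intent(out) :: list
    integer,          intent(in)  :: capacity
    integer :: i
    allocate(list%val(capacity), list%next(capacity), list%prev(capacity))
    list%val  = 0
    list%prev = nil
    ! Chain all slots into the free list: 1 -> 2 -> ... -> capacity -> nil.
    do i = 1, capacity
      list%next(i) = merge(i + 1, nil, i < capacity)
    end do
    list%free = merge(1, nil, capacity > 0)
  end subroutine list_init

  subroutine list_push_front(list, val, slot)
    type(index_list), intent(inout) :: list
    integer,          intent(in)    :: val
    integer,          intent(out)   :: slot  ! slot used, or nil if the list is full
    slot = list%free
    if (slot == nil) return
    list%free       = list%next(slot)        ! pop a slot from the free list
    list%val(slot)  = val
    list%prev(slot) = nil
    list%next(slot) = list%head
    if (list%head /= nil) list%prev(list%head) = slot
    list%head = slot
  end subroutine list_push_front

  subroutine list_remove(list, slot)
    ! Unlink a slot that is currently in use and return it to the free list.
    type(index_list), intent(inout) :: list
    integer,          intent(in)    :: slot
    integer :: p, n
    p = list%prev(slot)
    n = list%next(slot)
    if (p /= nil) then
      list%next(p) = n
    else
      list%head = n
    end if
    if (n /= nil) list%prev(n) = p
    list%next(slot) = list%free
    list%free = slot
  end subroutine list_remove
end module index_list_sketch

program index_list_demo
  use index_list_sketch
  implicit none
  type(index_list) :: list
  integer :: a, b, c, i

  call list_init(list, capacity=4)
  call list_push_front(list, 10, a)
  call list_push_front(list, 20, b)
  call list_push_front(list, 30, c)
  call list_remove(list, b)          ! O(1): no traversal needed

  i = list%head
  do while (i /= nil)                ! visits 30, then 10
    print *, list%val(i)
    i = list%next(i)
  end do
end program index_list_demo
```

Because the links are plain integers, a copy of such an object keeps meaningful links, which is not the case for pointer components copied between images. The fixed capacity is a simplification of this sketch.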

szaghi · 2016-12-03T06:34:46Z

@rouson

Dear Damian,

As I’m sure you know an internal compiler error is always a compiler bug so you might submit this via the GCC Bugzilla https://gcc.gnu.org/bugzilla/ site. If you do, then you should also email the bug report to [email protected] mailto:[email protected]. I’m pretty certain the above code is invalid, which doesn’t change the fact that an ICE is a compiler bug because the compiler should inform you of the invalid code.

Yes, I know what an ICE means, but, as you said, I was almost sure that the first version is invalid, thus I preferred to understand whether it was really invalid or not. Submitting a GCC report is not so easy, and I would like to write a correct, meaningful report to help my GCC superheroes and not waste their time. Now that you have confirmed that it is invalid, I'll try to create an MCVE for the GCC team.

I haven’t checked the standard for the exact language, but the one image is not allowed to execute code in another image, which could be one interpretation of the above code if it were standard-conforming code. Alternatively, if the intention is to get an object from another image and then invoke a TBP on that object, then I’m pretty certain you have to copy that object into a local data structure in a prior statement and then invoke the TBP on the local data structure or do something like what you do below.

This was my intuition when I tried this workaround, but my CAF knowledge is growing empirically. (OT please, consider the idea to write a new book devoted to CAF, it is really necessary...)

I see nothing wrong with your second version. It simply means to get the remote data and pass it as the actual argument to the keyword argument named “self”. It might be nice to see a short, compilable example so we can see the full procedure interface.

I am working on an MCVE for (all of) you. Indeed, it is not so easy to reduce everything: HASTY is rather stupid, but not so simple... yesterday I finished an MCVE for @jeffhammond that I thought had all the needed ingredients, but it works correctly without raising the SIGSEGV 😢

Oh boy… you probably know how much I try to avoid pointers.

Me too, this was my mantra. Nevertheless, when I started to develop my Adaptive Mesh Refinement methods I needed a different data structure from arrays... and I had to play with pointers, my bad.

They are especially dangerous with coarrays. If you communicate an object between images and that object contains a pointer, I’m pretty sure the pointer becomes undefined.

Ohhhh, do they surely become undefined? Are there no solutions?

I don’t see clearly that you’re doing this above, but it sounds like it might be happening based on the context. (On a possibly related note, I don’t think the standard allows associating a pointer with a coarray.

Yes, I was careful to avoid associating pointers between images. The hash table of HASTY currently has 2 getters, get_clone (the one that is really bugged) and get_pointer. The latter works well because it can be used only on the local-image buckets; if the node lives in the buckets of other images, the result of get_pointer is null.

And I'm guessing form the “next” name above, that you’re doing this for purposes of constructing a linked list.

I use a linked list (doubly linked in this case) for the chaining resolution of the collisions that always happen when we use a hash function that is not fully injective. I am not really an expert, but at least for my use case I cannot use perfect hashing without collisions (the tables become huge), thus I need to resolve the collisions. To my knowledge, linked lists work very well for this purpose.

Linked lists are next on my avoid list right after pointers, but at least in the rare case in which I found it useful to construct a linked list, we constructed it using arrays and indirect addressing rather than using pointers.

I have seen some of these approaches, but I found them rather more complex (for my poor software-engineering level) than a plain linked list. Can you give me some good references about these approaches? Your own works (books, papers, reports) are preferred, since I find your English more understandable 😄 but all other references are welcome.

It makes life much easier.

Mmm, I partially agree. As I said, I briefly (superficially) read about some of these approaches and I found them not so simple. However, this was not the main reason why I preferred a plain linked list (if I had to discard all the things that I do not understand the first time, I would probably have to go and take care of my garden...). My main concern for the AMR data structure is to have reasonably good access efficiency, namely something near O(1) on average, while having efficient O(1) put/remove of nodes, which is typically a feature of linked lists. Is this put/remove efficiency possible with indirect array indexing?

Did you see the video https://www.youtube.com/watch?v=YQs6IC-vgmo I posted to c.l.f a while back in which C++ language inventor Bjarne Stroustrup argues that there is almost always a better choice of data structures than a linked list.

No, I missed it. I'll see it when Angelica will sleep 😄

Damian, thank you very very much, as always, great teachings!

szaghi · 2016-12-05T16:26:33Z

@rouson

Dear Damian, I created a (potential) bug report for the ICE I got with GNU gfortran, see it here, number 78682.

I wrote an MCVE that raises the ICE; it is the following

module core_module
  implicit none

  type :: core
    integer :: core_value
    contains
      procedure :: core_value_print
  end type

  contains
    subroutine core_value_print(self)
      class(core), intent(in) :: self
      print*, 'image: ', this_image(), ' core value: ', self%core_value
    end subroutine core_value_print
end module core_module

program gfortran_ice_caf
  use core_module

  implicit none
  type(core), allocatable :: core_caf[:]

  allocate(core_caf[*])

  if (mod(this_image(), 2)==0) then
    core_caf%core_value = 2
  else
    core_caf%core_value = 1
  endif

  if (this_image()==2) call core_caf[1]%core_value_print
end program gfortran_ice_caf

Building it on my workstation with GNU gfortran 6.2.1 I obtain

stefano@zaghi(04:30 PM Mon Dec 05)
~/fortran/compilers_bug/gfortran-ice-caf-derived_type 1 files, 12Kb
→ gfortran -fcoarray=lib gfortran_ice_caf.f90 
gfortran_ice_caf.f90:31:0:

   if (this_image()==2) call core_caf[1]%core_value_print
 
internal compiler error: in gfc_get_tree_for_caf_expr, at fortran/trans-expr.c:1818
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://bugs.archlinux.org/> for instructions.

As soon as it is confirmed as a bug (maybe it is not), I can add it to your AdHoc if you like.

My best regards.

rouson · 2016-12-08T07:42:38Z

Thanks for letting me know. I just added a comment on the bug report.

szaghi · 2016-12-08T16:29:59Z

@rouson

I read it. Thank you too for your great help.

I have rebuilt mpich/openmpi/opencoarrays with the latest 7.0.0 trunk and, as Janus said, the ICE vanished, thus this is an issue for the versions before 7.x. However, the 7.0.0 testing version compiles the invalid code without any warnings/errors, and this is another issue...

@rouson @zbeekman I have another question: the gfortran versions before 7.0.0 complain about derived-type coarrays having allocatable members (but not about pointer ones), thus I had to do some tricks with gfortran 6.2.1, while the new 7.0.0 also accepts derived types with allocatable members. Are allocatable members in a coarray allowed by the 2008/2015 standard? I am thinking of adopting the 7.x trunk as the base version for HASTY...

Cheers.

rouson · 2016-12-08T20:20:25Z

On Dec 8, 2016, at 8:30 AM, Stefano Zaghi wrote: I read it. Thank you too for your great help. I have rebuilt mpich/openmpi/opencoarrays with the latest 7.0.0 trunk and, as Janus said, the ICE vanished, thus this is an issue for the versions before 7.x. However, the 7.0.0 testing version compiles the invalid code without any warnings/errors, and this is another issue...

The standard defines certain “constraints” that tell compiler vendors what they are required to detect and report at compile time. I don’t think there is a constraint related to this issue and I can’t imagine there could be one. Except in trivial cases, this would be difficult or impossible for a compiler to detect without executing the code.

@rouson @zbeekman I have another question: the gfortran versions before 7.0.0 complain about derived-type coarrays having allocatable members (but not about pointer ones), thus I had to do some tricks with gfortran 6.2.1, while the new 7.0.0 also accepts derived types with allocatable members. Are allocatable members in a coarray allowed by the 2008/2015 standard? I am thinking of adopting the 7.x trunk as the base version for HASTY...

Yes, allocatable components of derived-type coarrays are allowed in 2008 and 2015. The implementation of that feature in gfortran/OpenCoarrays is immature. It might work in some cases, but not in others, so proceed with caution. I hope that the support will be more complete within a few weeks. It’s probably not a great idea to adopt a pre-release version as a requirement because pre-release versions can be unstable. I’m a fan of staying at the bleeding edge and I regularly use pre-release versions of gfortran. In fact, I just taught a whole graduate course based on pre-release versions of gfortran, but it was very tricky and only worked out because I was able to get quick responses on bug fixes and update the students to a new version of the compiler midway through the course. The major benefit of using the virtual machine is that I can roll a whole new environment out to students at any time and know that each of them is working in the exact same environment even though each works on the system of their choosing.

Damian

szaghi · 2016-12-09T04:38:26Z

@rouson

Dear Damian,

The standard defines certain “constraints” that tell compiler vendors what they are required to detect and report at compile time. I don’t think there is a constraint related to this issue and I can’t imagine there could be one. Except in trivial cases, this would be difficult or impossible for a compiler to detect without executing the code.

Sure, what I meant is that it is another issue for me, not for the compiler: I am learning CAF empirically, so a compiler that compiles and runs invalid code is a problem for my way of learning.

Yes, allocatable components of derived-type coarrays are allowed in 2008 and 2015. The implementation of that feature in gfortran/OpenCoarrays is immature. It might work in some cases, but not in others so proceed with caution. I hope that the support will be more complete within a few weeks.

Good, I am going with allocatables.

It’s probably not a great idea to adopt a pre-release version as a requirement because pre-release versions can be unstable.

Sure, I agree, but HASTY is an experiment: I have to demonstrate to my bosses that CAF is possible... by the time HASTY is finished I will have gcc 8...

The major benefit of using the virtual machine is that I can roll a whole new environment out to students at any time and know that each of them is working in the exact same environment even though each works on the system of their choosing.

I am alone, I handle my system alone, I have a rolling release GNU Linux OS, no problem to follow gcc updates 😄

Cheers

szaghi · 2016-12-09T15:04:42Z

@rouson @jeffhammond

I have done a lot of dry-cleaning on an MCVE and I finally understand that the SIGSEGV is caused by allocatable members in the coarray: gfortran lower than 7.0.0 complains directly at compile time (but I used a workaround that hides the issue), while gcc trunk 7.0.0 compiles the coarray with the allocatable member but generates a SIGSEGV at runtime. If I make the member static the SIGSEGV vanishes and the results are as expected.

The MCVE is not yet so minimal (more than 1000 SLOC), but in a few hours I hope to reduce it to a minimum: as soon as I have an MCVE can you test it with Cray?

jeffhammond · 2016-12-09T15:38:22Z

The MCVE is not yet so minimal (more than 1000 SLOC), but in a few hours I hope to reduce it to a minimum: as soon as I have an MCVE can you test it with Cray?

SLOC isn't that important. I just want to be able to type "make FC=ftn" and then "srun .. $BIN" and have it work, or something close to that.

szaghi · 2016-12-09T16:04:52Z

@jeffhammond

SLOC isn't that important. I just want to be able to type "make FC=ftn" and then "srun .. $BIN" and have it work, or something close to that.

Sure, it is going to be a one-file test 😄

szaghi · 2016-12-09T16:47:47Z

@jeffhammond @rouson

Dear Jeff and Damian,

in the following there is the most minimal example that I was able to create. Unfortunately, it behaves slightly differently with respect to the HASTY-dry example... read the following.

@rouson Damian: this example generates an ICE with all the gcc versions I have... do you think I have to submit it to the same ticket I opened for the other issue? Or is this more related to allocatable members of coarrays?

The MCVE

I cannot attach a fortran code here, thus please cut/paste the following code

module node_module
  use, intrinsic :: iso_fortran_env

  implicit none
  private
  public :: node

  type :: node
#ifdef RAISE_ERROR
    integer(int32), allocatable :: storage
#else
    integer(int32)              :: storage
#endif
    contains
      procedure :: add
      procedure :: get
  end type node

  contains
    subroutine add(self, storage)
      class(node),    intent(inout) :: self
      integer(int32), intent(in)    :: storage

#ifdef RAISE_ERROR
      if (.not.allocated(self%storage)) allocate(self%storage)
#endif
      self%storage = storage
    end subroutine add

    subroutine get(self, storage)
      class(node),                 intent(in)  :: self
      integer(int32), allocatable, intent(out) :: storage

#ifdef RAISE_ERROR
      if (allocated(self%storage)) allocate(storage, source=self%storage)
#else
      allocate(storage, source=self%storage)
#endif
    end subroutine get
end module node_module

module caf_module
  use, intrinsic :: iso_fortran_env
  use node_module

  implicit none
  private
  public :: caf

  type :: caf
    private
    type(node), allocatable :: array[:]
    contains
      procedure :: add
      procedure :: get
      procedure :: initialize
  end type caf

  contains
    subroutine add(self, storage)
      class(caf),     intent(inout) :: self
      integer(int32), intent(in)    :: storage

      call self%array%add(storage=storage)
    end subroutine add

    subroutine get(self, image, storage)
      class(caf),                  intent(in)  :: self
      integer(int32),              intent(in)  :: image
      integer(int32), allocatable, intent(out) :: storage
      type(node)                               :: copy_of_remote_node

      if (image/=this_image()) then
        copy_of_remote_node = self%array[image]
        call copy_of_remote_node%get(storage=storage)
      else
        call self%array%get(storage=storage)
      endif
    end subroutine get

    subroutine initialize(self)
      class(caf), intent(inout) :: self

      if (allocated(self%array)) deallocate(self%array)
      allocate(self%array[*])
    end subroutine initialize
end module caf_module

program sigsegv_caf_dt
  use, intrinsic :: iso_fortran_env
  use caf_module

  implicit none
  type(caf)                   :: caf_storage
  integer(int32), allocatable :: storage

  call caf_storage%initialize

  sync all
  call caf_storage%add(storage=int(this_image(), int32))
  sync all

  if (this_image()==1) then
    print*, 'hello from image: ', this_image()
    print*, 'test get from image 2 by image 1'
    call caf_storage%get(storage=storage, image=2_int32)
    if (allocated(storage)) then
      print*, 'storage cloned: ', storage
    else
      print*, 'get_clone failed'
    endif
  endif
end program sigsegv_caf_dt

Save it with the .F90 extension: it has one cpp macro to enable/disable the SIGSEGV (namely, to enable/disable the allocatable storage member of the node type held in the coarray).

My log

I compiled it with GNU Fortran (GCC) 7.0.0 20161206 (experimental), MPICH 3.2.0 (compiled by gcc 7.0.0) and OpenCoarrays 1.7.5 (compiled by gcc 7.0.0).

Working good

If I compile the static version with

→ caf -fcoarray=lib sigsegv_caf_dt.F90

I obtain the expected result

stefano@zaghi(05:34 PM Fri Dec 09) desk {opencoarrays-1.7.5-gnu-7.0.0 - OpenCoarrays 1.7.5 with gcc 7.0.0 environment}
~/fortran/compilers_bug/gfortran_sigsegv_caf_dt_allocatable_member 4 files, 80Kb
→ cafrun -np 2 a.out
 hello from image:            1
 test get from image 2 by image 1
 storage cloned:            2

If I enable the allocatability of storage member

→ caf -fcoarray=lib sigsegv_caf_dt.F90 -DRAISE_ERROR

I obtain... an ICE (sigh, my bad... with the HASTY example it compiles right, but generates a runtime SIGSEGV)

stefano@zaghi(05:34 PM Fri Dec 09) desk {opencoarrays-1.7.5-gnu-7.0.0 - OpenCoarrays 1.7.5 with gcc 7.0.0 environment}
~/fortran/compilers_bug/gfortran_sigsegv_caf_dt_allocatable_member 4 files, 80Kb
→ caf -fcoarray=lib sigsegv_caf_dt.F90 -DRAISE_ERROR
sigsegv_caf_dt.F90:87:0:

 end module caf_module

internal compiler error: Segmentation fault
0xc0db4f crash_signal
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/toplev.c:333
0xeb1764 recompute_tree_invariant_for_addr_expr(tree_node*)
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/tree.c:4317
0xeb1d7c build1_stat(tree_code, tree_node*, tree_node*)
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/tree.c:4414
0x92c76c build1_stat_loc
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/tree.h:3903
0x92c76c fold_build1_stat_loc(unsigned int, tree_code, tree_node*, tree_node*)
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/fold-const.c:12139
0x6f204f gfc_build_addr_expr(tree_node*, tree_node*)
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/fortran/trans.c:298
0x70532b structure_alloc_comps
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/fortran/trans-array.c:8329
0x7827b3 gfc_trans_deallocate(gfc_code*)
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/fortran/trans-stmt.c:6477
0x6f1bf7 trans_code
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/fortran/trans.c:1942
0x7742f3 gfc_trans_if_1
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/fortran/trans-stmt.c:1303
0x77c39a gfc_trans_if(gfc_code*)
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/fortran/trans-stmt.c:1334
0x6f1ce7 trans_code
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/fortran/trans.c:1878
0x7742f3 gfc_trans_if_1
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/fortran/trans-stmt.c:1303
0x77c39a gfc_trans_if(gfc_code*)
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/fortran/trans-stmt.c:1334
0x6f1ce7 trans_code
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/fortran/trans.c:1878
0x77e271 gfc_trans_simple_do
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/fortran/trans-stmt.c:1924
0x77e271 gfc_trans_do(gfc_code*, tree_node*)
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/fortran/trans-stmt.c:2057
0x6f1cba trans_code
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/fortran/trans.c:1890
0x723038 gfc_generate_function_code(gfc_namespace*)
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/fortran/trans-decl.c:6271
0x6f6949 gfc_generate_module_code(gfc_namespace*)
        /opt/arch/gcc/opencoarrays/prerequisites/downloads/trunk/gcc/fortran/trans.c:2164
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.

So my questions are:

Is the above code valid?
If it is, does the Cray compiler handle it correctly?

Summary

copy the above example;
compile with caf -fcoarray=lib sigsegv_caf_dt.F90 -DRAISE_ERROR for the error-enabled version or caf -fcoarray=lib sigsegv_caf_dt.F90 for the error-disabled version.
run with 2 images, e.g. cafrun -np 2 a.out

As always, your help is priceless, you are my heroes!

Cheers.
