[Bug] Conflict with matplotlib #47

Open
devanubis opened this issue Mar 1, 2022 · 8 comments

Comments

@devanubis

I've discovered a weird inexplicable conflict between helics and matplotlib.

The bug is if matplotlib is imported before helics then helics.helicsCreateCombinationFederateFromConfig() raises an exception when trying to connect to a remote broker (in a separate docker container).

The exception I get with the latest versions of helics is:

helics.capi.HelicsException: [-3] --brokerport: Value  not in range 0.000000 to 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
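A note on reading that message: the upper bound looks like the largest finite double printed with a fixed-point format, and the value between "Value" and "not in range" is blank, which suggests the port reached the range check as an empty value rather than as 23456. A small sketch to check the first part; it only inspects Python's own float limits and does not touch helics:

```python
import sys

# Format the largest finite double the way a C-style "%f" would print it.
upper_bound = "%f" % sys.float_info.max

# Split into the integer digits and the six decimal places.
integer_part, _, fraction = upper_bound.partition(".")

print(len(integer_part))   # number of digits before the decimal point
print(integer_part[:17])   # leading digits, to compare with the message
print(fraction)            # decimal places
```

If the leading digits and the digit count match the bound in the exception, the range is simply "any non-negative double", so the check is failing on the value itself and not on the limits.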

Here's a gist with a minimal docker-compose.yml, Dockerfile image and a simple reproduction.py script which triggers this.

https://gist.github.com/devanubis/ec452950c317d3684016dd3b609ebca3

This can be executed with just docker-compose up.

I've found the following:

  • Moving import matplotlib to after import helics avoids the exception
    • Which is very strange
  • The exception also triggers when the federate and broker run on the same host
    • Without the broker_address but with the broker_port
    • It was just easiest to do the reproduction with docker

I don't know enough about either library to dig much deeper.

I can also report this to https://github.com/matplotlib/matplotlib if you want, but thought I'd start with pyhelics since the exception is from here.

Fortunately I've been able to remove matplotlib from our affected code, so this isn't urgent for us, but it was painful to track down what was causing this weirdness.

@kdheepak
Contributor

kdheepak commented Mar 1, 2022

Wow. Thanks for reporting. I can imagine that was super frustrating to debug.

Does this problem only occur in docker? When I run it on my mac, I don't have a problem:

[Screenshot: Screen Shot 2022-03-01 at 11 05 53 AM]

I looked through my code where I've used matplotlib and helics in the same process and found that I've imported matplotlib before and after helics, and never had a problem with this.

I've never tried this in a Docker environment though. I'm currently not able to run it (because of some weird permissions issues on my end, I'll have to try on my other computer). I'll experiment and report back.

FYI @phlptp @nightlark

@phlptp
Member

phlptp commented Mar 1, 2022

That exception comes from argument validation when processing the federate config. One place to look would be if the json that gets created is different in those two cases, specifically on the broker_port field

@kdheepak
Contributor

kdheepak commented Mar 2, 2022

I wonder if socket.gethostbyname('helics') is returning a different value if the imports are reordered?
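One way to test that would be to resolve the name once before and once after the imports and compare the results. A minimal sketch, assuming the broker is reachable under the hostname 'helics' as in the call above; the helper name resolve is made up for illustration:

```python
import socket


def resolve(hostname):
    """Return the IPv4 address for hostname, or None if it cannot be resolved."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror is a subclass of OSError; treat any failure as "unresolved".
        return None


# A literal IPv4 address is returned unchanged, without a DNS lookup.
print(resolve("127.0.0.1"))
```

Calling resolve('helics') at the top of the script, and again after both import matplotlib and import helics, would show whether the import order changes the address that ends up in the config.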

@devanubis
Author

devanubis commented Mar 2, 2022

Thanks for the quick responses!

The JSON strings appear to be identical in both cases:

Imported before:

{"name": "test_federate", "core_type": "zmq", "federates": 1, "broker_port": 23456, "broker_address": "172.22.0.2"}

Imported after:

{"name": "test_federate", "core_type": "zmq", "federates": 1, "broker_port": 23456, "broker_address": "172.22.0.2"}

And shuffling the order around to get the socket and create the JSON first before the other imports still results in the exception if matplotlib is imported before helics.
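For anyone repeating this comparison, a parsed check is a bit stricter than eyeballing the strings, because it also flags a field whose value looks the same but has a different type (int vs. string). A rough sketch using only the standard library; config_diff is a hypothetical helper, not part of helics:

```python
import json

_MISSING = object()  # marks a key that is absent from one of the configs


def config_diff(first, second):
    """Return {key: (first_value, second_value)} for fields that differ in value or type."""
    left, right = json.loads(first), json.loads(second)
    diff = {}
    for key in sorted(set(left) | set(right)):
        lv, rv = left.get(key, _MISSING), right.get(key, _MISSING)
        if lv != rv or type(lv) is not type(rv):
            diff[key] = (lv, rv)
    return diff


# The two strings reported above, one per import order.
imported_before = '{"name": "test_federate", "core_type": "zmq", "federates": 1, "broker_port": 23456, "broker_address": "172.22.0.2"}'
imported_after = '{"name": "test_federate", "core_type": "zmq", "federates": 1, "broker_port": 23456, "broker_address": "172.22.0.2"}'

print(config_diff(imported_before, imported_after))
```

An empty result means the JSON handed to helics is the same in both cases, which points the investigation at how the string is parsed rather than at how it is built.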

I've also just checked this with the python 3.9-slim and 3.8-slim images and they both hit this exception the same as the 3.10-slim image I initially reproduced with.

I know helics only reports 3.8 compatibility, but FWIW we've been using it with 3.9 for a while now and it seems to have run just fine with 3.10.

@kdheepak asked: "Does this problem only occur in docker? When I run it on my mac, I don't have a problem:"

Running the reproduce.py script on a single (Ubuntu) system, without the broker_address set and with a local broker started as helics_broker --federates=1 --name=test --localport=23456, still results in an exception being raised. That host has an older version of helics (2.5.2) for reasons, but the exception is still thrown (it just doesn't have any message text).

@devanubis
Author

Well here's a bit of a work-around at least:

    "broker_port": "23456",

Setting the broker_port as a string (not an int) before encoding to JSON avoids this error.
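As a sketch of what that looks like when building the config, using the field values from the JSON shown earlier in the thread; federate_config and its port_as_string flag are names invented for this example:

```python
import json


def federate_config(port, address, port_as_string=True):
    """Build the federate config JSON, optionally encoding the port as a string."""
    config = {
        "name": "test_federate",
        "core_type": "zmq",
        "federates": 1,
        # Work-around from this thread: send the port as "23456" instead of 23456.
        "broker_port": str(port) if port_as_string else port,
        "broker_address": address,
    }
    return json.dumps(config)


print(federate_config(23456, "172.22.0.2"))
```

The resulting string would then go to helics.helicsCreateCombinationFederateFromConfig() exactly as before. This only sidesteps the symptom; it says nothing about why the integer form fails.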

I still have no idea what could possibly be interfering with the gridlabd code which parses the port number.

@kdheepak
Contributor

We are able to reproduce the segfault with matplotlib and helics, and @nightlark and @phlptp have ideas for how to resolve it.

@nightlark
Member

So what we've found is that on certain systems (Quartz, which is RHEL 7, and uses Python executables compiled by gcc 4.9.3) the order of loading shared libraries matters; it doesn't seem to be particularly dependent on the Python version.

Minimal test case that results in a segfault:

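import ctypes
# Load the HELICS shared library first...
ctypes.cdll.LoadLibrary('lib/python3.8/site-packages/helics/install/lib64/libhelics.so')
# ...then a C++ extension module pulled in by matplotlib; this second load crashes.
ctypes.cdll.LoadLibrary('lib/python3.8/site-packages/kiwisolver/_cext.cpython-38-x86_64-linux-gnu.so')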

In addition to the shared library from the kiwisolver dependency, two shared libraries included in matplotlib (_contour.cpython-38-x86_64-linux-gnu.so and _tri.cpython-38-x86_64-linux-gnu.so) also cause the same segfault.

All instances result in an error message along the lines of *** Error in 'python': free(): invalid pointer: 0x00002aaaaec90f40 ***.
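Since the crash kills the interpreter instead of raising a Python exception, it can help to run each load order in a child process and look at the exit status. A minimal sketch with the standard library only; run_isolated and describe are hypothetical helper names, and the negative-return-code convention applies to POSIX systems:

```python
import signal
import subprocess
import sys


def run_isolated(snippet):
    """Run snippet in a fresh interpreter and return the child's exit status."""
    result = subprocess.run([sys.executable, "-c", snippet], capture_output=True)
    return result.returncode


def describe(returncode):
    """Turn a subprocess return code into a short human-readable summary."""
    if returncode < 0:
        # On POSIX, a negative code means the child was terminated by that signal.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status " + str(returncode)


print(describe(run_isolated("pass")))
```

Passing the two LoadLibrary lines from the test case as the snippet, once in each order, would give a side-by-side result per load order without taking down the calling process.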

After testing with a copy of the HELICS shared library that had all the extra symbols hidden, the same segfault occurs -- so the underlying cause is still a mystery; as is why the same thing doesn't happen with more of the libraries included in matplotlib.

@nightlark
Member

nightlark commented Oct 10, 2022

Narrowed down the problem to libstdc++ or libgcc_s, most likely conflicting versions between the statically linked copy in HELICS and the system copy. That is kind of weird: none of those symbols should be visible/used in processing relocations for libraries loaded by matplotlib. Afaik, we are hiding all the symbols we can from the statically linked copy of libstdc++; maybe something strange is happening with a 3rd party library whose build system we don't have as much control over.

When matplotlib is loaded first, the system copies of libstdc++ and libgcc_s get loaded, so the functions they import get used and nothing breaks; when helics is loaded first, the matplotlib libraries try to use parts of the statically linked copies of those libraries, which conflict with the system libraries.

This also explains why the crash doesn't happen on all systems -- some systems have a copy of libstdc++ and libgcc_s that is compatible with the statically linked version in HELICS (system copy is same or newer version than included in HELICS).
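To see which system copies a given process has actually mapped, one option on Linux is to read /proc/self/maps after the imports. A sketch under that assumption; mapped_libraries and current_process_libraries are made-up helper names. Note that a copy statically linked into libhelics.so will not show up as a separate entry, so this only reveals the shared system libraries:

```python
from pathlib import Path


def mapped_libraries(maps_text, needles=("libstdc++", "libgcc_s")):
    """Return the sorted, de-duplicated paths in a /proc/<pid>/maps dump that match needles."""
    found = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, then an optional pathname.
        fields = line.split(None, 5)
        if len(fields) == 6 and any(needle in fields[5] for needle in needles):
            found.add(fields[5].strip())
    return sorted(found)


def current_process_libraries():
    """Inspect this process; returns an empty list where /proc/self/maps is unavailable."""
    maps = Path("/proc/self/maps")
    return mapped_libraries(maps.read_text()) if maps.exists() else []


print(current_process_libraries())
```

Running this after import matplotlib on an affected and an unaffected machine would show which libstdc++ and libgcc_s files each one picks up, which can then be compared against the version HELICS was built with.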


I have some ideas for how to fix this, but I'm not sure if it should be done as part of building pyhelics wheels or the HELICS release binaries yet (or both) -- it might depend on whether the segfault in matHELICS on Linux has the same underlying cause.
