
test_controller still hangs sometimes #192

Open
irapha opened this issue Feb 22, 2017 · 12 comments

@irapha
Member

irapha commented Feb 22, 2017

In a fresh install, it will hang on the first test run, then pass on all subsequent runs.

I'm marking this low priority because Josh now has Google to worry about and because it does work most of the time. But I'd like to at least know why this happens.

@joshuamorton
Member

Sahit seemed to have a similar issue: things seem to hang when you connect to an already-publishing node. No idea why, unfortunately.

@joshuamorton
Member

Is this still the case with test_controller? I can't test it.

@irapha
Member Author

irapha commented Jun 20, 2017

@chsahit ^

@joshuamorton
Member

@chsahit bumperoni. I don't believe it is having issues on CI, but I'd still like to know.

@joshuamorton
Member

@jgkamat

Ok, so this seems to fail almost every time on CI, but works 100% of the time locally. Because we're on docker, sshing into the CI environment is practically useless, since everything is done inside of a docker container created inside of the CI environment. I'm trying to run locally to see if I can repro, but I run into the following:

As part of our install script, I delete a symlink (https://github.com/gtagency/buzzmobile/blob/master/install#L44) created as part of the virtualenv creation process. This works as part of the install-at-startup, and works when run on Circle, but when I run locally, runtests.sh fails because this symlink doesn't exist. If I skip the step, it fails moments later on something else. Any ideas?
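One way to make that delete step tolerant of both states (fresh install, where the link exists, and re-run, where it's already gone) is to guard the removal. This is a sketch; remove_symlink_if_present and its path argument are made-up names, not what the actual install script uses:

```shell
# Remove a symlink only if one is actually present, so the script works
# on a fresh install (link exists) and on a re-run (link already gone).
remove_symlink_if_present() {
    link_path="$1"
    if [ -L "$link_path" ]; then
        rm "$link_path"
    fi
}
```

rm -f would also not fail on a missing file, but the explicit -L test avoids accidentally deleting a regular file that happens to sit at that path.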

@jgkamat

jgkamat commented Aug 21, 2017

Hmm, I can try to take a look at this on tuesday earliest.

You can definitely still SSH in via the CircleCI SSH feature, it's just a bit more complicated. SSH in, then run docker ps to find the running container that's hanging, then run docker exec -ti <container sha> bash and you should get a shell.
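Those steps can be wrapped in a tiny helper. attach_first_container is a made-up name, and this assumes you've already used CircleCI's SSH feature to get onto the build machine and that docker is on PATH:

```shell
# Exec a shell inside the first running container, per the steps above.
attach_first_container() {
    # docker ps -q prints just the IDs of running containers
    sha="$(docker ps -q | head -n 1)"
    if [ -z "$sha" ]; then
        echo "no running containers" >&2
        return 1
    fi
    docker exec -ti "$sha" bash
}
```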

I'm a little confused as to why the symlink would exist in the remote container but not the local one. Maybe there's something slightly different about the environment variables being passed in?

I definitely noticed this problem earlier, but I didn't try to debug it. I have a feeling it has to do with a clean install of buzzmobile (if you get past one run, it suddenly becomes fine). I'll get back to you soon and let you know what I find out.

@joshuamorton
Member

I can't; it fails with an error: Error response from daemon: Unsupported: Exec is not supported by the lxc driver, which led down a fun rabbit hole that was ultimately unsuccessful :/

Apparently Circle doesn't use normal docker.

@joshuamorton
Member

Alright, now to bamboozle everyone even more: I can't repro this in a local docker container.

I ran, on my host machine,

  1. sudo docker create -i -t arbitrary_value bash. arbitrary_value here is the name of the docker image from buildbaseimage.sh, I believe. Basically, in our case it's an empty docker container with the empty folder ~/catkin_ws/src/buzzmobile defined.
  2. sudo docker start -a -i <THE_HASH_PRINTED>

Then, in the docker image

  1. cd .. (to ~/catkin_ws/src)
  2. git clone https://github.com/gtagency/buzzmobile.git
  3. cd buzzmobile
  4. ./install (this succeeded, unlike trying to docker run last week) (!!?????)
  5. rosrun (failed, as expected)
  6. pytest (failed, as expected)
  7. source bin/activate
  8. ci_scripts/style
  9. ci_scripts/unittest (passed, did not hang)
  10. deactivate
  11. ci_scripts/unittest (passed, did not hang)
  12. ci_scripts/simulation (failed with ImportError: No module named googlemapskey as tracked in Implement Simulation Using ROS and Gazebo #112)

I'm fully bamboozled here, because it is quite clearly hanging on CI, but it's not hanging here. As an attempted fix I guess we could try to have the install step run git clone instead of anything else, but that doesn't seem any better.

@jgkamat

jgkamat commented Aug 27, 2017

Yup, I can't seem to reproduce it locally on a single machine (after one run).

Since it's flaky (it doesn't happen every time) and it happens on CI but not locally, this smells like a deadlock/race condition, somehow made worse when you have less parallelization (CI only has 2 threads).

It's possible some subtle change to CircleCI caused this to start triggering now. I have no idea how ROS works, though, so it might not be related.

I think this is a semi-recent change, though, since d5aef13 builds fine (at least for one try; you could try rebuilding it to see if it happens consistently). It seems to fail once I merge master in, though.

I actually noticed this back a long time ago but I thought it was something up with your testing suite, so I didn't comment.

I would try disabling one of the two tests you're running (one at a time) to help narrow down the problem. I would start with test_controller, since the run I just did seems to be hanging on that (but I don't know how pytest output works, so I might be wrong).
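One way to narrow it down without watching the build: run each suspect test under a hard time limit, so a hang shows up as a failure instead of a stuck job. This is a sketch assuming GNU coreutils timeout is available in the container; run_with_limit and the pytest invocation below are illustrative, not the actual ci_scripts commands:

```shell
# Run a command under a time limit; GNU timeout exits with status 124
# when it kills the command, turning a silent hang into a visible failure.
run_with_limit() {
    limit="$1"
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}

# e.g. run one test file at a time to see which one hangs:
#   run_with_limit 60 python -m pytest tests/test_controller.py
```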

@joshuamorton
Member

I'm still leaning towards it being an issue in pyrostest, if only because https://circleci.com/gh/gtagency/buzzmobile/596 passes.

My guess would be that there's something that can spin if the context managers aren't being used correctly, but I'm not sure why exactly that would be the case.

@joshuamorton
Member

Ok nevermind. I added an additional test in gtagency/pyrostest#26, and that works just fine. So yeah, this appears to be a weird CI issue.

@joshuamorton
Member

Added some new things to pyrostest that should mitigate this. In testing it appeared to make the tests fail early instead of hanging, so that's nice.
