
ISSUE #181 analysis

Nuno Guedelha edited this page Apr 22, 2015 · 7 revisions

How to reproduce the issue :

  1. set the environment variable YARP_CLOCK to “/clock”
  2. launch in one terminal: > yarpserver
  3. launch in another terminal: > gazebo --verbose -s libgazebo_yarp_clock.so
  4. in gazebo, insert the “icub” model
  5. delete the model

Below are the last gazebo server log messages you should see in the terminal :

*** GazeboYarpIMU closing ***
Closing Server Inertial...

=> Gazebo time in the client is then frozen
=> you can still use (in another terminal) the clock plugin rpc interface (> yarp rpc /clock/rpc) :

  • pauseSimulation, continueSimulation, stepSimulation, resetSimulation have no effect on the simulator
  • getSimulationTime and getStepSize still return the respective gazebo parameters (the returned time is always the same frozen gazebo time)

Debugging environment :

  • build (for Xcode) yarp, icub-main, codyco-superbuild, gazebo-yarp-plugins to the latest master versions
  • update environment variables for Xcode build paths
  • open GazeboYarpPlugins project (GazeboYarpPlugins.xcodeproj) in Xcode
  • proceed with steps 1 to 3 of previous section (if gazebo is not run from terminal, the environment variables won't be set in Xcode debug environment)
  • in Xcode, attach to the new running process. For this, go to menu “Debug” —> “Attach to process” and select the process by name or PID (“gzserver”)
  • proceed with steps 4 to 5 of previous section
  • pause the process “gzserver” in the debugger

What is happening when YARP and GAZEBO hang during removal of the robot model :

Closure of ServerInertial blocked…

We have paused the execution of “gzserver”. We can then see, in the Debug navigator of Xcode, the backtrace of all the running threads :

thread 1 is blocked indefinitely while closing the IMU. The function call tree is depicted below :
=> GazeboYarpIMU::~GazeboYarpIMU()
	=> yarp::dev::PolyDriver::close()
		=> yarp::dev::ServerInertial::close()	--> displays "Closing Server Inertial…\n", checks if the IMU object still exists, and in that case calls stop()
			=> yarp::os::Thread::stop()	--> here, stopping is set to true, and the onStop() function of ServerInertial (IMU) is called
				=> yarp::os::impl::ThreadImpl::close()	--> ThreadImpl holds services for handling semaphores and threads (create, run, stop, close); closing is set to true
					=> yarp::os::impl::ThreadImpl::join(-1)	--> blocking call without timeout
						=> ACE_Thread::join(hid, NULL)	--> hid is the ID of the thread executing the run() method of ServerInertial
							… semwait_signal	--> we wait indefinitely for the thread to terminate and be released (handled through a semaphore)

thread running ServerInertial…

We can now look for the thread executing the run() method of ServerInertial. For this we can either :

  • search for yarp::dev::ServerInertial::run() in the code, set a breakpoint there, restart gazebo, run the "gzserver" process until breakpoint and identify the respective running thread, then continue until deadlock occurs.
  • or instead, search the backtraces of all threads for the address of the ThreadCallbackAdapter instance. This leads to the owner (ServerInertial), since ThreadCallbackAdapter calls the owner’s run() method.

We find the thread running yarp::dev::ServerInertial::run() (thread 22 in the analysed log). This thread is waiting indefinitely on a semaphore (its trace ends in semaphore_wait_trap) :

yarp::dev::ServerInertial::run()
{
	…
	while(!isStopping())			// break condition is stopping=TRUE
	{
		…
		// publish on YARP port Measurement data, ROS topic if required
		…
		yarp::os::Time::delay(k);	// minimum delay for the polling of stopping (usually around 0.01 s). The thread hangs here.
	}
	…
}

yarp::os::Time::delay(k)
        - - - - >  yarp::os::NetworkClock::delay(double)
            => yarp::os::Semaphore::wait()
                => semaphore_wait_trap

At this point, we know that we never return from the delay, and so we never return from thread->run(), never execute thread->threadRelease(), and never release the semaphore blocking thread 1 (closure of ServerInertial).

YARP Network clock and delay blocked…

In the NetworkClock class, delays are handled through semaphores, kept in a list (class Waiters) of pairs (timeout_delay | semaphore). This list is updated periodically by NetworkClock::read(..), which releases the semaphores that have timed out. After the robot model is removed, a breakpoint in this method never triggers. We then check the normal call tree, when the robot model is not being removed :

PortCoreInputUnit::run()
	=> ip->beginRead()											(reading carrier specific preamble)
		=> man.readBlock(ip->getReceiver().modifyIncomingData(br),id,os);		[PortCore::readBlock(ConnectionReader& reader, void *id, OutputStream *os)]
			=> result = this->reader->read(reader);						[PortCoreAdapter::read(ConnectionReader& reader)]
				=>permanentReadDelegate->read(reader);				[NetworkClock::read]

After the robot is removed, we get stuck in beginRead(), waiting for the next data from the /clock port to arrive.

So we can conclude that the clock data is no longer published on the /clock port and that this is causing the whole process to hang. The current design does not allow the gazebo clock data to be interrupted.
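The mechanism described in this section can be illustrated with a minimal, self-contained sketch. It is not the YARP implementation: SketchSemaphore, SketchNetworkClock, onClockMessage() and demoDelayReleasedByTicks() are hypothetical names, and the standard C++ threading library replaces ACE. It only assumes what is stated above: delay() registers a (wake-up time | semaphore) pair, and only the arrival of a clock message can release it.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <list>
#include <memory>
#include <mutex>
#include <thread>

// Minimal binary semaphore built on a mutex and a condition variable.
class SketchSemaphore {
public:
    void wait() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return m_posted; });
    }
    void post() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_posted = true;
        }
        m_cv.notify_one();
    }
private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    bool m_posted = false;
};

// Hypothetical stand-in for a network-driven clock (assumption, not YARP code).
class SketchNetworkClock {
public:
    // Blocks the caller until the received time reaches "now + seconds".
    void delay(double seconds) {
        assert(seconds >= 0.0);
        auto sem = std::make_shared<SketchSemaphore>();
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_waiters.push_back({m_now + seconds, sem});
        }
        sem->wait();  // never returns if onClockMessage() is no longer called
    }

    // Plays the role of the read callback: invoked for each clock message.
    void onClockMessage(double simTime) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_now = simTime;
        for (auto it = m_waiters.begin(); it != m_waiters.end();) {
            if (it->wakeTime <= m_now) {
                it->sem->post();           // release the timed-out waiter
                it = m_waiters.erase(it);
            } else {
                ++it;
            }
        }
    }

    std::size_t pendingWaiters() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_waiters.size();
    }

private:
    struct Waiter {
        double wakeTime;
        std::shared_ptr<SketchSemaphore> sem;
    };
    mutable std::mutex m_mutex;
    std::list<Waiter> m_waiters;
    double m_now = 0.0;
};

// Returns true if the delay stays pending until a late enough clock message arrives.
bool demoDelayReleasedByTicks() {
    SketchNetworkClock simClock;
    std::thread worker([&simClock] { simClock.delay(0.01); });
    while (simClock.pendingWaiters() == 0) { std::this_thread::yield(); }
    simClock.onClockMessage(0.005);                     // too early: still blocked
    const bool stillPending = (simClock.pendingWaiters() == 1);
    simClock.onClockMessage(0.02);                      // releases the waiter
    worker.join();
    return stillPending && simClock.pendingWaiters() == 0;
}
```

If the second onClockMessage() call were removed, worker.join() would block forever, which is the situation observed on thread 1 once the clock data stops arriving.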

yarp clock plugin and gazebo…

GazeboYarpClock::clockUpdate() is called on gazebo's onUpdate events to read the gazebo clock and publish it to the /clock port. A breakpoint in this function never triggers after the robot model has been removed. Gazebo's simulation time is also frozen, as m_world->GetSimTime() always returns the same value. In fact, gazebo is waiting for ServerInertial to close before updating its world simulation clock, which makes sense if the component being closed interacts with the physics simulation (we can see this as an “instantaneous removal” in the simulated world).

If we hack the GazeboYarpClock::clockStep method to generate a fake clock on the /clock port, we unblock YARP and Gazebo, and everything returns to normal until the next model removal. We therefore send the "stepSimulation" command through the rpc interface as many times as needed to get through the complete removal of the robot, until gazebo's timing is running again.

void GazeboYarpClock::clockStep(unsigned int step)
{
    // Hack: 'step' is ignored. The fake time is static, so it keeps advancing across calls.
    static gazebo::common::Time currentTime = m_world->GetSimTime();
    gazebo::common::Time endTime = currentTime;
    endTime += 0.05; // total wait of 50ms
    if (m_clockPort) {
        // Publish a fake clock on /clock, one message per step size
        while(currentTime < endTime)
        {
            yarp::os::Bottle& b = m_clockPort->prepare();
            b.clear();
            b.addInt(currentTime.sec);
            b.addInt(currentTime.nsec);
            m_clockPort->write();
            yarp::os::SystemClock::delaySystem(getStepSize());
            currentTime += getStepSize();
        }
    }
}

conclusion…

There is no localised issue in the clock plugin, YARP or gazebo, but rather a global design issue in how time is handled between these components.

Proposed solutions :

A first solution would be to use a non-blocking stop procedure, for instance askToStop() instead of stop() in the YARP thread handling. This would be less robust with respect to dependencies between components, and in fact does not work as is, because later calls to any delay function still hang.
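The limitation of the first solution can be illustrated with a small sketch, under stated assumptions: TickGate, PollingWorker and demoAskToStopStillNeedsTick() are hypothetical names built on the standard C++ library, not YARP classes. The worker polls a stopping flag between waits on a tick source that stands in for the /clock stream. The non-blocking request returns immediately, but the worker can only observe the flag once another tick arrives.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical tick source standing in for the clock data stream.
class TickGate {
public:
    // Blocks until the next tick() call, like a delay on a network-driven clock.
    void waitForTick() {
        std::unique_lock<std::mutex> lock(m_mutex);
        const unsigned long seen = m_ticks;
        ++m_blocked;
        m_cv.wait(lock, [&] { return m_ticks != seen; });
        --m_blocked;
    }
    void tick() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            ++m_ticks;
        }
        m_cv.notify_all();
    }
    int blocked() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_blocked;
    }
private:
    mutable std::mutex m_mutex;
    std::condition_variable m_cv;
    unsigned long m_ticks = 0;
    int m_blocked = 0;
};

// Hypothetical worker: polls a stopping flag between clock-driven waits.
class PollingWorker {
public:
    explicit PollingWorker(TickGate& gate)
        : m_gate(gate), m_thread([this] { run(); }) {}

    void askToStop() { m_stopping = true; }        // non-blocking request
    void stop() { askToStop(); m_thread.join(); }  // blocking: waits for run() to return
    bool finished() const { return m_finished; }

private:
    void run() {
        while (!m_stopping) { m_gate.waitForTick(); }
        m_finished = true;
    }
    TickGate& m_gate;
    std::atomic<bool> m_stopping{false};
    std::atomic<bool> m_finished{false};
    std::thread m_thread;  // declared last so the flags exist before run() starts
};

// Returns true if the worker stays blocked after askToStop() and only exits after a tick.
bool demoAskToStopStillNeedsTick() {
    TickGate gate;
    PollingWorker worker(gate);
    while (gate.blocked() == 0) { std::this_thread::yield(); }
    worker.askToStop();                       // returns immediately
    const bool stuck = !worker.finished();    // worker is still inside waitForTick()
    gate.tick();                              // only a new tick lets it see the flag
    worker.stop();
    return stuck && worker.finished();
}
```

In this sketch the caller is no longer blocked by the stop request itself, but the worker thread still cannot terminate without clock data.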

A second, more robust solution would be to handle, within the interaction between YARP, Gazebo and the clock plugin, the switching between simulation timing (generated by gazebo) and system timing. The impacts on the different components are described below :

YARP: yarp should offer in its API a service to switch the YARP clock from system to network and vice versa whenever necessary, in a smooth way. This service already exists (Time::useSystemClock(), Time::useNetworkClock(), etc) but it doesn't seem to handle a synchronised switch. A possible improvement would be to implement a virtual clock in YARP whose value would be continuous while the clock frequency changes.

Gazebo Yarp Clock plugin: we should add an "onUpdate" behaviour that, when a model is removed, lets the plugin switch the clock source of YARP (system <-> network). This proposal needs to be elaborated …
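As a starting point for that elaboration, the virtual clock idea mentioned for YARP could look like the sketch below. It is only an illustration under assumptions: VirtualClockSketch and its methods are hypothetical and are not part of the YARP API. The clock follows one of two sources and recomputes an offset at each switch, so the reported value never jumps even though its rate changes with the source.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical continuous clock (assumption, not an existing YARP class).
class VirtualClockSketch {
public:
    enum class Source { System, Network };

    // Feed the latest reading of each source, in seconds.
    void updateSystem(double t)  { m_system = t; }
    void updateNetwork(double t) { m_network = t; }

    // Switch source; the offset is recomputed so that now() does not jump.
    void switchTo(Source s) {
        const double before = now();
        m_source = s;
        m_offset = before - raw();
    }

    // Continuous virtual time: raw reading of the active source plus offset.
    double now() const { return raw() + m_offset; }

private:
    double raw() const {
        return m_source == Source::System ? m_system : m_network;
    }
    Source m_source = Source::Network;
    double m_system = 0.0;
    double m_network = 0.0;
    double m_offset = 0.0;
};

// Returns true if the virtual time stays continuous across two source switches.
bool demoContinuousSwitch() {
    VirtualClockSketch vc;
    vc.updateSystem(1000.0);
    vc.updateNetwork(12.5);                                // simulation time, then frozen
    bool ok = std::fabs(vc.now() - 12.5) < 1e-9;

    vc.switchTo(VirtualClockSketch::Source::System);       // e.g. during model removal
    ok = ok && std::fabs(vc.now() - 12.5) < 1e-9;          // no jump at the switch
    vc.updateSystem(1000.25);
    ok = ok && std::fabs(vc.now() - 12.75) < 1e-9;         // advances with system time

    vc.switchTo(VirtualClockSketch::Source::Network);      // simulation clock is back
    ok = ok && std::fabs(vc.now() - 12.75) < 1e-9;         // still no jump
    vc.updateNetwork(12.6);
    ok = ok && std::fabs(vc.now() - 12.85) < 1e-9;         // advances with network time
    return ok;
}
```

How and when the plugin would trigger switchTo() on model removal is left open, as is the synchronisation between threads reading the clock during a switch.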
