
Make sure the zookeeper lock obtained by CachedWebCrawlerJob is always released when the job gets interrupted #1492

Open
marekhorst opened this issue Nov 18, 2024 · 1 comment

@marekhorst
Member

It turned out that when the first attempt of the CachedWebCrawlerJob failed due to a shuffle service connectivity issue:

2024-11-16 22:38:14,911 [shuffle-client-6-1] ERROR org.apache.spark.network.client.TransportResponseHandler  - Still have 1 requests outstanding when connection from eos-m2-sn03.ocean.icm.edu.pl/10.19.65.103:7337 is closed
2024-11-16 22:38:14,912 [dispatcher-event-loop-1] ERROR org.apache.spark.storage.BlockManager  - Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds...
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.

the executor the job was running on gets killed in a way that the lock release operation defined in the finally block is not executed. This results in the 2nd attempt stalling while waiting to obtain the lock that was never released by the 1st attempt.
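For context, the lock handling in question follows the usual acquire-then-release-in-finally pattern inside the job class. A minimal sketch, assuming a Curator-based zookeeper lock (the class and method names below are illustrative; the actual lock utility used by the job may differ):

	import org.apache.curator.framework.CuratorFramework;
	import org.apache.curator.framework.CuratorFrameworkFactory;
	import org.apache.curator.framework.recipes.locks.InterProcessSemaphoreMutex;
	import org.apache.curator.retry.ExponentialBackoffRetry;

	public class CachedWebCrawlerJobLockSketch {

	    public static void main(String[] args) throws Exception {
	        // hypothetical connection string and lock node; the real job obtains these from its parameters
	        CuratorFramework zkClient = CuratorFrameworkFactory.newClient(
	                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
	        zkClient.start();

	        InterProcessSemaphoreMutex lock = new InterProcessSemaphoreMutex(zkClient, "/cache/webcrawler");
	        lock.acquire();
	        try {
	            runCrawl();
	        } finally {
	            // this is the release that never happens when the JVM is killed mid-run,
	            // which is the failure mode described above
	            lock.release();
	            zkClient.close();
	        }
	    }

	    private static void runCrawl() {
	        // placeholder for the actual Spark-based crawling and cache update logic
	    }
	}

Whatever the concrete implementation, nothing in the JVM guarantees that a finally block runs when the driver or executor process is killed externally, hence the workflow-level fallback proposed below.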

Originally described in:
https://support.openaire.eu/issues/10157#note-2

@marekhorst
Member Author

marekhorst commented Nov 18, 2024

One viable solution is to replicate the finally block execution at the oozie workflow definition level, similarly to the metadata extraction workflow, where lock management logic was handled solely within the oozie workflow definition because of the MapReduce nature of the metadata extraction job.

CachedWebCrawlerJob, by contrast, is a Spark job, which allowed encoding the lock handling logic within the same Java class, but it turns out this does not guarantee that the finally block is executed, because the executor may be forcibly killed.

This means introducing the following action:

	<action name="release-lock-and-fail">
		<java>
			<main-class>eu.dnetlib.iis.common.java.ProcessWrapper</main-class>
			<arg>${lock_managing_process}</arg>
			<arg>-Pzk_session_timeout=${zk_session_timeout}</arg>
			<arg>-Pnode_id=${cache_location}</arg>
			<arg>-Pmode=release</arg>
		</java>
		<ok to="fail" />
		<error to="fail" />
	</action>

which should be referenced as the error transition of the preceding action that triggers the CachedWebCrawlerJob execution.
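For illustration, the wiring of the preceding action could look roughly as follows (the action and transition names other than release-lock-and-fail are hypothetical placeholders; only the error transition matters here):

	<action name="webcrawler">
		<spark xmlns="uri:oozie:spark-action:0.2">
			<!-- spark-submit configuration launching CachedWebCrawlerJob goes here -->
		</spark>
		<ok to="webcrawler-finished" />
		<error to="release-lock-and-fail" />
	</action>

This way a failed run still ends in the fail node, but the zookeeper lock gets released first by the release-lock-and-fail action regardless of whether the finally block in the java class ever ran.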
