Make LzoTextInputFormat#listStatus thread safe for concurrent call #120

Open: xq262144 wants to merge 1 commit into base: master

Conversation

@xq262144 xq262144 commented Oct 8, 2016

DeprecatedLzoTextInputFormat and LzoTextInputFormat are not thread-safe.
This change uses ConcurrentHashMap instead of HashMap.
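
As a minimal sketch of the kind of change described, assuming the input format keeps a per-file index map as instance state: the class, field, and method names below are hypothetical and are not taken from the hadoop-lzo source or from this pull request's diff.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for an input format that caches per-file index data.
// The names are illustrative only and do not come from hadoop-lzo.
class IndexCache {
    // ConcurrentHashMap tolerates concurrent put/get calls without corrupting
    // its internal structure, which a plain HashMap does not guarantee.
    private final Map<String, Long> indexes = new ConcurrentHashMap<>();

    // Record the index size for a file path.
    void record(String path, long indexSize) {
        indexes.put(path, indexSize);
    }

    // Return the recorded index size, or null if the path is unknown.
    Long lookup(String path) {
        return indexes.get(path);
    }

    int size() {
        return indexes.size();
    }
}
```

One point to check with such a swap: ConcurrentHashMap rejects null keys and null values, so any code path that stored null in the original HashMap would need adjusting.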

@sjlee
Collaborator

sjlee commented Nov 9, 2016

Thanks @xq262144 for your contribution.

While I understand the desire to make these classes thread safe, I don't think there is any general guarantee or expectation that an InputFormat or OutputFormat class should be thread safe. In what case are you running into thread-safety issues? Can you not make it work by simply instantiating a new instance for each thread?
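
One way to read the "new instance for each thread" suggestion is sketched below with a `ThreadLocal`. This is only an illustration under assumptions: `UnsafeFormat` and `PerThreadFormats` are hypothetical names, and a caller could equally just call `new` inside each thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, non-thread-safe class standing in for an input format
// that keeps mutable state in a plain HashMap.
class UnsafeFormat {
    private final Map<String, Long> indexes = new HashMap<>();

    void record(String path, long size) {
        indexes.put(path, size);
    }

    int size() {
        return indexes.size();
    }
}

class PerThreadFormats {
    // Each thread lazily receives its own UnsafeFormat, so the HashMap
    // inside it is never touched by two threads at once.
    private static final ThreadLocal<UnsafeFormat> FORMAT =
        ThreadLocal.withInitial(UnsafeFormat::new);

    static UnsafeFormat get() {
        return FORMAT.get();
    }
}
```

Repeated calls to `PerThreadFormats.get()` on one thread return the same object, while a different thread gets a separate one. This only helps when the caller controls how instances are obtained.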

@xq262144
Author

@sjlee Thank you for your response.

I found this thread-safety issue while trying to integrate DeprecatedLzoTextInputFormat with Hive: my Hive job hung in a HashMap.put call on a corrupted HashMap.

Then I analyzed the call stack and found that Hive has an input format caching mechanism here: https://github.com/apache/hive/blob/41fbe7bb7d4ad1eb0510a08df22db59e7a81c245/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L250. As a result, it requires input formats to be thread-safe.
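
To illustrate why instance caching implies a thread-safety requirement, here is a simplified sketch of a class-keyed instance cache. It is not Hive's code; `FormatCache` and `getOrCreate` are hypothetical names chosen for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified illustration of a cache that hands out one instance per class.
// This is an assumption-based sketch, not the Hive implementation.
class FormatCache {
    private static final Map<Class<?>, Object> CACHE = new ConcurrentHashMap<>();

    // Returns the shared instance for the given class, creating it with the
    // no-argument constructor on first request.
    static <T> T getOrCreate(Class<T> type) {
        Object instance = CACHE.computeIfAbsent(type, t -> {
            try {
                return t.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot instantiate " + t, e);
            }
        });
        return type.cast(instance);
    }
}
```

Every caller that asks for the same class receives the same object, so any mutable state inside that object (such as a plain HashMap) can be reached by several threads at once even though the cache itself is safe.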

To solve this, we could either add an exception for the lzo input formats in Hive or make the lzo input formats thread-safe.

I chose to make the lzo input formats thread-safe rather than modify Hive, because hadoop-lzo is a rather small code base and simpler to upgrade in a production environment.

And since hadoop-lzo is so widely used, I suppose making it thread-safe is not a bad idea. :)

@sjlee
Collaborator

sjlee commented Nov 22, 2016

Thanks for the explanation. It is iffy that Hive caches input format instances and lets them be used by multiple concurrent threads. Hadoop-lzo might not be the only input format that has issues with this.

Have you tried opening a discussion with the Hive community? While I'm not necessarily against making this change (it seems fairly low risk), I'm more curious about what the Hive community has to say.

@CLAassistant

CLAassistant commented Jul 18, 2019

CLA assistant check
All committers have signed the CLA.

4 participants