Implementing an opaque transactional HdfsState #23

Open
rangatdt opened this issue Jan 9, 2015 · 1 comment

Comments

rangatdt commented Jan 9, 2015

Thanks very much for this contribution.

First, could you confirm that the current HdfsState implementation is a non-transactional state, and so there is no guarantee that data gets written to HDFS exactly once?

Second, I wanted your opinion on implementing an opaque transactional state for writes to HDFS:

A naive implementation that keeps the state of the file as of the previous batch separate from the current file will likely be expensive without support for file appends. For instance, in a simple implementation, every batch of writes ends up in its own file, which gives downstream consumers no batching efficiencies.

An alternative implementation could keep the file f and the previous batch b as two separate files, where f always stays open while b is written afresh and closed for every batch. The name of the file storing b could itself be the "previous tx id". When a write arrives with a different tx id, the execute() function reads b, writes its contents to f, and then deletes b. A new file named after the current tx id is then created, and it stores the current batch as the new b.

At the time of rotation, b is read and written to f, which is then rotated away. The file storing b is emptied rather than deleted, since its name (the tx id) is still needed for the next write.

When a write arrives with the same tx id as the previous attempt, b is overwritten with the current batch's data. In particular, f is untouched.
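
To make the proposed flow concrete, here is a minimal in-memory sketch of the f / b scheme described above. It is only an illustration: Python lists stand in for HDFS files, and the class name `TwoFileHdfsModel` and its fields are hypothetical, not part of storm-hdfs or Trident. It does not model crashes in the middle of a write, which a real implementation would have to handle.

```python
class TwoFileHdfsModel:
    """In-memory stand-in for the f / b scheme; lists play the role of HDFS files."""

    def __init__(self):
        self.f = []          # the long-lived open data file f
        self.b = []          # contents of the pending-batch file b
        self.b_txid = None   # name of the b file, i.e. the "previous tx id"
        self.rotated = []    # closed files that have been rotated away

    def execute(self, txid, records):
        if self.b_txid is not None and txid != self.b_txid:
            # A different tx id means the previous batch is final: copy b into f.
            self.f.extend(self.b)
        # New tx id or replay of the same one: b holds only this attempt's data.
        self.b = list(records)
        self.b_txid = txid

    def rotate(self):
        # Fold the pending batch into f, close f, and start a fresh file.
        self.f.extend(self.b)
        self.rotated.append(self.f)
        self.f = []
        # Empty b but keep its name so the next write can still compare tx ids.
        self.b = []


state = TwoFileHdfsModel()
state.execute(1, ["a", "b"])
state.execute(2, ["c"])        # first attempt of tx 2
state.execute(2, ["c", "d"])   # replay of tx 2 with different contents
state.execute(3, ["e"])
state.rotate()
```

In this run the replayed attempt of tx 2 replaces the first attempt inside b, so only the final contents of each batch reach the rotated file.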

I would appreciate your feedback.
Thanks,
-Ranga

rangatdt (Author) commented

In case someone else is interested in what came of this, I have documented our findings and eventual approach at http://blog.thedatateam.in/2015/02/guaranteeing-exactly-once-load.html
