Implementing an opaque transactional HdfsState #23

rangatdt · 2015-01-09T06:00:10Z

Thanks much for this contrib.

First, could you confirm that the current HdfsState implementation is a non-transactional state, and so there is no guarantee that data gets written to HDFS exactly once?

Second, I wanted your opinion on implementing an opaque transactional state for writes to HDFS:

A naive implementation that keeps the state of the file as of the previous batch separate from the current file will likely be expensive without support for file appends. For instance, in a sample implementation, every batch of writes ends up in its own file, with no batching efficiencies for downstream consumers.

An alternative implementation could keep the file f and the previous batch b as two separate files, where f is always open while b is written afresh and closed for every batch. The name of the file storing b could itself be the "previous tx id". When the current write arrives with a different tx id, the execute() function reads b, writes its contents to f, and then deletes b. A new file named with the current tx id is created, and this stores the current b.

At the time of rotation, b is read and written to f, which is then rotated away. The file storing b is emptied rather than deleted, since we will still need its tx id for the next write.

When the current write arrives with the same tx id as the previous attempt, b is overwritten with the current batch's data. In particular, f is untouched.
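To make the scheme concrete, here is a minimal sketch of the f/b bookkeeping in Python. It runs against a local directory rather than HDFS, and the class name `OpaqueFileState`, the `b.<tx id>` and `f.data` file names, and the `write`/`rotate` methods are all hypothetical, chosen only for illustration. It is not the HdfsState API, and it ignores crash recovery between the "write b into f" and "delete b" steps.

```python
import os
import tempfile


class OpaqueFileState:
    """Two-file scheme: f is the main file, b holds only the latest batch.

    All names here are hypothetical; a local directory stands in for HDFS.
    """

    def __init__(self, directory):
        self.directory = directory
        self.main_path = os.path.join(directory, "f.data")  # the file f
        self.rotations = 0

    def _pending(self):
        """Return (tx_id, path) of the stored batch file b, or (None, None)."""
        for name in os.listdir(self.directory):
            if name.startswith("b."):
                return name[2:], os.path.join(self.directory, name)
        return None, None

    def _fold_into_main(self, batch_path):
        """Append the contents of b to f."""
        with open(batch_path, "rb") as src, open(self.main_path, "ab") as dst:
            dst.write(src.read())

    def write(self, tx_id, data):
        prev_tx, prev_path = self._pending()
        if prev_tx is not None and prev_tx != str(tx_id):
            # New tx id: the previous batch is final, so move it into f.
            self._fold_into_main(prev_path)
            os.remove(prev_path)
        # Same tx id (a replay) or a fresh one: (re)write b; f is untouched here.
        with open(os.path.join(self.directory, "b.%s" % tx_id), "wb") as b:
            b.write(data)

    def rotate(self):
        """Fold b into f, rotate f away, and keep an emptied b for its tx id."""
        _, prev_path = self._pending()
        if prev_path is not None:
            self._fold_into_main(prev_path)
            open(prev_path, "wb").close()  # truncate b but keep its name
        if not os.path.exists(self.main_path):
            return None
        self.rotations += 1
        rotated = "%s.%d" % (self.main_path, self.rotations)
        os.rename(self.main_path, rotated)
        return rotated


if __name__ == "__main__":
    state = OpaqueFileState(tempfile.mkdtemp())
    state.write(1, b"first attempt\n")
    state.write(1, b"replayed batch\n")  # same tx id: only b changes
    state.write(2, b"next batch\n")      # new tx id: batch 1 moves into f
    print(state.rotate())
```

In this sketch a replayed batch only ever touches b, so f never receives a batch until a later tx id confirms it.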

Appreciate your feedback
Thanks
-Ranga

rangatdt · 2015-02-15T11:47:20Z

In case someone else is interested in what came about from this, I have documented our findings and eventual approach over at http://blog.thedatateam.in/2015/02/guaranteeing-exactly-once-load.html
