Thanks much for this contrib.
First, could you confirm that the current HdfsState implementation is a non-transactional state, so there is no guarantee that data gets written to HDFS exactly once?
Second, I wanted your opinion on implementing an opaque transactional state for writes to HDFS:
A naive implementation that keeps the state of the file as of the previous batch separate from the current file will likely be expensive without support for file appends. For instance, in a sample implementation, every batch of writes ends up in its own file, with no batching efficiencies for downstream consumers.
An alternative implementation could keep the file f and the previous batch b as two separate files, where f is always open while b is written afresh and closed for every batch. The name of the file storing b could itself be the "previous tx id". When the current write arrives with a different tx id, the execute() function reads b, writes it to f, and then deletes b. A new file named with the current tx id is created, and it stores the current b.
At the time of rotation, b is read and written to f, which is then rotated away. The file storing b is emptied, since we will still need its tx id for the next write.
When the current write arrives with the same tx id as the previous attempt, b is overwritten with the current batch's data. In particular, f is untouched.
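To make the proposal concrete, here is a minimal sketch of the two-file scheme. It is only an illustration under assumptions: it uses java.nio.file on a local directory as a stand-in for HDFS, and the class name TwoFileOpaqueState, the "batch-" file prefix, the data.log file name, and the mainLines() helper are all made up for this example. It is not the HdfsState code, and f is reopened for each append instead of being held open.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.List;

/** Hypothetical sketch: local files stand in for HDFS files f and b. */
class TwoFileOpaqueState {
    private static final String PREFIX = "batch-"; // assumed naming: "batch-<tx id>"
    private final Path dir;
    private final Path f; // main file "f": only receives batches that are final

    TwoFileOpaqueState(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
        this.f = dir.resolve("data.log");
        if (Files.notExists(f)) Files.createFile(f);
    }

    /** Returns the single pending batch file "b", or null if there is none. */
    private Path pendingBatch() throws IOException {
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir, PREFIX + "*")) {
            for (Path p : ds) return p;
        }
        return null;
    }

    /** Appends the contents of b to f. */
    private void appendToMain(Path b) throws IOException {
        Files.write(f, Files.readAllBytes(b), StandardOpenOption.APPEND);
    }

    void execute(long txId, List<String> lines) throws IOException {
        String currentName = PREFIX + txId;
        Path b = pendingBatch();
        if (b != null && !b.getFileName().toString().equals(currentName)) {
            // Different tx id: the previous batch is final, so fold it into f.
            appendToMain(b);
            Files.delete(b);
        }
        // Same tx id (replay) or new batch: (over)write b only; f is untouched here.
        // Files.write defaults to CREATE + TRUNCATE_EXISTING + WRITE.
        Files.write(dir.resolve(currentName), lines);
    }

    void rotate(Path target) throws IOException {
        Path b = pendingBatch();
        if (b != null) {
            appendToMain(b);
            Files.write(b, new byte[0]); // empty b but keep its name (the tx id)
        }
        Files.move(f, target, StandardCopyOption.REPLACE_EXISTING);
        Files.createFile(f);
    }

    /** Helper for inspection only: the lines currently in f. */
    List<String> mainLines() throws IOException {
        return Files.readAllLines(f);
    }
}
```

Note that in this sketch the append to f and the delete of b are two separate steps, so a crash between them could still duplicate a batch. A real implementation on HDFS would need to handle that window, and whether it can be closed without append support is part of what I am asking about.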
Appreciate your feedback
Thanks
-Ranga