HDFS-17658. HDFS decommissioning does not consider if Under Construction blocks are sufficiently replicated which causes HDFS Data Loss #7179
base: trunk
Conversation
💔 -1 overall
This message was automatically generated.
Javadoc Failures are due to usage of
SpotBugs Failures:
Unit Test Report:
The 4 flaky tests are unrelated to this change.
💔 -1 overall
This message was automatically generated.
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
...ct/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
...dfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/UnderConstructionBlocks.java
```java
    ucBlocksByDatanode.getOrDefault(dn, Collections.emptyList());
final String ucBlocksString =
    ucBlocks.stream().map(Object::toString).collect(Collectors.joining(","));
LOG.info("Cannot decommission datanode {} with {} UC blocks: [{}]",
```
Will this info log become noisy if the client takes time to close the stream?
This log will be printed at most once per datanode per DatanodeAdminDefaultMonitor cycle. This means in a 1k datanode HDFS cluster, there could be up to 1k log lines printed every 30 seconds (if all the 1k datanodes have under construction blocks).
This matches the existing behaviour: if a decommissioning datanode has under-replicated blocks, then one log line is printed every DatanodeAdminDefaultMonitor cycle (for that datanode):
```java
  return;
}
if (reportingNode == null) {
  LOG.warn("Unexpected null input {}", reportingNode);
```
Can we have a better message here?
This code path should never be hit in practice, so I don't know what additional details/guidance we can add here. Please let me know if you have any specific thoughts/suggestions on this.
Since an "Unexpected null input" log is printed from several different methods in UnderConstructionBlocks, I have gone ahead & added the method name to the log so that they can be distinguished.
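For illustration, a minimal sketch of the idea, assuming SLF4J logging as used elsewhere in the class; the class and method names below are hypothetical stand-ins, and the exact message wording in the patch may differ:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class UnderConstructionBlocksLogExample {
  private static final Logger LOG =
      LoggerFactory.getLogger(UnderConstructionBlocksLogExample.class);

  // Including the method name makes otherwise identical warnings distinguishable.
  void addUcBlock(Object reportingNode, Object block) {
    if (reportingNode == null || block == null) {
      LOG.warn("UnderConstructionBlocks.addUcBlock: unexpected null input"
          + " [reportingNode={}, block={}]", reportingNode, block);
      return;
    }
    // ... track the under construction block replica ...
  }
}
```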
...dfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/UnderConstructionBlocks.java
@KevinWikant Thanks for your report and contribution. Just wondering which version you deployed.
- Please consider adding javadoc in the correct format to all the methods and detailed comments.
- I have tried to comment two methods, please consider following a similar pattern for the other methods.
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
```
@@ -190,6 +190,7 @@ public class BlockManager implements BlockStatsMXBean {

  private final PendingDataNodeMessages pendingDNMessages =
      new PendingDataNodeMessages();
  private final UnderConstructionBlocks ucBlocks;
```
Is this UnderConstruction or UnderReplication?
Good question. I am definitely open to changing the terminology here.
I don't know if UnderReplication is the correct term though; I think it may be more applicable to a block which needs to be replicated asynchronously by the Namenode to meet the replication factor (i.e. due to datanode decommissioning or datanode failures).
From the Namenode logs we can see the block replica state is reported as RBW (replica being written) / RECEIVING_BLOCK:
2024-12-05 15:51:09,399 DEBUG blockmanagement.BlockManager: Reported block blk_1073741825_1001 on 172.31.93.89:9866 size 268435456 replicaState = RBW
2024-12-05 15:51:09,399 DEBUG BlockStateChange: BLOCK* block RECEIVING_BLOCK: blk_1073741825_1001 is received from 172.31.93.89:9866
I took the term UnderConstruction from here in the code (Line 305 in cd2cffe):
if (bc.isUnderConstruction() && block.equals(bc.getLastBlock())) {
However, upon further inspection it seems this term is related to a BlockCollection (as opposed to a block replica).
Let me know your thoughts. I can refactor UnderConstructionBlocks to something like RbwBlocks or ReplicaBeingWrittenBlocks if this is a more accurate term.
I am seeing that "UnderConstruction" terminology is already used in the code for both blocks & block replicas:
- https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/ReplicaUnderConstruction.java
- https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockUnderConstructionFeature.java
Therefore, I do think that the name UnderConstructionBlocks aligns with existing terminology already used in the code base.
...ct/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
...dfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/UnderConstructionBlocks.java
Thank you @vnhive @Hexiaoqiao & @shameersss1 for the feedback on this change. Apologies for the delayed response, I was out travelling for the past several days. I will address the remaining Javadoc & Spotbugs warnings. I will also address all of the above conversations/comments. Note that the unit test failures are unrelated:
@Hexiaoqiao the testing was conducted against Hadoop version 3.4.0. Please let me know if you have any questions or comments regarding the details/description in the Pull Request and/or the JIRA.
💔 -1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
One more Javadoc issue to fix; interestingly, when I run
One of the unit test failures is potentially related for
Will wait to see if there are any more follow-up comments before pushing a new revision.
Javadoc Failure
I have fixed the last javadoc warning:
I am probably missing something here, but when I previously ran javadoc on my local I did not see this error:
Unit Test Failure - testDecommissionWithUCBlocksFeatureDisabledAndDefaultMonitor
I ran the following command 100 times in a loop:
I then ran the following command 50 times in a loop:
I did not see any failures for this test. I suspect the failure is flaky; proposing that we do not block this change on this potentially flaky test.
Also, about 1 in 5 test runs I was seeing failures in
Unit Test Failure - testDecommissionWithUCBlocksFeatureEnabledAndBackoffMonitor
When I run this test it fails. The root cause is as follows:
Relevant logs:
Note how the behaviour differs between the datanodes in the logs above. The solution I have implemented is to only add the replica to UnderConstructionBlocks.
🎊 +1 overall
This message was automatically generated.
Problem
Problem background:
There is logic in the DatanodeAdminManager/DatanodeAdminMonitor to avoid transitioning datanodes to decommissioned state when they have open (i.e. Under Construction) blocks:
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java (Line 357 in cd2cffe)
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java (Line 305 in cd2cffe)
This logic does not work correctly because, as mentioned above, DatanodeAdminMonitor currently only considers blocks in the DatanodeDescriptor StorageInfos, which do not include Under Construction blocks for which the DFSOutputStream has not been closed yet.
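As a simplified illustration of the gap (not Hadoop's actual classes; the names below are hypothetical stand-ins for DatanodeDescriptor/DatanodeStorageInfo), a decommission check that only iterates the finalized replicas tracked in storage can report a datanode as safe to decommission even while it still holds an under construction replica:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of the problem described above.
class SimplifiedDatanode {
  final Set<Long> finalizedBlockIds = new HashSet<>();        // analogous to StorageInfos
  final Set<Long> underConstructionBlockIds = new HashSet<>(); // not tracked pre-patch

  // Pre-patch style check: only finalized replicas are considered.
  boolean safeToDecommissionIgnoringUc(Set<Long> sufficientlyReplicated) {
    return sufficientlyReplicated.containsAll(finalizedBlockIds);
  }

  public static void main(String[] args) {
    SimplifiedDatanode dn = new SimplifiedDatanode();
    dn.underConstructionBlockIds.add(1073741825L); // open block, stream not closed
    // No finalized blocks to check, so the datanode looks safe even though
    // terminating it could lose the only copies of the open block.
    System.out.println(dn.safeToDecommissionIgnoringUc(new HashSet<>())); // prints true
  }
}
```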
There is also logic in the HDFS DataStreamer client which will replace bad/dead datanodes in the block write pipeline. Note that:
Overall, the Namenode should not put datanodes with open blocks into decommissioned state & hope that the DataStreamer client is able to replace them when the decommissioned datanodes are terminated. This will not work depending on the timing & therefore is not a solution which guarantees correctness.
The Namenode needs to honor the rule that "a datanode should only enter decommissioned state if all the blocks on the datanode are sufficiently replicated to other live datanodes", even for blocks which are currently Under Construction.
Potential Solutions
One possible opinion is that if the DFSOutputStream has not been successfully closed yet, then the client should be able to replay all the data if there is a failure. The client should not have any expectation that the data is committed to HDFS until the DFSOutputStream is closed. There are a few reasons I do not think this makes sense:
Another possible option that comes to mind is to add blocks to StorageInfos before they are finalized. However, this change is also likely to have wider implications on block management.
Without modifying any existing block management logic, we can add a new data structure (UnderConstructionBlocks) which temporarily tracks the Under Construction blocks in-memory until they are committed/finalized & added to the StorageInfos.
Solution
Add a new data structure (UnderConstructionBlocks) which temporarily tracks the Under Construction blocks in-memory until they are committed/finalized & added to the StorageInfos.
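As a rough, self-contained sketch of the idea (not the actual patch; the class, field, and method names below are illustrative, and the real class uses HDFS types such as DatanodeDescriptor and Block rather than Strings/longs), the structure is essentially a per-datanode map that replicas are added to when reported as RBW and removed from when the block is committed/finalized:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only.
class UcBlocksSketch {
  private final Map<String, List<Long>> ucBlocksByDatanode = new HashMap<>();

  // Called when a datanode reports a replica in RBW / RECEIVING_BLOCK state.
  synchronized void addUcBlock(String datanode, long blockId) {
    ucBlocksByDatanode.computeIfAbsent(datanode, k -> new ArrayList<>()).add(blockId);
  }

  // Called when the replica is committed/finalized and appears in the StorageInfos.
  synchronized void removeUcBlock(String datanode, long blockId) {
    List<Long> blocks = ucBlocksByDatanode.get(datanode);
    if (blocks != null) {
      blocks.remove(Long.valueOf(blockId));
      if (blocks.isEmpty()) {
        ucBlocksByDatanode.remove(datanode);
      }
    }
  }

  // The DatanodeAdminMonitor can consult this before decommissioning a datanode.
  synchronized Collection<Long> getUcBlocks(String datanode) {
    return ucBlocksByDatanode.getOrDefault(datanode, Collections.emptyList());
  }
}
```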
Pros:
Implementation Details
Feature is behind a configuration "dfs.namenode.decommission.track.underconstructionblocks" which is disabled by default.
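For reference, a minimal sketch of how such a flag is typically wired up; the configuration key is the one introduced by this change, but the constant names and surrounding class are illustrative assumptions:

```java
import org.apache.hadoop.conf.Configuration;

public class UcTrackingConfigExample {
  // Constant names are illustrative, not necessarily those used in DFSConfigKeys.
  static final String DFS_NAMENODE_DECOMMISSION_TRACK_UC_BLOCKS_KEY =
      "dfs.namenode.decommission.track.underconstructionblocks";
  static final boolean DFS_NAMENODE_DECOMMISSION_TRACK_UC_BLOCKS_DEFAULT = false;

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Disabled unless explicitly enabled (e.g. in hdfs-site.xml).
    boolean trackUcBlocks = conf.getBoolean(
        DFS_NAMENODE_DECOMMISSION_TRACK_UC_BLOCKS_KEY,
        DFS_NAMENODE_DECOMMISSION_TRACK_UC_BLOCKS_DEFAULT);
    System.out.println("UC block tracking enabled: " + trackUcBlocks);
  }
}
```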
In the regular case, when a DFSOutputStream is closed it takes 1-2 seconds for the block replicas to be removed from UnderConstructionBlocks & added to the StorageInfos. Therefore, datanode decommissioning is only blocked until the DFSOutputStream is closed & the write operation is finished, after this time there is minimal delay in unblocking decommissioning.
In the unhappy case, when an HDFS client fails & the DFSOutputStream is never closed, it takes dfs.namenode.lease-hard-limit-sec = 20 minutes (the default) before the lease expires & the Namenode recovers the block. As part of block recovery, the block replicas are removed from UnderConstructionBlocks & added to the StorageInfos. Therefore, if an HDFS client fails it will (by default) take 20 minutes before decommissioning becomes unblocked.
The UnderConstructionBlocks data structure is in-memory only & therefore if the Namenode is restarted then it will lose track of any previously reported Under Construction blocks. This means that datanodes can be decommissioned with Under Construction blocks if the Namenode is restarted (which makes HDFS data loss & write failures possible again).
Testing shows that UnderConstructionBlocks should not leak any Under Construction blocks (i.e. which are never removed). However, as a safeguard to monitor for this issue, any block replica open for over 2 hours will have a WARN log printed by the Namenode every 30 minutes which mentions how long the block has been open for.
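A minimal, self-contained sketch of that kind of safeguard, with the thresholds taken from the description above; the class and method names are illustrative, not the patch's actual code:

```java
import java.util.concurrent.TimeUnit;

// Illustrative staleness check: warn once a replica has been open for over
// 2 hours, and then at most once every 30 minutes afterwards.
class StaleUcBlockWarning {
  private static final long WARN_AFTER_MS = TimeUnit.HOURS.toMillis(2);
  private static final long WARN_INTERVAL_MS = TimeUnit.MINUTES.toMillis(30);

  private final long firstReportedMs;
  private long lastWarnedMs = -1;

  StaleUcBlockWarning(long firstReportedMs) {
    this.firstReportedMs = firstReportedMs;
  }

  // Returns true if a WARN log should be emitted at time 'nowMs'.
  boolean shouldWarn(long nowMs) {
    long openForMs = nowMs - firstReportedMs;
    if (openForMs < WARN_AFTER_MS) {
      return false;
    }
    if (lastWarnedMs < 0 || nowMs - lastWarnedMs >= WARN_INTERVAL_MS) {
      lastWarnedMs = nowMs;
      return true;
    }
    return false;
  }
}
```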
The implementation of UnderConstructionBlocks was borrowed from existing code in PendingDataNodeMessages. PendingDataNodeMessages is already used by the BlockManager to track in-memory the block replicas which have been reported to the standby namenode out-of-order.
How was this patch tested?
Test Methodology
I have used the following HDFS client code to test various scenarios on a Hadoop 3.4.0 cluster with 6 datanodes:
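The original client code is not reproduced here; below is a rough sketch of what such a test client might look like (the path, write size, and sleep duration are assumptions, not the code actually used), using standard FileSystem APIs to keep a DFSOutputStream open after an hflush:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch of a test client that holds a block open (Under
// Construction) for a long time before closing the stream.
public class OpenBlockTestClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/uc-block-test");  // assumed test path

    FSDataOutputStream out = fs.create(path, true);
    byte[] data = new byte[1024 * 1024];
    out.write(data);
    // hflush makes the data visible to readers but leaves the block
    // Under Construction (the stream is still open).
    out.hflush();

    // Keep the stream open while datanodes are decommissioned/terminated.
    Thread.sleep(30 * 60 * 1000L);

    // Closing the stream finalizes the block replicas.
    out.close();
    fs.close();
  }
}
```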
The code enables testing 2 key scenarios:
Both these scenarios can also be tested for:
Test Results Summary
Tests which show the behaviour before & after this change:
Tests which show how long these changes can possibly block datanode decommissioning:
After dfs.namenode.lease-hard-limit-sec = 20 minutes from the time the client failed, the Namenode transitions the block replicas into recovery & the decommissioning process can then complete as usual.
In other words:
The above scenarios were tested with various different configurations:
Below are the detailed logs for test case 4 (before this change) and test case 5 (after this change).
Test #4 Detailed Results, Before this Change
The following are the HDFS client logs for the duration of the test:
First step is to identify the block locations:
Then we wait for the block to be closed & re-opened. Once this is completed, we can start decommissioning the 3 datanodes with under construction block replicas:
Check the decommissioning status of the datanodes to confirm they are decommissioned:
Then we stop/terminate all 3 decommissioned datanodes at the same time; this is expected to be safe since they are in decommissioned state:
After this occurs, we see HDFS write failures from the client:
And we also see the underlying HDFS data is lost:
Introspecting further into the logs, we see that when the block was re-opened:
Test #5 Detailed Results, After this Change
The following are the HDFS client logs for the duration of the test:
First step is to identify the block locations:
Then we wait for the block to be closed & re-opened. Once this is completed, we can start decommissioning the 3 datanodes with under construction block replicas:
Check the decommissioning status of the datanodes to confirm they are not decommissioned. Wait for the DFSOutputStream to be closed & then confirm decommissioning finishes after this time:
Then we stop/terminate all 3 decommissioned datanodes at the same time; this is expected to be safe since they are in decommissioned state:
There is no HDFS data loss because the block replicas are moved to other live datanodes before decommissioning finishes:
Introspecting further into the logs, we see that:
For code changes:
If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?