TFA-FIX: CEPH-83595932 - To verify crashes while executing drain and mgr failover commands #4230
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: SrinivasaBharath. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
force-pushed from 35175b4 to 51b25da
        f"OSD remove operation is in progress {osd_id}\nOperations: {entry}"
    )
except json.JSONDecodeError:
    log.info(f"The OSD removal is completed on OSD : {osd_id}")
This could either mean that the OSD removal is complete, or that no OSD removal was started in the first place.
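The reviewer's point can be illustrated with a small parser sketch. `removal_queue` is a hypothetical helper (not the cephci API); the plain-text message is what `ceph orch osd rm status` prints when the removal queue is empty:

```python
import json

def removal_queue(raw: str) -> list:
    # 'ceph orch osd rm status -f json' prints a JSON list while removals
    # are queued, but plain text ("No OSD remove/replace operations
    # reported") once the queue is empty -- so json.JSONDecodeError alone
    # cannot tell "removal finished" apart from "removal never started".
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return []

# A queued removal parses normally; an empty queue yields the fallback.
print(removal_queue('[{"osd_id": 4, "drain_done": false}]'))
print(removal_queue("No OSD remove/replace operations reported"))
```

Distinguishing the two cases would require the caller to track whether a removal was actually initiated before the status check.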
@@ -135,6 +135,12 @@ def run(ceph_cluster, **kw):
log.info(
    f"The OSDs in the drain node before starting the test - {osd_count_before_test} "
)
cmd_set_unmanaged = (
This expects that the OSD spec on the cluster is all-available-devices.
Can we write a generic method that exports the spec from the cluster, adds a new unmanaged=True
key, and applies the spec?
Something similar to set_mon_service_managed_type()
present in the monitor workflows.
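The suggested generic method could look roughly like the sketch below. `set_service_managed_type` and the spec shape are assumptions modeled on the reviewer's description and on `set_mon_service_managed_type()`, not the actual cephci API:

```python
def set_service_managed_type(spec: dict, unmanaged: bool) -> dict:
    # Hypothetical helper: take a spec exported via
    # 'ceph orch ls <service_type> --export', toggle the 'unmanaged'
    # flag, and return the spec ready to re-apply with 'ceph orch apply'.
    updated = dict(spec)
    if unmanaged:
        updated["unmanaged"] = True
    else:
        updated.pop("unmanaged", None)  # absent key == managed (default)
    return updated

# Works for any exported spec, not just all-available-devices:
osd_spec = {"service_type": "osd", "service_id": "all-available-devices",
            "placement": {"host_pattern": "*"}}
print(set_service_managed_type(osd_spec, unmanaged=True))
```

Because the spec is exported first, the method does not assume any particular OSD deployment style on the cluster.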
New method is created and included.
@@ -152,6 +158,12 @@ def run(ceph_cluster, **kw):
    "The traceback messages are noticed in logs.The error snippets are noticed in the MGR logs"
)
return 1
cmd_unset_unmanaged = (
Same as above. let's use the new method that will be created.
Used new method.
Overall looks good.
A new method to set services to managed/unmanaged (true/false) should be created and used here.
"This pull request now has conflicts with the target branch. Could you please resolve conflicts and force push the corrected changes?"
force-pushed from 4b5e8fa to ffb1680
force-pushed from dee438f to 01241e3
Preempt scrub fix pass log - http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-HWPGNS
ceph/rados/utils.py
Outdated
@@ -45,7 +45,6 @@ def set_osd_devices_unmanaged(ceph_cluster, osd_id, unmanaged):
        break

if not service_name:
    log.error(f"No orch service found for osd: {osd_id}")
@SrinivasaBharath please explain why we are making this change?
It is a mistake from my side. I reverted the changes.
log.debug(
    f"Setting the {service_type} service as unmanaged by cephadm. current status : {out}"
)
out["unmanaged"] = "false"
Here instead of explicitly setting it to False, we should remove the "unmanaged" key from the dictionary if it exists
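A minimal illustration of the reviewer's point: assigning the string `"false"` leaves a truthy value in the spec, whereas removing the key falls back to cephadm's default (managed):

```python
# Setting the key to the string "false" leaves a truthy value in the spec;
# popping it instead restores the default behaviour.
spec = {"service_type": "mgr", "unmanaged": True}

spec["unmanaged"] = "false"
print(bool(spec["unmanaged"]))   # non-empty strings are truthy

spec.pop("unmanaged", None)      # safe even when the key is absent
print("unmanaged" in spec)
```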
This should be fine. It would not make a difference to the workflow.
ceph/rados/core_workflows.py
Outdated
time.sleep(10)
# Checking for the unmanaged setting on service
cmd = "ceph orch ls"
Please replace the command with "ceph orch ls {service_type}" to avoid the for loop traversal
[ceph: root@ceph-hakumar-ryth74-node1-installer /]# ceph orch ls mon -f json-pretty
[
{
"placement": {
"label": "mon"
},
"service_name": "mon",
"service_type": "mon",
"status": {
"created": "2024-11-15T20:59:25.948847Z",
"last_refresh": "2024-11-22T10:12:24.482009Z",
"running": 3,
"size": 3
}
}
]
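Filtering on the server side, as suggested, removes the client-side loop. A rough sketch follows; `run_cmd` is a stand-in for whatever helper executes the cephadm shell command, and `fake_run` simulates output shaped like the json-pretty sample above:

```python
import json

def get_orch_service(run_cmd, service_type: str) -> dict:
    # 'ceph orch ls <service_type> -f json' returns only the matching
    # service, so no traversal of the full 'ceph orch ls' output is needed.
    out = run_cmd(f"ceph orch ls {service_type} -f json")
    services = json.loads(out)
    return services[0] if services else {}

# Simulated command output for illustration only.
fake_run = lambda cmd: json.dumps(
    [{"service_name": "mon", "service_type": "mon", "unmanaged": True}]
)
print(get_orch_service(fake_run, "mon").get("unmanaged"))
```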
replicated_config = config.get("replicated_pool")
pool_name = replicated_config["pool_name"]
- active_osd_list = rados_obj.get_osd_list(status="up")
+ active_osd_list = rados_obj.get_active_osd_list()
Please revert the change in this line to "active_osd_list = rados_obj.get_osd_list(status="up")"
The get_active_osd_list() method no longer exists
get_active_osd_list is a local method that combines the "in" and "up" OSD lists.
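Based on the author's description (a local helper combining the "in" and "up" lists), the logic is presumably an intersection. A hypothetical sketch, not the actual implementation:

```python
def get_active_osd_list(up_osds, in_osds):
    # An OSD serves I/O only when it is both 'up' (daemon running) and
    # 'in' (mapped to data), so intersect the two lists.
    return sorted(set(up_osds) & set(in_osds))

print(get_active_osd_list([0, 1, 2, 4], [1, 2, 3, 4]))  # [1, 2, 4]
```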
@@ -133,8 +134,10 @@ def run(ceph_cluster, **kw):
try:
    osd_count_before_test = get_node_osd_list(rados_obj, ceph_nodes, drain_host)
    log.info(
-       f"The OSDs in the drain node before starting the test - {osd_count_before_test} "
+       f"st The OSDs in the drain node before starting the te- {osd_count_before_test} "
Please fix typo
Fixed the typo
@@ -194,7 +200,7 @@ def run(ceph_cluster, **kw):
    return 1

if bug_exists:
-   active_osd_list = rados_obj.get_osd_list(status="up")
+   active_osd_list = rados_obj.get_active_osd_list()
Please revert to using get_osd_list(status="up")
Done.
@@ -85,7 +85,7 @@ def run(ceph_cluster, **kw):

log_lines = [
    "head preempted",
-   "WaitReplicas::react(const GotReplicas&) PREEMPTED",
+   "WaitReplicas::react(const GotReplicas&) PREEMPTED!",
@SrinivasaBharath I request more clarity here.
The comparison happening in def verify_preempt_log is a substring check (`not in`).
So if a subset of the sentence was not being found in the log lines, how is adding an additional character '!' going to make any difference?
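The objection can be shown directly: with a substring (`in`/`not in`) check, any log line containing the longer pattern necessarily contains the shorter one, so appending `!` can only make matching stricter, never looser (the `line` value below is illustrative, not a real log excerpt):

```python
# A line that contains the longer pattern always contains the shorter one,
# so tightening the pattern cannot make a failing substring check pass.
line = "scrubber: WaitReplicas::react(const GotReplicas&) PREEMPTED!"

short = "WaitReplicas::react(const GotReplicas&) PREEMPTED"
long_ = "WaitReplicas::react(const GotReplicas&) PREEMPTED!"

print(short in line, long_ in line)        # both checks agree on this line
print(long_ in "PREEMPTED without bang")   # the '!' only narrows matches
```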
I reverted the changes. The fix will be provided in a separate PR.
force-pushed from e8e6839 to d91abdf
replicated_config = config.get("replicated_pool")
pool_name = replicated_config["pool_name"]
- active_osd_list = rados_obj.get_osd_list(status="up")
+ active_osd_list = get_active_osd_list(rados_obj)
reuse existing method
Used the existing method.
force-pushed from dc036db to 73598f5
ceph/rados/core_workflows.py
Outdated
@@ -4615,3 +4615,71 @@ def get_rados_df(self, pool_name: str = None):
    out = self.run_ceph_command(cmd=_cmd, client_exec=True)

    return out["pools"][0] if pool_name else out

def get_service_list(self, service_type=None):
Please check existing method def list_orch_services(self, service_type=None) -> list
Updated the code.
… failover commands and preempt scrub fix Signed-off-by: Srinivasa Bharath Kanta <[email protected]>
force-pushed from 7d7078c to 1c7a594
Pass log with Reef - http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-N6PKVO/
Description
The test case execution failed; the failure log is -
http://magna002.ceph.redhat.com/cephci-jenkins/results/openstack/RH/8.0/rhel-9/Regression/19.2.0-52/rados/45/tier-2_rados_test-drain-customer-issue
Jira tasks to track the issue are -