
TFA-FIX:CEPH-83595932-To verify crashes while executing drain and mgr failover commands #4230

Open
wants to merge 1 commit into master from rados_tfa_wip

Conversation

SrinivasaBharath
Contributor

@SrinivasaBharath commented Nov 18, 2024

Description

The test case execution failed; the failure log is available at:
http://magna002.ceph.redhat.com/cephci-jenkins/results/openstack/RH/8.0/rhel-9/Regression/19.2.0-52/rados/45/tier-2_rados_test-drain-customer-issue

Jira tasks to track the issue are -

  1. https://issues.redhat.com/browse/RHCEPHQE-16628
  2. https://issues.redhat.com/browse/RHCEPHQE-16835

Please follow the automation development guidelines. Source of test case: New Feature / Regression Test / Close loop of customer BZs.

Checklist (click to expand)
  • Create a test case in Polarion and get it reviewed and approved.
  • Create a design/automation approach doc (optional for tests similar to those already automated).
  • Review the automation design.
  • Implement the test script and perform test runs.
  • Submit the PR for code review and get it approved.
  • Update the Polarion test with the automation script details and update the automation fields.
  • If the automation is part of a close loop, set the BZ flag qe-test_coverage to "+" and link the Polarion test.

Contributor

openshift-ci bot commented Nov 18, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SrinivasaBharath

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@SrinivasaBharath SrinivasaBharath added RADOS Rados Core tfa-issue-fix TFA automation issue fix labels Nov 18, 2024
f"OSD remove operation is in progress {osd_id}\nOperations: {entry}"
)
except json.JSONDecodeError:
log.info(f"The OSD removal is completed on OSD : {osd_id}")
Contributor

This could mean either that the OSD removal is complete, or that no OSD removal was started in the first place.
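A hedged sketch of how the two cases could be told apart. The helper name and JSON shape below are assumptions for illustration, not cephci API: the idea is to parse the output of `ceph orch osd rm status -f json`, where an empty or non-JSON reply means the orchestrator is not tracking any removal, which covers both "removal completed" and "removal never started".

```python
import json

def removal_state(rm_status_output: str, osd_id: int) -> str:
    """Classify `ceph orch osd rm status -f json` output for one OSD.

    Hypothetical helper: an empty or non-JSON reply means no removal
    is currently tracked, which covers both "removal completed" and
    "removal never started".
    """
    try:
        ops = json.loads(rm_status_output)
    except json.JSONDecodeError:
        return "no removal tracked"
    for entry in ops:
        if str(entry.get("osd_id")) == str(osd_id):
            return "in progress"
    return "no removal tracked"
```

If the test needs to distinguish the two "no removal tracked" cases, it would have to remember whether it issued a removal earlier in the run.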

@@ -135,6 +135,12 @@ def run(ceph_cluster, **kw):
log.info(
f"The OSDs in the drain node before starting the test - {osd_count_before_test} "
)
cmd_set_unmanaged = (
Contributor

This assumes that the OSD spec on the cluster is all-available-devices.

Can we write a generic method that exports the spec from the cluster, adds an unmanaged: true key, and re-applies the spec?

Something similar to set_mon_service_managed_type() present in monitor workflows.
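A rough sketch of the generic approach suggested above, as a pure function over the exported spec. The command strings and helper name are assumptions; cephci's actual wrapper and spec format (JSON vs YAML) may differ:

```python
import json

def toggle_unmanaged(exported_spec: str, unmanaged: bool) -> str:
    """Take the JSON from `ceph orch ls <service> --export --format json`
    (input shape assumed here) and return a spec ready to be fed back to
    `ceph orch apply -i -` with the unmanaged flag toggled."""
    spec = json.loads(exported_spec)
    if unmanaged:
        spec["unmanaged"] = True
    else:
        # cephadm treats a missing key as managed, so drop it entirely.
        spec.pop("unmanaged", None)
    return json.dumps(spec)
```

Exporting and re-applying keeps the rest of the placement spec intact, which is what makes the method generic across service types.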

Contributor Author

New method is created and included.

@@ -152,6 +158,12 @@ def run(ceph_cluster, **kw):
"The traceback messages are noticed in logs.The error snippets are noticed in the MGR logs"
)
return 1
cmd_unset_unmanaged = (
Contributor

Same as above; let's use the new method once it is created.

Contributor Author

Used new method.

Contributor

@pdhiran left a comment

Overall looks good.

A new method that sets a service's unmanaged flag to true/false should be created and used here.

Contributor

mergify bot commented Nov 22, 2024

"This pull request now has conflicts with the target branch. Could you please resolve conflicts and force push the corrected changes?"

@SrinivasaBharath
Contributor Author

Preempt scrub fix pass log: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-HWPGNS

@@ -45,7 +45,6 @@ def set_osd_devices_unmanaged(ceph_cluster, osd_id, unmanaged):
break

if not service_name:
log.error(f"No orch service found for osd: {osd_id}")
Contributor

@SrinivasaBharath please explain why we are making this change?

Contributor Author

It was a mistake on my part; I have reverted the change.

log.debug(
f"Setting the {service_type} service as unmanaged by cephadm. current status : {out}"
)
out["unmanaged"] = "false"
Contributor

Here, instead of explicitly setting it to false, we should remove the "unmanaged" key from the dictionary if it exists.

Contributor

This should be fine. It would not make a difference to the workflow.
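For what it's worth, dropping the key is also the safer pattern because the diff above assigns the string "false", and a non-empty string is truthy in Python. The spec parser may well coerce the value correctly (an assumption either way), but removing the key sidesteps the question entirely:

```python
# Spec dict with the flag set, as after the unmanaged step in the test.
spec = {"service_type": "mgr", "unmanaged": True}

# The diff assigns the *string* "false"; as a Python truth value that
# string is still True, which is the pitfall the review is guarding
# against if the value is ever evaluated as a boolean.
assert bool("false") is True

# Dropping the key, as suggested, avoids the ambiguity: a missing key
# reads as "managed".
spec.pop("unmanaged", None)
```

`dict.pop(key, None)` is a no-op when the key is absent, so the cleanup is safe to call unconditionally.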


time.sleep(10)
# Checking for the unmanaged setting on service
cmd = "ceph orch ls"
Contributor

Please replace the command with "ceph orch ls {service_type}" to avoid traversing the full service list in a for loop:

[ceph: root@ceph-hakumar-ryth74-node1-installer /]# ceph orch ls mon -f json-pretty         

[
  {
    "placement": {
      "label": "mon"
    },
    "service_name": "mon",
    "service_type": "mon",
    "status": {
      "created": "2024-11-15T20:59:25.948847Z",
      "last_refresh": "2024-11-22T10:12:24.482009Z",
      "running": 3,
      "size": 3
    }
  }
]
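The suggestion can be sketched as follows. The sample below is based on the `ceph orch ls mon -f json-pretty` output above, trimmed, with an illustrative "unmanaged" field added (an assumption; the real output carries it only when set). With the service filter in the command itself, the returned list has at most one matching entry, so no loop over the full `ceph orch ls` output is needed:

```python
import json

# Shape based on the `ceph orch ls mon -f json-pretty` sample, with an
# assumed "unmanaged" field for illustration.
orch_ls_mon = """
[
  {
    "placement": {"label": "mon"},
    "service_name": "mon",
    "service_type": "mon",
    "unmanaged": true
  }
]
"""

def is_unmanaged(orch_ls_output: str, service_name: str) -> bool:
    """Check the unmanaged flag for one service in filtered orch output."""
    for svc in json.loads(orch_ls_output):
        if svc["service_name"] == service_name:
            # A missing key means the service is managed.
            return svc.get("unmanaged", False)
    raise ValueError(f"service {service_name!r} not found")
```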

replicated_config = config.get("replicated_pool")
pool_name = replicated_config["pool_name"]
active_osd_list = rados_obj.get_osd_list(status="up")
active_osd_list = rados_obj.get_active_osd_list()
Contributor

Please revert the change in this line to "active_osd_list = rados_obj.get_osd_list(status="up")"
The get_active_osd_list() method no longer exists

Contributor Author

get_active_osd_list is a local method that combines the "in" and "up" OSD lists.
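A minimal sketch of what such a local helper might look like (the function name and list arguments are assumptions; in cephci the per-status lists would come from get_osd_list calls): an OSD counts as active only when it is both up (the daemon is running) and in (it is still assigned data).

```python
def combine_active_osds(up_osds, in_osds):
    """Hypothetical version of the local get_active_osd_list helper:
    an OSD is "active" only if it appears in both the up list (daemon
    running) and the in list (still holding data)."""
    return sorted(set(up_osds) & set(in_osds))
```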

@@ -133,8 +134,10 @@ def run(ceph_cluster, **kw):
try:
osd_count_before_test = get_node_osd_list(rados_obj, ceph_nodes, drain_host)
log.info(
f"The OSDs in the drain node before starting the test - {osd_count_before_test} "
f"st The OSDs in the drain node before starting the te- {osd_count_before_test} "
Contributor

Please fix typo

Contributor Author

Fixed the typo

@@ -194,7 +200,7 @@ def run(ceph_cluster, **kw):
return 1

if bug_exists:
active_osd_list = rados_obj.get_osd_list(status="up")
active_osd_list = rados_obj.get_active_osd_list()
Contributor

Please revert to using get_osd_list(status="up")

Contributor Author

Done.

@@ -85,7 +85,7 @@ def run(ceph_cluster, **kw):

log_lines = [
"head preempted",
"WaitReplicas::react(const GotReplicas&) PREEMPTED",
"WaitReplicas::react(const GotReplicas&) PREEMPTED!",
Contributor

@harshkumarRH commented Nov 22, 2024

@SrinivasaBharath I would like more clarity here.
The comparison in verify_preempt_log uses not in.
So if a substring of the sentence was not being found in the log lines previously, how is adding an extra '!' character going to make any difference?
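The reviewer's point can be demonstrated directly (log text below is illustrative, not taken from an actual run): with substring matching via `in` / `not in`, a longer needle can only match in fewer places, so appending '!' can never turn a previous miss into a hit.

```python
# Illustrative log line containing the expected scrub preemption message.
log_line = "osd.3 scrubber: WaitReplicas::react(const GotReplicas&) PREEMPTED! (slept)"

short = "WaitReplicas::react(const GotReplicas&) PREEMPTED"
long_ = short + "!"

# Wherever the longer needle matches, the shorter one must match too...
assert long_ in log_line and short in log_line

# ...and if the shorter needle is absent, the longer one (a superstring)
# is necessarily absent as well, so adding '!' cannot fix a failed lookup.
missing_log = "osd.3 scrubber: scrub finished cleanly"
assert short not in missing_log and long_ not in missing_log
```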

Contributor Author

I reverted the changes. The fix will be provided in a separate PR.

replicated_config = config.get("replicated_pool")
pool_name = replicated_config["pool_name"]
active_osd_list = rados_obj.get_osd_list(status="up")
active_osd_list = get_active_osd_list(rados_obj)
Contributor

reuse existing method

Contributor Author

Used the existing method.

@SrinivasaBharath SrinivasaBharath force-pushed the rados_tfa_wip branch 2 times, most recently from dc036db to 73598f5 Compare November 27, 2024 03:25
@@ -4615,3 +4615,71 @@ def get_rados_df(self, pool_name: str = None):
out = self.run_ceph_command(cmd=_cmd, client_exec=True)

return out["pools"][0] if pool_name else out

def get_service_list(self, service_type=None):
Contributor

Please check existing method def list_orch_services(self, service_type=None) -> list

Contributor Author

Updated the code.

… failover commands and preempt scrub fix

Signed-off-by: Srinivasa Bharath Kanta <[email protected]>
@SrinivasaBharath
Contributor Author

Labels
RADOS Rados Core tfa-issue-fix TFA automation issue fix