Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DPE-5962] - Implement Basic TLS workflow #13

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

skourta
Copy link
Contributor

@skourta skourta commented Dec 17, 2024

This PR implements basic TLS workflow for enabling/disabling TLS on startup and while the cluster is running.

TLS Support Enhancements:

  • metadata.yaml: Added new interfaces for peer-certificates and client-certificates to support TLS certificates.
  • src/charm.py: Introduced TLSManager, TLSEvents, and RollingOpsManager to manage TLS transitions and rolling restarts. [1] [2]
  • src/core/models.py: Added TLSState enum and properties to handle TLS state and certificate readiness in EtcdServer. [1] [2]

Dependency Updates:

  • pyproject.toml: Updated dependencies for cryptography and pydantic to specific versions to ensure compatibility with TLS certificate handling.

EtcdClient Enhancements:

  • src/common/client.py: Added methods for member list retrieval and health checks, and updated _run_etcdctl to support TLS arguments. [1] [2]

Code Documentation:

@skourta skourta force-pushed the DPE-5962-tls-implement-basic-flow branch from 0a3ece6 to ea43909 Compare January 8, 2025 15:35
@skourta skourta marked this pull request as ready for review January 8, 2025 15:52
Copy link
Collaborator

@reneradoi reneradoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Smail, thanks for the implementation! It looks quite good already, I have added some questions where we need to clarify things, and some remarks for things that should be changed.

One general question: Could you please add some more unit test coverage for adding a certificate/ca, e.g. via certificate availabe?

If you need clarification, we can also schedule a call to go through the topics.

Thank you!

src/common/secrets.py Outdated Show resolved Hide resolved
src/managers/tls.py Show resolved Hide resolved
src/charm.py Outdated Show resolved Hide resolved
src/common/client.py Outdated Show resolved Hide resolved
capture_output=True,
input=use_input,
timeout=10,
).stdout
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove the strip()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it is unnecessary and should not be stripped at this point. We use the function to return multiple formats including the json format which returns a parseable string directly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In an earlier PR, Mehdi explicitly asked for it: #5 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add it back. Thanks, René.

"""The common name for the server certificate."""
cn = self.relation_data.get("common-name", "")
if not cn:
cn = f"{self.unit_name}-{uuid.uuid4()!s}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we add a unique identifier to the common name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This came up in a discussion with Mykola. The unit_name alone might be unique in some cases like cross-model relations.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I understand that. But I thought that's what the argument add_unique_id_to_subject_name in generate_csr of the tls-lib is for?

When inspecting one of the certs written to disk, I see this:
Subject: CN = etcd/7-2403feb0-985f-4ac9-9ab3-233fe636f831, x500UniqueIdentifier = 5df60c61-6e07-4e05-8ae6-704a199f780a, O = peer

Looks like we have to uuid's now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussing this in a meeting, we agreed to switch the common name back to the unit_name and leave a TODO comment to handle this in case we have cross-model relations.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can still address it in this PR - please replace the uuid by the model UUID - it can be retrieved from: charm.model.uuid

src/events/etcd.py Show resolved Hide resolved
"""
self.charm.tls_manager.set_tls_state(state=TLSState.TO_TLS)

def _on_relation_joined(self, event: RelationJoinedEvent):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the logic in the handler can be removed for now, it is not really necessary to defer this event.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to block the user from adding only one relation and not the other. We need both client and peer relations to be present together. This blocks the charm until they are both created.

Copy link
Collaborator

@reneradoi reneradoi Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: This comment was wrongly placed here previously.

I understand your idea and we can keep it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this choice? Since we ended up using the 2 relations approach, why is it not possible for the user to choose to enable TLS for only one of the 2?

Comment on lines +174 to +178
if (
not self.charm.state.unit_server.peer_cert_ready
and not self.charm.state.unit_server.client_cert_ready
):
self.charm.rolling_restart()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means removing only one of the relations (e.g. switching the client certificate provider) would not restart the units. Is this a good idea? We would then leave the cluster in Error state forever, no?

Copy link
Contributor Author

@skourta skourta Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My line of thinking was to enable/disable both the client and peer together we do not want a state where only one of the relations is there and the other is not.

We should further discuss this scenario.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. We either need to find a way to automate replacing only one of the cert providers, or document it in the user docs how to handle this use case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussing this in a meeting, I will perform manual tests to ensure the code handles this case. The rolling restart should be initiated when the new relation is created and the new certs arrive.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested it and it fails:

  • juju integrate etcd:peer-certificates self-signed-certificates:certificates
  • juju integrate etcd:client-certificates self-signed-certificates:certificates
  • juju remove-relation etcd:client-certificates self-signed-certificates:certificates
  • juju integrate etcd:client-certificates self-signed-certificates:certificates

results in:

etcd/9*                      blocked   idle   12       10.105.253.156         failed to transition to/from tls state
etcd/10                      blocked   idle   10       10.105.253.228         failed to transition to/from tls state
etcd/11                      blocked   idle   11       10.105.253.70          failed to transition to/from tls state               

debug-log:

unit-etcd-11: 08:43:07 DEBUG unit.etcd/11.juju-log restart:12: Member list: None
unit-etcd-11: 08:43:07 ERROR unit.etcd/11.juju-log restart:12: Enabling TLS failed: member list command failed
unit-etcd-11: 08:43:07 ERROR unit.etcd/11.juju-log restart:12: failed to transition to/from tls state

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to make a decision here, are we supporting 1 type of certificates or it's either 2 or 0.

src/managers/cluster.py Outdated Show resolved Hide resolved
Comment on lines +130 to +134
if key_path.exists():
key_path.unlink()

if ca_path.exists():
ca_path.unlink()
Copy link
Collaborator

@reneradoi reneradoi Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've found a small issue by testing: when the client-certificates relation is removed, the key and ca-cert remain on disk:

root@juju-e1b312-7:/var/snap/charmed-etcd/common/tls# ls
client.key  client_ca.pem

I believe this is because the key_path and ca_path variables are still set to the peer paths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is true. I missed adding ca_path = Path(self.workload.paths.tls.client_ca) under the CertType.CLIENT condition.

) -> str:
"""Write data to etcd using `etcdctl` via `juju ssh`."""
etcd_command = f"{SNAP_NAME}.etcdctl put {key} {value} --endpoints={endpoints}"
if user:
etcd_command = f"{etcd_command} --user={user}"
if password:
etcd_command = f"{etcd_command} --password={password}"
if tls_enabled:
etcd_command = f"{etcd_command} --cacert /var/snap/charmed-etcd/common/tls/client_ca.pem --cert /var/snap/charmed-etcd/common/tls/client.pem --key /var/snap/charmed-etcd/common/tls/client.key"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry Smail for bringing up another one after you already worked on all my previous comments. Could you please make this a multiline string?

etcd_command = f"""{etcd_command} \
                             --cacert ... \
                             --cert ... \
...

That would make it easier to read.

Copy link

@Mehdi-Bendriss Mehdi-Bendriss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Smail, great work! I left some comments and some questions

if self.state.unit_server.tls_state == TLSState.TO_TLS:
try:
logger.debug("Enabling TLS through rolling restart")
logger.debug("Broadcasting new peer url")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To not pollute the charm.py with too many log lines, this and similar debug logs should be moved to the cluster_manager.broadcast_peer_url and other manager methods.

logger.debug("Writing configuration")
self.config_manager.set_config_properties()

self.tls_manager.set_tls_state(state=TLSState.TLS)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this should be the job of the manager which only acts as a proxy to self.charm.state. Consider doing it here directly

Comment on lines +70 to +74
raise Exception("Failed to check health of the member after restart")

except Exception as e:
logger.error(f"Enabling TLS failed: {e}")
self.set_status(Status.TLS_TRANSITION_FAILED)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to raise an exception here, log and directly set the status instead.

logger.debug("Enabling TLS through rolling restart")
logger.debug("Broadcasting new peer url")
self.cluster_manager.broadcast_peer_url(
self.state.unit_server.peer_url.replace("http://", "https://")
Copy link

@Mehdi-Bendriss Mehdi-Bendriss Jan 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The state.unit_server.peer_url is no longer up-to-date if we do this.
We should rather:

  1. introduce a new state: scheme which will hold / return either values: http, https
  2. update the state here
  3. broadcast_peer_url can take no argument, and will fetch the right peer_url and peer_scheme and form the full url.
  4. I think the events/tls module is better suited for changing the state - the restart should rather be dumber imo

@@ -44,6 +52,57 @@ def set_status(self, key: Status) -> None:
getattr(logger, log_level.lower())(status.message)
self.unit.status = status

def _restart(self, _) -> None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, this method carries quite a bit of duplicated logic, incorporating the feedback points, we could simplify it like:

def _restart(self, _) -> None:
    """Restart callback for the rolling ips lib."""
    status_on_failure = FAILED_TO_RESTART
    if self.state.unit_server.tls_state == TLSState.TO_TLS:
        logger.debug("Enabling TLS through rolling restart")
        self.state.unit_server.update({"peer-scheme": "https", "tls-state": TLSState.TLS})
        status_on_failure = Status.TLS_TRANSITION_FAILED
    elif self.state.unit_server.tls_state == TLSState.TO_NO_TLS:
        logger.debug("Disabling TLS through rolling restart")
        self.tls_manager.delete_certificates()
        self.state.unit_server.update({"peer-scheme": "http", "tls-state": TLSState.NO_TLS})
        status_on_failure = Status.TLS_TRANSITION_FAILED

    self.cluster_manager.broadcast_peer_url()
    self.config_manager.set_config_properties()

    if not self.cluster_manager.restart_member():
        self.set_status(status_on_failure)
        logger.error("Failed to check health of the member after restart")

key=TEST_KEY,
)
== TEST_VALUE
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a new .put (different key/value and assert on it)

@pytest.mark.runner(["self-hosted", "linux", "X64", "jammy", "large"])
@pytest.mark.group(1)
@pytest.mark.abort_on_fail
async def test_turning_on_tls(ops_test: OpsTest) -> None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
async def test_turning_on_tls(ops_test: OpsTest) -> None:
async def test_enable_tls(ops_test: OpsTest) -> None:

Comment on lines +139 to +141
assert ops_test.model
model = ops_test.model_full_name
assert model is not None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment, is this needed?


# enable TLS and check if the cluster is still accessible
await ops_test.model.integrate(f"{APP_NAME}:peer-certificates", TLS_NAME)
await ops_test.model.integrate(f"{APP_NAME}:client-certificates", TLS_NAME)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar comment on status

tls_enabled=True,
)
== TEST_VALUE
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a .put here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants