-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DPE-5962] - Implement Basic TLS workflow #13
base: main
Are you sure you want to change the base?
Conversation
0a3ece6
to
ea43909
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Smail, thanks for the implementation! It looks quite good already, I have added some questions where we need to clarify things, and some remarks for things that should be changed.
One general question: Could you please add some more unit test coverage for adding a certificate/ca, e.g. via certificate availabe?
If you need clarification, we can also schedule a call to go through the topics.
Thank you!
src/common/client.py
Outdated
capture_output=True, | ||
input=use_input, | ||
timeout=10, | ||
).stdout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you remove the strip()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it is unnecessary and should not be stripped at this point. We use the function to return multiple formats including the json
format which returns a parseable string directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In an earlier PR, Mehdi explicitly asked for it: #5 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add it back. Thanks, René.
"""The common name for the server certificate.""" | ||
cn = self.relation_data.get("common-name", "") | ||
if not cn: | ||
cn = f"{self.unit_name}-{uuid.uuid4()!s}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we add a unique identifier to the common name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This came up in a discussion with Mykola. The unit_name
alone might be unique in some cases like cross-model relations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I understand that. But I thought that's what the argument add_unique_id_to_subject_name
in generate_csr
of the tls-lib is for?
When inspecting one of the certs written to disk, I see this:
Subject: CN = etcd/7-2403feb0-985f-4ac9-9ab3-233fe636f831, x500UniqueIdentifier = 5df60c61-6e07-4e05-8ae6-704a199f780a, O = peer
Looks like we have to uuid's now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussing this in a meeting, we agreed to switch the common name back to the unit_name
and leave a TODO comment to handle this in case we have cross-model relations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can still address it in this PR - please replace the uuid by the model UUID - it can be retrieved from: charm.model.uuid
""" | ||
self.charm.tls_manager.set_tls_state(state=TLSState.TO_TLS) | ||
|
||
def _on_relation_joined(self, event: RelationJoinedEvent): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the logic in the handler can be removed for now, it is not really necessary to defer this event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to block the user from adding only one relation and not the other. We need both client
and peer
relations to be present together. This blocks the charm until they are both created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit: This comment was wrongly placed here previously.
I understand your idea and we can keep it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain this choice? Since we ended up using the 2 relations approach, why is it not possible for the user to choose to enable TLS for only one of the 2?
if ( | ||
not self.charm.state.unit_server.peer_cert_ready | ||
and not self.charm.state.unit_server.client_cert_ready | ||
): | ||
self.charm.rolling_restart() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That means removing only one of the relations (e.g. switching the client certificate provider) would not restart the units. Is this a good idea? We would then leave the cluster in Error state forever, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My line of thinking was to enable/disable both the client
and peer
together we do not want a state where only one of the relations is there and the other is not.
We should further discuss this scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. We either need to find a way to automate replacing only one of the cert providers, or document it in the user docs how to handle this use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussing this in a meeting, I will perform manual tests to ensure the code handles this case. The rolling restart should be initiated when the new relation is created and the new certs arrive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested it and it fails:
- juju integrate etcd:peer-certificates self-signed-certificates:certificates
- juju integrate etcd:client-certificates self-signed-certificates:certificates
- juju remove-relation etcd:client-certificates self-signed-certificates:certificates
- juju integrate etcd:client-certificates self-signed-certificates:certificates
results in:
etcd/9* blocked idle 12 10.105.253.156 failed to transition to/from tls state
etcd/10 blocked idle 10 10.105.253.228 failed to transition to/from tls state
etcd/11 blocked idle 11 10.105.253.70 failed to transition to/from tls state
debug-log:
unit-etcd-11: 08:43:07 DEBUG unit.etcd/11.juju-log restart:12: Member list: None
unit-etcd-11: 08:43:07 ERROR unit.etcd/11.juju-log restart:12: Enabling TLS failed: member list command failed
unit-etcd-11: 08:43:07 ERROR unit.etcd/11.juju-log restart:12: failed to transition to/from tls state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to make a decision here, are we supporting 1 type of certificates or it's either 2 or 0.
if key_path.exists(): | ||
key_path.unlink() | ||
|
||
if ca_path.exists(): | ||
ca_path.unlink() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've found a small issue by testing: when the client-certificates
relation is removed, the key and ca-cert remain on disk:
root@juju-e1b312-7:/var/snap/charmed-etcd/common/tls# ls
client.key client_ca.pem
I believe this is because the key_path
and ca_path
variables are still set to the peer
paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is true. I missed adding ca_path = Path(self.workload.paths.tls.client_ca)
under the CertType.CLIENT
condition.
) -> str: | ||
"""Write data to etcd using `etcdctl` via `juju ssh`.""" | ||
etcd_command = f"{SNAP_NAME}.etcdctl put {key} {value} --endpoints={endpoints}" | ||
if user: | ||
etcd_command = f"{etcd_command} --user={user}" | ||
if password: | ||
etcd_command = f"{etcd_command} --password={password}" | ||
if tls_enabled: | ||
etcd_command = f"{etcd_command} --cacert /var/snap/charmed-etcd/common/tls/client_ca.pem --cert /var/snap/charmed-etcd/common/tls/client.pem --key /var/snap/charmed-etcd/common/tls/client.key" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry Smail for bringing up another one after you already worked on all my previous comments. Could you please make this a multiline string?
etcd_command = f"""{etcd_command} \
--cacert ... \
--cert ... \
...
That would make it easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Smail, great work! I left some comments and some questions
if self.state.unit_server.tls_state == TLSState.TO_TLS: | ||
try: | ||
logger.debug("Enabling TLS through rolling restart") | ||
logger.debug("Broadcasting new peer url") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To not pollute the charm.py
with too many log lines, this and similar debug logs should be moved to the cluster_manager.broadcast_peer_url
and other manager methods.
logger.debug("Writing configuration") | ||
self.config_manager.set_config_properties() | ||
|
||
self.tls_manager.set_tls_state(state=TLSState.TLS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that this should be the job of the manager which only acts as a proxy to self.charm.state
. Consider doing it here directly
raise Exception("Failed to check health of the member after restart") | ||
|
||
except Exception as e: | ||
logger.error(f"Enabling TLS failed: {e}") | ||
self.set_status(Status.TLS_TRANSITION_FAILED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to raise an exception here, log and directly set the status instead.
logger.debug("Enabling TLS through rolling restart") | ||
logger.debug("Broadcasting new peer url") | ||
self.cluster_manager.broadcast_peer_url( | ||
self.state.unit_server.peer_url.replace("http://", "https://") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The state.unit_server.peer_url
is no longer up-to-date if we do this.
We should rather:
- introduce a new state:
scheme
which will hold / return either values:http, https
- update the state here
broadcast_peer_url
can take no argument, and will fetch the rightpeer_url
andpeer_scheme
and form the full url.- I think the
events/tls
module is better suited for changing the state - the restart should rather be dumber imo
@@ -44,6 +52,57 @@ def set_status(self, key: Status) -> None: | |||
getattr(logger, log_level.lower())(status.message) | |||
self.unit.status = status | |||
|
|||
def _restart(self, _) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, this method carries quite a bit of duplicated logic, incorporating the feedback points, we could simplify it like:
def _restart(self, _) -> None:
"""Restart callback for the rolling ips lib."""
status_on_failure = FAILED_TO_RESTART
if self.state.unit_server.tls_state == TLSState.TO_TLS:
logger.debug("Enabling TLS through rolling restart")
self.state.unit_server.update({"peer-scheme": "https", "tls-state": TLSState.TLS})
status_on_failure = Status.TLS_TRANSITION_FAILED
elif self.state.unit_server.tls_state == TLSState.TO_NO_TLS:
logger.debug("Disabling TLS through rolling restart")
self.tls_manager.delete_certificates()
self.state.unit_server.update({"peer-scheme": "http", "tls-state": TLSState.NO_TLS})
status_on_failure = Status.TLS_TRANSITION_FAILED
self.cluster_manager.broadcast_peer_url()
self.config_manager.set_config_properties()
if not self.cluster_manager.restart_member():
self.set_status(status_on_failure)
logger.error("Failed to check health of the member after restart")
key=TEST_KEY, | ||
) | ||
== TEST_VALUE | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a new .put (different key/value and assert on it)
@pytest.mark.runner(["self-hosted", "linux", "X64", "jammy", "large"]) | ||
@pytest.mark.group(1) | ||
@pytest.mark.abort_on_fail | ||
async def test_turning_on_tls(ops_test: OpsTest) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
async def test_turning_on_tls(ops_test: OpsTest) -> None: | |
async def test_enable_tls(ops_test: OpsTest) -> None: |
assert ops_test.model | ||
model = ops_test.model_full_name | ||
assert model is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment, is this needed?
|
||
# enable TLS and check if the cluster is still accessible | ||
await ops_test.model.integrate(f"{APP_NAME}:peer-certificates", TLS_NAME) | ||
await ops_test.model.integrate(f"{APP_NAME}:client-certificates", TLS_NAME) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar comment on status
tls_enabled=True, | ||
) | ||
== TEST_VALUE | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs a .put
here
This PR implements basic TLS workflow for enabling/disabling TLS on startup and while the cluster is running.
TLS Support Enhancements:
metadata.yaml
: Added new interfaces forpeer-certificates
andclient-certificates
to support TLS certificates.src/charm.py
: IntroducedTLSManager
,TLSEvents
, andRollingOpsManager
to manage TLS transitions and rolling restarts. [1] [2]src/core/models.py
: AddedTLSState
enum and properties to handle TLS state and certificate readiness inEtcdServer
. [1] [2]Dependency Updates:
pyproject.toml
: Updated dependencies forcryptography
andpydantic
to specific versions to ensure compatibility with TLS certificate handling.EtcdClient
Enhancements:src/common/client.py
: Added methods for member list retrieval and health checks, and updated_run_etcdctl
to support TLS arguments. [1] [2]Code Documentation:
src/common/secrets.py
: Added detailed docstrings to improve code documentation and clarify the purpose and usage of methods. [1] [2]