Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor testnet integration testing #81

Closed
7 of 8 tasks
tegefaulkes opened this issue Dec 18, 2023 · 22 comments
Closed
7 of 8 tasks

Refactor testnet integration testing #81

tegefaulkes opened this issue Dec 18, 2023 · 22 comments
Assignees
Labels
development Standard development r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices

Comments

@tegefaulkes
Copy link
Contributor

tegefaulkes commented Dec 18, 2023

Specification

We need to refactor and streamline testnet integration testing. There are a few core goals of this.

  1. Extract and simplify integration specific tests and move them to a separate test directory.
  2. simplify the normal tests by removing the testif test switching and removing connections to the testnet.
  3. update ci jobs to do integration testing after the seed nodes have updated with the new deployment.

For All of the testif usage in the standard tests need to be removed. The integration tests should not just be repeats of certain standard tests. The integration tests may still use the testif utility to prevent some tests on certain platforms. This would need to be looked into.

Integration tests need to be moved into a separate test directory to keep it isolated from the standard tests. This will allow us to easily run the standard and integration tests separately. The Integration tests should focus explicitly on two aspects.

  1. Running built and package executable to verify that they are working
  2. Doing some basic checks while connecting to the testnet.

The ci jobs need to be updated as well. Currently we have a job that triggers the seed node tasks to restart with the latest image we deployed. This will be removed and handled by the infrastructure repo. This step should be replaced with a job that waits for the seed nodes to update before proceeding. We'll need an API endpoint on the polykey dashboard that exposes the seed node versions in some fashion to help with this. @tegefaulkes will need to work with @amydevs to spec this out and address it.

Waiting for the seed nodes to deploy and update is not essential here. Any breaking changes on the agent-agent RPC should be pretty rare. For now I'm going to address it last as things should function without it for now.

Additional context

Tasks

  • 1. All the standard tests will no attempt connections to any network. They should explicitly be started with no seed nodes.
  • 2. The integration tests will be separated from the standard tests.
    • 1. remove usage of the testif utility from the standard tests.
    • 2. integration tests will be in a separate folder structure from the standard tests.
    • 3. integration tests will focus on connecting to the testnet.
  • 3. integration tests need to wait for the testnet to be updated. to this end the following changes are to be made.
    • 1. CLI ci job for for triggering the testnet seed nodes using the new image will be removed. This will be handled by the testnet infrastructure.
    • 2. A job will be created to wait for the seed nodes to switch to the new version. This will be done by polling a polykey dashboard endpoint that will either return the when all seednodes have been updated, or list the versions of all the seed nodes. This will need to be speced out.
@tegefaulkes tegefaulkes added the development Standard development label Dec 18, 2023
@tegefaulkes tegefaulkes self-assigned this Dec 18, 2023
@tegefaulkes
Copy link
Contributor Author

To simplify how the integration tests are run, We're removing all usage of env variables and switching in the tests. The command used for the integration tests will be coded into the test and not expected to he set by the CI job as an env variable.

In the case of the docker tests the ci job is simplified to

      nix-shell --arg ci true --run $'
      image_and_tag="$(docker load --input ./builds/*docker* | cut -d\' \' -f3)";
      docker tag "$image_and_tag" "polykey:testTarget";
      npm run test tests/integration/docker;

The command used in the docker test will be hard coded as docker run ${dockerOptions} polykey:testTarget. So there will be no need for env variables. So long as a polykey image tagged as polykey:testTarget exists then the docker test can be run.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Dec 19, 2023 via email

@tegefaulkes
Copy link
Contributor Author

I'll change the name of the tagged image, but I don't know of a good way to pass the image name in when running the tests. I'm keeping that hard coded for now.

@tegefaulkes
Copy link
Contributor Author

tegefaulkes commented Dec 19, 2023

I need to spec out how to handle checking if the seed nodes have updated.

This can be done simply with two pieces of information.

  1. Knowing that the new desired version is.
  2. asking the dashboard for every seed node status and checking their versions.

We can compare the seed nodes to the new version and wait for all live nodes to match that version. We can take the current version from the config file to know 1. And we'll need to add an API endpoint to the Polykey dashboard for 2.

The Polykey dashboard API needs an endpoint that returns all of the active nodes and their status. This will be similar to getting the status of a node via the CLI but we only want to present public information here. We can include the following information. Any suggestions are welcome.

{
  nodeId: NodeIdEncoded,
  agentHost: Host,
  agentPort: Port,
  upTime: number,
  connectionsActive: number, // Number of active connections
  nodesTotal: number, // Number of seen connections in the node graph
  versionCLI: string, // Version of the Polykey CLI code
  versionPolykey: string, // Version of the Polykey lib code
  versionState: number, // Version of the State schema
  versionNetwork: number, // Version of the network API
}

The dashboard will have to poll the seed nodes for this information and cache it. It won't be ideal for every request to trigger asking every seed node.

To implement this API endpoint we need some way of asking seed nodes this information. I am unsure how we check seed nodes already. Is this done via the client RPC? Maybe we just need to expand the existing status RPC call to return the version information.

@tegefaulkes will need to work with @amydevs to expand this spec. And maybe make a new issue for tracking this functionality specifically.

@amydevs
Copy link
Contributor

amydevs commented Dec 19, 2023

i think @CMCDragonkai mentioned wanting to separate the version information to a separate endpoint from the basic information like remote info

@amydevs
Copy link
Contributor

amydevs commented Dec 19, 2023

other than that, i think the properties are okay.
Although i'm returning a POJO with the keys being set to the nodeId, and the value being a POJO that contains the related information for that node

@tegefaulkes
Copy link
Contributor Author

That works too, only that it's a little more annoying to iterate over keys in a POJO.

@tegefaulkes
Copy link
Contributor Author

Parts 1, 2 can be tested but it needs to be merged first. I can defer part 3 to a new issue/PR as it depends on some changes external to this repo. I'm going to move forward with testing part 1 and 2.

@amydevs
Copy link
Contributor

amydevs commented Dec 19, 2023

i think it would make sense for the versions to be exposed via both the agentStatus RPC call and a separate version RPC call.

@tegefaulkes
Copy link
Contributor Author

Things are mostly working with the docker integration tests now except for 1 problem.

One of the tests is failing due to a CLI call being made to an agent failing. It's failing with...

{"level":"ERROR","keys":"polykey.PolykeyClient.WebSocketClient","msg":"ErrorWebSocketConnectionLocal: WebSocket Connection local error - WebSocket could not open due to internal error"}
{"level":"ERROR","keys":"polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0","msg":"ErrorWebSocketConnectionLocal: WebSocket Connection local error - WebSocket could not open due to internal error"}
{"type":"ErrorWebSocketConnectionLocal","data":{"message":"WebSocket could not open due to internal error","timestamp":"2023-12-20T03:55:11.025Z","data":{"errorCode":1011,"reason":"WebSocket could not open due to internal error"},"cause":{"errno":-111,"code":"ECONNREFUSED","syscall":"connect","address":"127.0.0.1","port":36569},"stack":"ErrorWebSocketConnectionLocal: WebSocket could not open due to internal error\n    at WebSocket.openErrorHandler (/builds/MatrixAI/open-source/Polykey-CLI/node_modules/@matrixai/ws/src/WebSocketConnection.ts:693:16)\n    at Object.onceWrapper (node:events:629:26)\n    at WebSocket.emit (node:events:514:28)\n    at emitErrorAndClose (/builds/MatrixAI/open-source/Polykey-CLI/node_modules/ws/lib/websocket.js:1016:13)\n    at ClientRequest.<anonymous> (/builds/MatrixAI/open-source/Polykey-CLI/node_modules/ws/lib/websocket.js:864:5)\n    at ClientRequest.emit (node:events:514:28)\n    at TLSSocket.socketErrorListener (node:_http_client:501:9)\n    at TLSSocket.emit (node:events:514:28)\n    at emitErrorNT (node:internal/streams/destroy:151:8)\n    at emitErrorCloseNT (node:internal/streams/destroy:116:3)\n    at processTicksAndRejections (node:internal/process/task_queues:82:21)"}}

Specifically the ECONNREFUSED when making the connection. The agent is running fine here, it works locally so I can only assume it's some weirdness with the DIND networking setup in the CI job.

I'll do a little more digging but this isn't the priority right now. So I may have to shortcut the tests for now. I can see that the connections are being made and the agent isn't crashing so that'll have to be enough for now.

@CMCDragonkai
Copy link
Member

Did you make sure you're using 0.0.0.0 and always binding to wildcard inside docker?

@tegefaulkes
Copy link
Contributor Author

The CI pipeline has passed now. So points 1 and 2 are confirmed done. The docker integration tests are really basic right now with a really basic start in foreground and start with joining the testnet. The testnet test just runs the agent with the network set. We don't do any checks right now. but now we have a good basis for expanding the integration tests.

There are some things that need to be looked into with the docker integration tests. Right now there is an issue with making client CLI call but starting an agent works fine. Since we're running DIND to do the testing there's a level of what the hell is going on when there are connection problems so it takes some time to iron out. The MDNS tests is failing in the CI. It could all be explained by docker instance being unable to connect to each other. I did confirm that external connections are working fine.

@CMCDragonkai
Copy link
Member

Maybe this is why a diagnostics and audit domains is going to be important to provide that observability.

Remember the struggles you're having right now is the same struggles that customers will have when they deploy PK into their containerized infrastructure. So it's important for you work through and have a clear understanding of "what the hell is going on" and all the failure modes here.

@tegefaulkes
Copy link
Contributor Author

Did you make sure you're using 0.0.0.0 and always binding to wildcard inside docker?

Yes, it's using the default so it should be binding to the wildcard. I saw that external connections to the testnet were being made while testing so that's working.

@CMCDragonkai
Copy link
Member

Don't rely on defaults when doing tests. You never know what the defaults may change to later in the future when we are updating the code. In tests, it's better to be explicitly configured than to rely on implicit defaults. Make sure it is in fact binding to the right places.

@tegefaulkes
Copy link
Contributor Author

I'm considering this done now. I've modified the deployment jobs to wait for the seed nodes to update. I have not removed the part that triggers the deployment since I haven't confirmed that the labda that handles this now is working properly. I don't think it is triggering properly but I don't know enough about how it works to debug it quickly and @amydevs has already left for Christmas.

Everything still functions without it for now. But when it does work I need to modify the deployment job one last time.

@amydevs
Copy link
Contributor

amydevs commented Dec 22, 2023

I think i know how to fix it, i'll do it later. I realised that the execution role of the lambda needed a policy that allowed it to execute itself. It's a bit unintuitive, but i think the reason is cos the eventrule assumes the execution role when executing the eventtarget lambda

@CMCDragonkai
Copy link
Member

I think i know how to fix it, i'll do it later. I realised that the execution role of the lambda needed a policy that allowed it to execute itself. It's a bit unintuitive, but i think the reason is cos the eventrule assumes the execution role when executing the eventtarget lambda

Is there a way to visualize this somehow? https://www.pulumi.com/blog/six-things-about-pulumi-service/ - I kind of want to know what happened to running our own DB, if we are using pulumi's state, what's the billing situation for it, is this a trial? I don't remember setting up a pulumi account for Matrix AI.

I think in the new year it would be good for you to do a presentation explaining how the pulumi stack works in PKI. For @tegefaulkes and myself and others. Would be a good blog post.

I want to compare it with https://www.winglang.io/ - lots of interesting lessons from Matrix OS - Architect language work that Matrix AI worked on before.

@amydevs
Copy link
Contributor

amydevs commented Jan 5, 2024

I'm considering this done now. I've modified the deployment jobs to wait for the seed nodes to update. I have not removed the part that triggers the deployment since I haven't confirmed that the labda that handles this now is working properly. I don't think it is triggering properly but I don't know enough about how it works to debug it quickly and @amydevs has already left for Christmas.

Everything still functions without it for now. But when it does work I need to modify the deployment job one last time.

I've fixed this. Lambda works as intended now.

@CMCDragonkai
Copy link
Member

The lambda has apparently a different set of permissions? Did you configure it all into pulumi?

@amydevs
Copy link
Contributor

amydevs commented Jan 5, 2024

The lambda has apparently a different set of permissions? Did you configure it all into pulumi?

Yep @CMCDragonkai

@CMCDragonkai
Copy link
Member

Ok make sure you've documented it there too.

@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Aug 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development Standard development r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices
Development

No branches or pull requests

3 participants