Integration Tests for testnet.polykey.com
#71
Comments
Seems like our tests aren't simulating long-running behaviour. This is unfortunate, because certain bugs only became apparent after running the nodes on the testnet for longer than 1 hour. We may need to improve our modelling tooling to help with correctness, as well as apply monitoring and remote debugging tools to our testnet so we can observe memory and CPU usage. Node.js provides this via its debugging port, which could be enabled on our testnet nodes, possibly bootstrapping off the client port. Related to MatrixAI/Polykey#412.
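For reference, a minimal sketch of enabling the debugger programmatically using the plain Node.js inspector API (wiring this into agent startup is an assumption here, not existing Polykey behaviour):

```ts
import * as inspector from 'node:inspector';

// Open the V8 inspector so a remote debugger (e.g. Chrome DevTools)
// can attach to a testnet node for memory/CPU profiling.
// Binding to 0.0.0.0 exposes the port publicly, so in practice it
// should be firewalled or tunnelled over SSH.
inspector.open(9229, '0.0.0.0');
console.log(`Inspector listening at ${inspector.url()}`);
```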
I'm also noticing that the re-connection attempts discussed in MatrixAI/Polykey#413 and MatrixAI/Polykey#415 don't appear when I use the same setup locally (two local agents, where the one that gets started first is set as a seed node for the second one). So even if we had tests simulating long-running behaviour, they may not have picked up these issues, since the issues may be limited to the testnet environment.
Also, unlike these, our NAT tests use conditional testing. We could re-use that approach here.
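For context, conditional testing in jest can be as simple as wrapping `test`/`describe`; this is a generic sketch, not necessarily the helpers the NAT tests actually use:

```ts
// Run a test or suite only when a precondition holds; otherwise
// report it as skipped rather than silently passing.
const testIf = (condition: boolean) => (condition ? test : test.skip);
const describeIf = (condition: boolean) =>
  condition ? describe : describe.skip;

// Usage: only run network-dependent suites when explicitly enabled.
describeIf(process.env.PK_TEST_TESTNET === '1')('testnet suite', () => {
  testIf(process.platform === 'linux')('linux-only case', async () => {
    // ...
  });
});
```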
The only difference in the testnet environment is that it's running in a docker container. If the docker container is producing different behaviour, that has to be brought to light by the docker integration testing MatrixAI/Polykey#407. Caveat: when using …
@emmacasolin you can always run your own docker container locally and set that up as your local "seed node", and have it continuously run while you write tests against it. Just make sure to feed it all the required env variables or parameters, and mount the necessary namespaces. Meet with @tegefaulkes about this; he's already doing it.
If we continue down the path of reusing …
This is actually incorrect. I left my local setup going in the background for several hours, and both agents had attempted multiple node connections to the other agent when I checked back (so many that they were cut off in the logs). At one point when I checked, one of the agents was displaying these logs (rapidly) while the other was silent, but now both of them are silent. This might be from the refresh buckets queue, if that's something that happens every hour? I'm going to try reducing the time between refreshes and adding timestamps to the logs. As a side note, I think these logs appear to be infinite and constant on the testnet because it's a lot slower than my local machine, so it's only able to attempt a connection every 20 seconds, and since it takes so long to get through the buckets, it's refreshing them again by the time it's finished.
I think I know what the main issue causing our "infinite loop" is. The timeout for opening a forward proxy connection is 20 seconds, so if we try to connect to a node that is offline, it blocks the refresh buckets queue for 20 seconds. The refresh timer for the refresh buckets queue is an hour (i.e. buckets are added to the queue every hour). At 1 hour / 20 seconds, that's 180 buckets per hour if we attempt a connection to an offline node for every bucket (which is the case if we only have one node in our node graph and it's offline). Since there are 256 buckets, this means we won't get through all of the buckets within the hour, and buckets will begin to be added to the queue again at the same rate that they're removed. So the queue will hold 256 - 180 = 76 buckets forever (until the node in our node graph comes back online). I'm not sure if this blocks the entire event loop as well, in which case this is definitely a problem.
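The arithmetic above as a quick sanity check (values taken straight from the comment; this is just a worked restatement):

```ts
// Steady-state backlog of the refresh buckets queue when every bucket
// refresh hits a single offline node and blocks for the full timeout.
const connectionTimeoutS = 20; // forward proxy connection timeout
const refreshIntervalS = 3600; // buckets are re-enqueued every hour
const totalBuckets = 256; // one bucket per bit of the node ID space

const processedPerHour = refreshIntervalS / connectionTimeoutS; // 180
const steadyBacklog = totalBuckets - processedPerHour; // 76

console.log({ processedPerHour, steadyBacklog });
```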
The refresh bucket and ping node queues work asynchronously in the background, so I don't think it will block the event loop. The excessive contacting of nodes as part of the refresh bucket queue is not ideal, though. I think we do need to optimise this, but the problem is: how? We need a way to determine that we are not gaining any new information, and just stop refreshing buckets for a while. But right now the real problem is that we're attempting to contact the same offline node over and over again. Two things come to mind to address this.
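One shape this could take, sketched below with hypothetical names (not Polykey's actual API): track a backoff deadline per node, so repeated failures against the same offline node stop blocking the queue.

```ts
// Hypothetical per-node backoff tracker: after a failed connection a
// node is skipped until its deadline passes; each further failure
// doubles the delay, up to a one-hour cap.
class OfflineNodeTracker {
  protected entries = new Map<string, { retryAt: number; delayMs: number }>();

  public shouldAttempt(nodeId: string, now: number = Date.now()): boolean {
    const entry = this.entries.get(nodeId);
    return entry == null || now >= entry.retryAt;
  }

  public recordFailure(nodeId: string, now: number = Date.now()): void {
    const delayMs = Math.min(
      (this.entries.get(nodeId)?.delayMs ?? 1000) * 2,
      60 * 60 * 1000,
    );
    this.entries.set(nodeId, { retryAt: now + delayMs, delayMs });
  }

  public recordSuccess(nodeId: string): void {
    this.entries.delete(nodeId);
  }
}
```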
Just a note that more aggressive removal of nodes from the node graph, such as removing a node when we see it's offline, would fix this. However, this will lead to us removing nodes that may only be temporarily offline. Or worse, if a node's network goes down, it will clean out its entire NodeGraph.
Please create a new PR to tackle this issue; you may wish to incorporate the sub-issues within this epic too.
This takes over from MatrixAI/Polykey#159. The last few comments there are useful, see MatrixAI/Polykey#159 (comment) regarding any re-deployment of the testnet.
@emmacasolin please check MatrixAI/Polykey#148 in relation to these tests, and what needs to be done for that issue.
This is because these tests call out to the external network. Our check stage unit tests should not require external network access; those tests should pass even when offline.

In MatrixAI/Polykey#435 I'm proposing the use of directories to represent "allow lists". Because we now have groups of tests that we want to run during the check stage (even the cross-platform check stage) and groups of tests that we want to run as part of the integration stage, these tests are part of the integration stage, as they test the integration with … Our initial group should just be …

During integration testing (where each platform is tested), it will also test against the testnet on top of this. So for example @tegefaulkes, during the docker integration tests, it will not only be testing …

We also have nix integration testing, which should be testing …

Right now integration testing would mostly reuse tests from … Once this is set up, we can evaluate whether all the regular unit tests including …
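A sketch of the jest side of this split; the option names are jest's own, while the directory layout is the one proposed in this issue:

```ts
// jest.config.ts: exclude network-dependent suites from the default
// (check stage) run. The integration stage would invoke them
// explicitly, overriding the ignore pattern.
import type { Config } from '@jest/types';

const config: Config.InitialOptions = {
  preset: 'ts-jest',
  testPathIgnorePatterns: [
    '/node_modules/',
    '/tests/testnet/', // integration stage only: talks to the real testnet
  ],
};

export default config;
```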
Now these tests may still fail, so you need to write stubs for all the preconditions and postconditions, taking into account MatrixAI/Polykey#403 and also the changes we will be making in MatrixAI/Polykey#329. The PR for the issues within this epic should be targeting …

Finally, MatrixAI/Polykey#434 and MatrixAI/Polykey#432 should be done first and merged into staging before attempting this PR. Focus on making a start on all the test cases even if they will be failing for now; fixes should be pushed into staging, or assigned to MatrixAI/Polykey#419.
I added MatrixAI/Polykey#388 as a subtask of this.
As discussed just now: with the use of domains containing multiple … Given a node graph with the following mappings, we can end up with 4 cases that express these relationships.
The NG (NodeGraph) isn't aware of the gestalt graph, and neither is it aware of the certificate chain relationship, so it isn't aware of either axis. Both axes will be dealt with by other systems.
To fully support all of this we need to apply 2 things.
Ultimately the idea is that if we go looking for a node, we can find that specific node. Each node is unique, and what we're after when connecting to nodes is usually specific to that node; Kademlia and NAT punch-through depend on this. The problem with using NLBs here is that we end up with a situation where we have multiple nodes on the same IP and port, with no way to deliberately connect to one or the other. This is not so much a problem when entering the network, where we only care about contacting any node that we can trust. For any other interaction we depend on information and state unique to the node we're looking for.
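To illustrate the property being relied on, here's a sketch where `dialTls` and `nodeIdFromCertificate` are stand-ins, not real Polykey functions: a connection is only accepted if the certificate-derived NodeId matches the one we intended to dial, and an NLB fronting several nodes makes that check fail nondeterministically.

```ts
// Stand-in types/functions for the sketch; not Polykey's actual API.
interface TlsConn {
  peerCertificate: Uint8Array;
  close(): void;
}
declare function dialTls(host: string, port: number): Promise<TlsConn>;
declare function nodeIdFromCertificate(cert: Uint8Array): string;

// Dial an address and verify we reached the node we meant to reach.
// Behind an NLB the same host:port can terminate at any backend node,
// so expectedNodeId only matches some fraction of the time.
async function connectToNode(
  expectedNodeId: string,
  host: string,
  port: number,
): Promise<TlsConn> {
  const conn = await dialTls(host, port);
  const actualNodeId = nodeIdFromCertificate(conn.peerCertificate);
  if (actualNodeId !== expectedNodeId) {
    conn.close();
    throw new Error(
      `Dialled ${host}:${port} expecting ${expectedNodeId}, reached ${actualNodeId}`,
    );
  }
  return conn;
}
```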
For the infrastructure, the go-to structure is to set up an EC2 instance that runs a single node. To make this work with ECS we need to apply a few things.
MatrixAI/Polykey#488 leads us to refactor the TTLs and expiry for … See MatrixAI/Polykey#488 (comment). But basically, until MatrixAI/Polykey#365 is possible, it is necessary to special-case the seed nodes:
MatrixAI/Polykey#365 generalises this process to a random set of X nodes in the entire network.
I've added task 17 as well. @tegefaulkes, tomorrow (Tuesday) you'll want to finish MatrixAI/Polykey#483 and MatrixAI/Polykey#487, and address task 17 too.
The final manual test should involve the 2 nodes on the testnet: ensure that they are discovering each other on network entry, and do the NAT-to-CGNAT test.
The current architecture of multi-seed nodes doesn't actually shard any work between the nodes. This is due to a couple of reasons:
For sharding connections, this has to wait until MatrixAI/Polykey#365. For sharing signal relaying, we can do that now.
There is special casing now for the seed nodes:
During network entry, it's important to retry seed node connections too, but this can be done over time. This means that during sync node graph there has to be a background operation that repeats connection attempts to all the seed nodes. This can be done with a timer that starts by retrying every 1 second, backing off exponentially by doubling up to 20 seconds.
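A sketch of that background retry, with a hypothetical `connectToSeedNode` (the 1 second start and 20 second ceiling are the numbers from above):

```ts
// Hypothetical connection attempt; resolves true on success.
declare function connectToSeedNode(nodeId: string): Promise<boolean>;

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Background operation during sync node graph: keep retrying all
// seed nodes, starting at 1 second and doubling up to a 20 second cap.
async function retrySeedNodes(seedNodeIds: Array<string>): Promise<void> {
  const pending = new Set(seedNodeIds);
  let delayMs = 1000;
  while (pending.size > 0) {
    for (const nodeId of [...pending]) {
      if (await connectToSeedNode(nodeId)) pending.delete(nodeId);
    }
    await sleep(delayMs);
    delayMs = Math.min(delayMs * 2, 20000);
  }
}
```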
Those special cases will get removed when MatrixAI/Polykey#365 is done.
I've closed MatrixAI/Polykey#487. Copying over the conclusion from here.
So the priorities are now:
As per MatrixAI/Polykey#551, we now have a successful connection to a stable running testnet at testnet.polykey.com. It also makes sense to have a test suite for integrating with testnet.polykey.com here in the Polykey repo, but one that runs separately from the CI's main pipeline, so it doesn't block anything, because the testnet can be a bit flaky. However, as we go ahead it should become more stable. We should start writing some simple tests that can be run separately, and kept separate from …
I've changed the docker integration tests temporarily to simply run the image with …

What needs to be done: the problem at hand is that the tests are binding the agent socket to the localhost interface. This will need to be changed to an IPv4-supported wildcard interface (i.e. `0.0.0.0`).

Notes about container network behaviour: …
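The binding issue in isolation, using plain Node `dgram` rather than Polykey's own socket code (the port number here is arbitrary for the sketch):

```ts
import * as dgram from 'node:dgram';

const socket = dgram.createSocket('udp4');

// A socket bound to 127.0.0.1 inside a container is reachable only on
// the container's own loopback; traffic arriving via the docker bridge
// or a published port never reaches it:
//   socket.bind(55555, '127.0.0.1');

// Binding to the IPv4 wildcard accepts traffic on all interfaces,
// including the container's bridged interface, so remote nodes on the
// testnet can reach the agent.
socket.bind(55555, '0.0.0.0', () => {
  console.log('agent socket listening on', socket.address());
});
```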
Moving this to Polykey-CLI, since integration tests of this sort can only be done as a "process". Lighter integration tests should still live in Polykey, the library.
@tegefaulkes this issue can be closed once:
This can then re-enable integration tests in our integration jobs after deployment, and those nodes should be connecting to the testnet and doing all the tests. To do this, the docker integration tests need to bind to the wildcard address to avoid problems with connecting to the internet, i.e. the testnet. That means for now this issue is blocked on @amydevs completing #599. Also, I've removed the 2 sub-issues relating to NAT testing, because those would need to be done in the PKI, not here.
This epic is almost ready to be closed. @tegefaulkes focus on getting the integration tests cleaned up and working, and work with @amydevs to get the API call to …
To clarify: all tests prior to integration tests should be configured to not connect to any network at all, unless they're simulating a local network within the tests.
We're going to streamline how the integration tests work. This is going to be done with the following changes.
I'm going to make a new issue to track this work and add it to this epic.
There are some ideas for tests coming from the OP spec:
Also our current simulated NAT tests have been disabled for some time:
These tests can be adapted to the Polykey infrastructure to test it at scale. It might be more "maintainable" if we do it via AWS rather than simulating it locally, which puts a lot of constraints on the platform.
This is done now, except for one minor change that still needs to be made. I'll be creating an issue for that, as I can't deal with it now.
Specification
We need a suite of tests to cover interacting with a deployed agent, which we can do using testnet.polykey.io. These tests need to cover various different connection scenarios: …

These tests should go into their own subdirectory `tests/testnet` and should not be run with the other tests. They should be disabled in our jest config and should only run when explicitly called (which will happen during the integration stage of our pipelines).

Required tests:

- `tests/testnet/testnetConnection.test.ts`
- `tests/testnet/testnetPing.test.ts`
- `tests/testnet/testnetNAT.test.ts`
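A skeleton of what `tests/testnet/testnetConnection.test.ts` could look like; `TestAgent`, `createTestAgent` and the seed node constants are placeholders for whatever harness gets built, not Polykey's actual API:

```ts
// tests/testnet/testnetConnection.test.ts (sketch)
// Placeholder harness; swap in the real agent setup once it exists.
interface TestAgent {
  pingNode(nodeId: string): Promise<boolean>;
  stop(): Promise<void>;
}
declare function createTestAgent(options: {
  seedNodes: string;
}): Promise<TestAgent>;

const TESTNET_NODE_ID = '<seed node id>'; // placeholder
const TESTNET_SEED = `${TESTNET_NODE_ID}@testnet.polykey.io:1314`; // placeholder format

describe('testnet connection', () => {
  let agent: TestAgent;
  beforeAll(async () => {
    // Generous timeout: the testnet can be slow or flaky.
    agent = await createTestAgent({ seedNodes: TESTNET_SEED });
  }, 120000);
  afterAll(async () => {
    await agent.stop();
  });
  test('agent can reach the testnet seed node', async () => {
    expect(await agent.pingNode(TESTNET_NODE_ID)).toBe(true);
  }, 60000);
});
```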
Additional context

- testnet.polykey.io
- `Starting Connection Forward` infinite loop Polykey#413 - issue for a potential infinite loop when connecting to a deployed seed node

Tasks
- [ ] … testnet.polykey.io
- [ ] 3. Create tests for edge cases and previous bugs - most edge cases will go to the simulation suite
- [ ] 4. Create tests for connecting to a deployed seed node from behind a NAT - this can only be done as part of a simulation suite, since we don't control host firewalls
- [ ] 7. Add a logging debug filter to command line arguments to take a regular expression; should be a global option like `--log='/regex/'` - not relevant to integration testing; `js-logger` does support regex filtering, but the PK CLI currently doesn't have this option
- [ ] 8. Use https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_ContainerOverride.html to be able to easily try different debugging levels and filters - cannot use container overrides in services; it can be done as part of tasks, though we would have to redeploy the service with a different task definition each time
- [ ] 9. `agent status` command needs to display useful information like the Polykey version and other useful statistics such as active connections, number of node graph entries, etc.
- [ ] 10. `--client-host` needs to support host names - this is pending a change to being able to use `PolykeyClient` to connect to a host name, which would require using the DNS-SD SRV records. This still needs to be specced out, because in some cases you want to connect to a SINGLE node and in other cases you are "discovering" a node to connect to; but it's not relevant to this epic
- [ ] … `src/config.ts` - from "Infrastructure setup for testnet should automate multiple instances for multiple nodes" Polykey#488

Emergent bugs
- … `ConnectionReverse.start()` between `Starting Connection Reverse` and `Started Connection Reverse`. Start by testing locally.