Integration Tests for testnet.polykey.com #71

Closed
14 tasks done
emmacasolin opened this issue Jul 13, 2022 · 60 comments
Labels
development Standard development epic Big issue with multiple subissues r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices

Comments

@emmacasolin

emmacasolin commented Jul 13, 2022

Specification

We need a suite of tests that covers interacting with a deployed agent, which we can do using testnet.polykey.io. These tests need to cover several connection scenarios:

  • Connecting to a deployed agent as a seed node during startup
  • Pinging a deployed agent
  • Using the deployed agent as a signaller (and eventually a relay once this is implemented)
  • Tests for when the deployed agent is contacting an agent behind a NAT (NAT-Traversal Testing with testnet.polykey.io Polykey#159)
  • Any bugs that are discovered during the above tasks

These tests should go into their own subdirectory tests/testnet and should not be run with the other tests. They should be disabled in our jest config and should only run when explicitly called (which will happen during the integration stage of our pipelines).

Required tests:

  • tests/testnet/testnetConnection.test.ts
    • Can connect to the testnet
      • Within a reasonable amount of time
      • Without errors/shutting down the local agent
      • Without errors/shutting down the testnet
    • Can disconnect from the testnet
      • Within a reasonable amount of time
      • Without errors/shutting down the local agent
      • Without errors/shutting down the testnet
    • Can reconnect to the testnet
      • Able to handle different node ids (testnet is a cluster of nodes)
  • tests/testnet/testnetPing.test.ts
    • Can ping the testnet
      • Able to handle different node ids (testnet is a cluster of nodes)
    • Can ping another node via the testnet (signaling)
    • Can ping another node via the testnet (relay)
    • Can attempt to ping another node that doesn't exist
      • Without shutting down the testnet
  • tests/testnet/testnetNAT.test.ts
    • Can ping a node that is behind endpoint-independent NAT via the testnet
      • From a node that is not behind a NAT (DMZ)
      • From a node that is behind endpoint-independent NAT
      • From a node that is behind endpoint-dependent NAT
    • Can ping a node that is behind endpoint-dependent NAT via the testnet
      • From a node that is not behind a NAT (DMZ)
      • From a node that is behind endpoint-independent NAT
      • From a node that is behind endpoint-dependent NAT
  • Should also incorporate tests from Testnet Deployment Polykey#326 (comment)

Additional context

Tasks

  • 1. Attempt connections to the deployed seed node and create issues for all bugs discovered (and resolve them)
  • 2. Create tests for simple connections to testnet.polykey.io
    • 1 node connected to testnet.polykey.io and maintains connection
    • 2 nodes connected to testnet.polykey.io and can ping each other (they will have the same IP but different ports)
  • [ ] 3. Create tests for edge cases and previous bugs - most edge cases will go to the simulation suite
  • [ ] 4. Create tests for connecting to a deployed seed node from behind a NAT - this can only be done as part of a simulation suite, since we don't control host firewalls
  • 5. Finish off all diagrams as part of NAT testing Add diagrams to NAT Traversal tests Polykey#388
  • 6. Add new INFO level logs for situations where connections are going into stopping. This is next to the debug logs.
  • [ ] 7. Add a logging debug filter to command line arguments to take a regular expression. Should be global option like --log='/regex/'. - not relevant to integration testing, the js-logger does support REGEX filtering, but the PK CLI currently doesn't have this option.
  • [ ] 8. Use https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_ContainerOverride.html to be able to easily try different debugging levels and filters. - cannot use container overrides in services, can be done as part of tasks though, will have to just redeploy service with different task definition each time
  • [ ] 9. agent status command needs to display useful information like the polykey version and other useful statistics like active connections, number of node graph entries etc etc.
  • [ ] 10. --client-host needs to support host names. - this is pending a change to being able to use PolykeyClient to connect to a host name - which would require using the DNS-SD SRV records. This still needs to be specced out how this would work because in some cases you want to connect to a SINGLE Node, in other cases you are "discovering" a node to connect to, but it's not relevant to this epic.
  • 11. EC2 setup with idempotency
  • 12. Multi Node Setup on AWS
  • 13. Recovery Code Pool on AWS
  • 14. Multi-Host DNS resolution
  • 15. Multi Node Resolver
  • 16. NodeGraph KeyPath to lift Host and Port to the key path
  • 17. Put the trusted testnet seed nodes into the src/config.ts - from Infrastructure setup for testnet should automate multiple instances for multiple nodes Polykey#488

Emergent bugs

@emmacasolin emmacasolin added development Standard development epic Big issue with multiple subissues labels Jul 13, 2022
@emmacasolin emmacasolin self-assigned this Jul 13, 2022
@CMCDragonkai
Member

Seems like our tests aren't simulating long-running behaviour. This is unfortunate, because certain bugs only became apparent after running the nodes on the testnet for longer than 1 hr.

We might need to improve our modelling tooling to help with correctness, and apply monitoring and remote debugging tools to our testnet to enable observation of memory and CPU cycles. Node.js provides this via its debugging port, which could be enabled on our testnet nodes, possibly bootstrapping off the client port. This is related to MatrixAI/Polykey#412.

@emmacasolin
Author

I'm also noticing that the re-connection attempts discussed in MatrixAI/Polykey#413 and MatrixAI/Polykey#415 don't appear when I use the same setup locally (two local agents where the one that gets started first is set as a seed node for the second one), so even if we had tests to simulate long-running behaviour they may not have picked up these issues since they may be limited to the testnet environment.

@CMCDragonkai
Member

Also unlike our NAT tests which use conditional testing with describeIf and testIf, these tests are conditional on a certain stage in our pipeline. We may continue to use this technique rather than explicitly excluding the tests in the jest config (or using explicit group tagging https://stackoverflow.com/questions/50171932/run-jest-test-suites-in-groups and https://morioh.com/p/33c2bd031589).

This approach means you use describeIf as well, but then depend on a condition that is only available via environment variables. This would be similar to @tegefaulkes's work with the docker integration tests, which rely on a special test command environment variable.

We could re-use NODE_ENV or create our own env variable that is then funnelled into the jest.config.js as global parameters.
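
A minimal sketch of the environment-gated describeIf idea, for illustration only. The variable name PK_TEST_STAGE and the helpers isIntegrationStage and pickIf are assumptions made up for this example; the thread has not settled on a variable name.

```typescript
// Assumption: PK_TEST_STAGE is a hypothetical variable name, not one
// that Polykey currently defines.
type Env = Record<string, string | undefined>;

// True only when the pipeline has explicitly selected the integration stage.
function isIntegrationStage(env: Env): boolean {
  return env['PK_TEST_STAGE'] === 'integration';
}

// Generic form of describeIf/testIf: return the real runner when the
// condition holds, otherwise the skipping variant.
function pickIf<T>(condition: boolean, run: T, skip: T): T {
  return condition ? run : skip;
}

// Inside a jest test file (where describe is a global) this could be used as:
//   const describeIf = (c: boolean) => pickIf(c, describe, describe.skip);
//   describeIf(isIntegrationStage(process.env))('testnet', () => { ... });
```

Keeping the condition in one function means the pipeline only has to set a single variable for the integration stage, and every other stage skips these suites by default.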

@CMCDragonkai
Member

CMCDragonkai commented Jul 14, 2022

The only difference in the testnet environment is running in a docker container. If the docker container is producing different behaviour, that has to be illuminated with the docker integration testing MatrixAI/Polykey#407.


Caveat: when using testnet.polykey.io you are using NLBs as well... so that adds extra complexity. But this is why we are sticking to the public IPs first.

@CMCDragonkai
Member

CMCDragonkai commented Jul 14, 2022

@emmacasolin you can always run your own docker container locally and set that up as your local "seed node" and have it continuously run while you write tests against it.

Just make sure to feed it all the required env variables, or the parameters and mount the necessary namespaces. Meet with @tegefaulkes about this, he's already doing this currently.

@CMCDragonkai
Member

Also unlike our NAT tests which use conditional testing with describeIf and testIf, these tests are conditional on a certain stage in our pipeline. We may continue to use this technique rather than explicitly excluding the tests in the jest config (or using explicit group tagging https://stackoverflow.com/questions/50171932/run-jest-test-suites-in-groups and https://morioh.com/p/33c2bd031589).

This approach means you use describeIf as well, but then depend on a condition that is only available via environment variables. This would be similar to @tegefaulkes's work with the docker integration tests, which rely on a special test command environment variable.

We could re-use NODE_ENV or create our own env variable that is then funnelled into the jest.config.js as global parameters.

If we continue down the path of reusing describeIf, it would be optimal to move our imports to be asynchronous under the describe. That would be similar to our bin commands where we use dynamic imports. Problem is, describe doesn't support asynchronous callbacks. A long term solution is to use top-level await when it becomes available in jest: jestjs/jest#2235 (comment)

@emmacasolin
Author

I'm also noticing that the re-connection attempts discussed in MatrixAI/Polykey#413 and MatrixAI/Polykey#415 don't appear when I use the same setup locally (two local agents where the one that gets started first is set as a seed node for the second one), so even if we had tests to simulate long-running behaviour they may not have picked up these issues since they may be limited to the testnet environment.

This is actually incorrect. I left my local setup running in the background for several hours, and when I checked back both agents had attempted multiple node connections to the other agent (so many that the logs were cut off). At one point when I checked, one of the agents was displaying these logs (rapidly) and the other was silent, but now both of them are silent. This might come from the refresh buckets queue, if that is something that happens every hour. I'm going to try reducing the time between refreshes and adding timestamps to the logs.

As a side note, I think these logs appear to be infinite and constant on the testnet because it's a lot slower than my local machine, so it's only able to attempt a connection every 20 seconds, and since it takes so long to get through them it's refreshing them again by the time it's finished.

@emmacasolin
Author

I think I know what the main issue causing our "infinite loop" is. The timeout for opening a forward proxy connection is 20 seconds, so if we try to connect to a node that is offline it blocks the refresh buckets queue for 20 seconds. The refresh timer for the refresh buckets queue is an hour (i.e. buckets are added to the queue every hour). 1 hour / 20 seconds is 180 nodes per hour if we try to connect to an offline node for every bucket (which is the case if we only have one node in our node graph and it's offline). Since there are 256 buckets, this means we won't get through all of the buckets within the hour, and buckets will begin to be added to the queue again at the same rate that they're removed. So the queue will have 256-180=76 buckets in it forever (until the node in our node graph comes back online).

I'm not sure if this blocks the entire event loop as well, in which case this is definitely a problem.
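
The queue arithmetic above can be checked with a few lines. This is only a worked calculation using the numbers quoted in this comment (1 hour refresh, 20 second timeout, 256 buckets); the variable names are made up for the example.

```typescript
const refreshIntervalSeconds = 60 * 60; // buckets are re-queued every hour
const connectTimeoutSeconds = 20; // forward proxy connection timeout
const bucketCount = 256;

// Worst case: every bucket refresh blocks for the full timeout.
const attemptsPerInterval = refreshIntervalSeconds / connectTimeoutSeconds;

// Buckets still queued when the next hourly refresh starts.
const backlog = Math.max(0, bucketCount - attemptsPerInterval);

// Largest timeout that would still drain all buckets within one interval.
const maxDrainingTimeoutSeconds = refreshIntervalSeconds / bucketCount;

console.log({ attemptsPerInterval, backlog, maxDrainingTimeoutSeconds });
```

With these inputs the worst case processes 180 buckets per hour and leaves a backlog of 76, and the timeout would need to be roughly 14 seconds or less for the queue to drain before the next refresh.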

@tegefaulkes
Contributor

The refresh bucket and ping node queues work asynchronously in the background. I don't think it will block the event loop.

The excessive contacting of nodes as part of the refresh bucket queue is not ideal. I think we do need to optimise this but the problem is, how? We need a way to determine if we are not gaining any new information and just stop refreshing buckets for a while. But right now the real problem is that we're attempting to contact the same offline node over and over again. Right now two things come to mind to address this.

  1. Just use a smaller timeout. 20 seconds seems a bit much.
  2. Track the last time we tried to contact a node and maybe the attempts. If we keep trying to contact it then we can skip it for a period of time.

Just a note that more aggressive removal of nodes from the node graph, such as removing a node if we see it's offline, would fix this. However, this will lead to us removing nodes that may be temporarily offline. Or worse, if a node's network goes down it will clean out its NodeGraph.
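
A rough sketch of option 2 (tracking the last contact attempt and the number of failures, then skipping the node for a while). The class name ContactBackoff, its methods and the default delays are all hypothetical; nothing here is taken from the Polykey codebase.

```typescript
type NodeIdString = string;

// Hypothetical tracker: remembers failed contact attempts per node and
// tells the caller when a node should be skipped for a while.
class ContactBackoff {
  protected failures = new Map<
    NodeIdString,
    { count: number; lastAttempt: number }
  >();

  constructor(
    protected baseDelayMs: number = 20_000,
    protected maxDelayMs: number = 60 * 60 * 1000,
  ) {}

  // Record a failed contact attempt made at time `now` (milliseconds).
  recordFailure(nodeId: NodeIdString, now: number): void {
    const entry = this.failures.get(nodeId);
    this.failures.set(nodeId, {
      count: (entry?.count ?? 0) + 1,
      lastAttempt: now,
    });
  }

  // A successful contact clears the failure history.
  recordSuccess(nodeId: NodeIdString): void {
    this.failures.delete(nodeId);
  }

  // Skip while `now` is inside the back-off window, which doubles with
  // each consecutive failure up to `maxDelayMs`.
  shouldSkip(nodeId: NodeIdString, now: number): boolean {
    const entry = this.failures.get(nodeId);
    if (entry == null) return false;
    const delay = Math.min(
      this.baseDelayMs * 2 ** (entry.count - 1),
      this.maxDelayMs,
    );
    return now - entry.lastAttempt < delay;
  }
}
```

Unlike aggressive removal, this keeps the offline node in the node graph, so a node that is only temporarily offline is retried once its back-off window expires.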

@CMCDragonkai
Member

Please create a new PR to tackle this issue; you may wish to incorporate the subissues within this epic too.

@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Jul 25, 2022
@CMCDragonkai CMCDragonkai changed the title Tests for testnet.polykey.io Integration Tests for testnet.polykey.io Jul 29, 2022
@CMCDragonkai
Member

CMCDragonkai commented Jul 29, 2022

This takes over from MatrixAI/Polykey#159. The last few comments there are useful (MatrixAI/Polykey#159 (comment)) regarding any re-deployment of the testnet.

@CMCDragonkai
Member

@emmacasolin please check MatrixAI/Polykey#148 in relation to these tests, what needs to be done for that issue.

@CMCDragonkai
Member

These tests should go into their own subdirectory tests/testnet and should not be run with the other tests. They should be disabled in our jest config and should only run when explicitly called (which will happen during the integration stage of our pipelines).

This is because these tests call out to the external network. Our check stage unit tests should not require external network access for running those tests, as in those tests should pass even when offline.

In MatrixAI/Polykey#435 I'm proposing the use of directories to represent "allow lists".

We now have groups of tests that we want to run during the "check stage" (even the cross-platform check stage) and groups of tests that we want to run as part of the "integration stage". These tests belong to the integration stage, as they test the integration with testnet.polykey.io.

Our initial group should just be tests/integration to indicate integration tests. In there, for this epic, we should have tests/integration/testnet.

During integration testing (which tests each platform), the testnet tests will run on top of the platform tests.

So for example @tegefaulkes during the docker integration tests, it will not only be testing tests/integration/docker, but also tests/integration/testnet. While for windows it would be tests/integration/windows and tests/integration/testnet.

We also have nix integration testing, which should be testing tests/integration/nix and tests/integration/testnet.

Right now integration testing would mostly reuse tests from tests/bin (and until MatrixAI/Polykey#435 is resolved, it cannot really change to testing tests/integration/testnet).

Once this is set up, we can evaluate whether all the regular unit tests including tests/bin should be moved down one directory to tests/check.
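
As a sketch of how the directory "allow lists" could be expressed, the function below returns a partial jest configuration per stage. testMatch and testPathIgnorePatterns are existing jest options, but the stage names, the function stageConfig and the exact glob patterns are assumptions for illustration, not the project's actual configuration.

```typescript
type Stage = 'check' | 'integration';

type PartialJestConfig = {
  testMatch: Array<string>;
  testPathIgnorePatterns: Array<string>;
};

// Hypothetical helper: build the test selection for a pipeline stage.
// For the integration stage, `platform` is a directory name such as
// 'docker', 'windows' or 'nix' under tests/integration.
function stageConfig(stage: Stage, platform?: string): PartialJestConfig {
  if (stage === 'check') {
    // Check stage: run everything except tests that need the network.
    return {
      testMatch: ['<rootDir>/tests/**/*.test.ts'],
      testPathIgnorePatterns: ['/node_modules/', '<rootDir>/tests/integration/'],
    };
  }
  // Integration stage: the platform directory (if any) plus the testnet.
  const testMatch = ['<rootDir>/tests/integration/testnet/**/*.test.ts'];
  if (platform != null) {
    testMatch.unshift(`<rootDir>/tests/integration/${platform}/**/*.test.ts`);
  }
  return { testMatch, testPathIgnorePatterns: ['/node_modules/'] };
}
```

For example, stageConfig('integration', 'docker') selects tests/integration/docker and tests/integration/testnet, which matches the docker example given above.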

@CMCDragonkai
Member

Now these tests may still fail, so you need to write stubs for all the preconditions and postconditions, taking into account MatrixAI/Polykey#403, and also the changes we will be making in MatrixAI/Polykey#329.

The PR for the issues within this epic should be targeting staging branch but should be cherry picking changes that are occurring in MatrixAI/Polykey#419 as that's where the initial new DB will be applied and many concurrency issues resolved. Myself and @tegefaulkes will be focusing on MatrixAI/Polykey#419 while @emmacasolin will be working on this issue.

Finally MatrixAI/Polykey#434 and MatrixAI/Polykey#432 should be done first and merged into staging before attempting this PR.

Focus on making a start on all the test cases even if they will be failing for now, fixes should be pushed into staging, or assigned to MatrixAI/Polykey#419.

@CMCDragonkai
Member

I added MatrixAI/Polykey#388 as a subtask of this.

@tegefaulkes
Contributor

As discussed just now.

With the use of domains containing multiple A records, e.g. our testnet at testnet.polykey.io, it seems evident that the mapping of NodeIds to IPs is a many-to-many relationship.

Given a node graph with the following mappings, we can end up with 4 cases that express these relationships.

NodeGraph<NodeID, Host> {
  // C1 The same node ID on different IPs (NOT POSSIBLE)
  NID2 -> IP1,
  NID2 -> IP2,
  // C2 Multiple NIDs on the same IP
  NID5 -> IP3,
  NID6 -> IP3,
  // C3 The same node ID on different hostnames (NOT POSSIBLE)
  NID1 -> HOSTNAME1,
  NID1 -> HOSTNAME2,
  // C4 Multiple NIDs on the same host name
  NID3 -> HOSTNAME3,
  NID4 -> HOSTNAME3,
  // It's possible to have unions of all 4 cases
  NID7 -> IP4,
  NID7 -> IP5,
  NID7 -> HOSTNAME1,
  NID8 -> IP4,
  NID8 -> HOSTNAME2,
  NID9 -> IP4,
  NID9 -> HOSTNAME1,
  NID9 -> HOSTNAME2
};

The NG isn't aware of the gestalt graph. And neither is it aware of certificate chain relationship. So it's not aware of both axes. Both axes will be dealt with by other systems.

  • The gestalt graph relationship is dealt with at the "social network level".
  • The certificate chain relationship is dealt with at the "TLS level".
  • The node graph deals with just NID -> HOST resolution. But it can be complex here.
  • Connection Establishment means we connect and Certificate verification passes
  • C1: is one node to many addresses.

    • Look up NID2
    • Race the connection establishment to IP1 and IP2 concurrently
    • Only the first one wins.
    • All other connections are discarded even if they succeeded.
  • C2: is many nodes to one address.

    • Look up NID5 (you ignore NID6)
    • Connection establishment to IP3
    • It wins or loses
  • C3: is one node to many host names.

    • Look up NID1
    • Look up HOSTNAME1 and HOSTNAME2
    • Combine HOSTNAME1.IPs and HOSTNAME2.IPs
    • Recursion to C1
  • C4: is many nodes to one host name.

    • Look up NID3 (you ignore NID4)
    • Look up HOSTNAME3
    • You get HOSTNAME3.IPs
    • Recursion to C1

To fully support all of this we need to do 2 things.

  1. When establishing a connection to a node, if we are given a host name/domain we need to resolve this to one or more IPs. With this set of IPs we need to attempt connections to each one with some degree of concurrency. When we find the node we are after we use that connection. All other connections should be ignored, either cancelled, failing to connect or rejected based on verification.
  2. For general network entry we need to support connecting to a hostname/domain with the expectation that we find one or more of a set of provided nodes. So we can connect to a set of nodes on a single domain such as the testnet.
For the user to attempt a connection to multiple [NID7, NID8, NID9] with a set of IPs/connections.
--seed-nodes="[email protected]:1314;[email protected]:1314"

  // NETWORK ENTRY
  Promise.allSettled([
    nodeManager.getConnection(nid1),
    nodeManager.getConnection(nid2),
  ]);

Ultimately the idea is that if we go looking for a node we can find that specific node. Each node is unique and what we're after when connecting to nodes is reasonably specific to that node. Kademlia and NAT punch-through depends on this.

The problem with using NLBs here is that we end up with a situation where we have multiple nodes on the same IP and port with no way to intentionally connect to one or the other. This is not so much a problem when entering the network, where we only care about contacting any node that we can trust. For any other interaction we depend on information and state unique to the node we're looking for.
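
The racing of connection attempts described in point 1 above (and in case C1) could look roughly like the following. This is a sketch under assumptions: raceConnections, the connect callback and the discard callback are hypothetical names, and connect is assumed to reject when either the connection fails or certificate verification shows a different node ID.

```typescript
type Address = { host: string; port: number };

// Resolves with the first successful connection and rejects only when
// every attempt has failed. Connections that succeed after the winner
// are passed to `discard` so the caller can close them.
function raceConnections<C>(
  addresses: Array<Address>,
  connect: (address: Address) => Promise<C>,
  discard: (connection: C) => void = () => {},
): Promise<C> {
  return new Promise<C>((resolve, reject) => {
    if (addresses.length === 0) {
      reject(new Error('No addresses to try'));
      return;
    }
    let settled = false;
    let failures = 0;
    for (const address of addresses) {
      connect(address).then(
        (connection) => {
          if (settled) {
            // A slower attempt succeeded after the winner: discard it.
            discard(connection);
            return;
          }
          settled = true;
          resolve(connection);
        },
        () => {
          failures += 1;
          if (!settled && failures === addresses.length) {
            settled = true;
            reject(new Error('All connection attempts failed'));
          }
        },
      );
    }
  });
}
```

For cases C3 and C4 the addresses array would be the combined IPs resolved from the host names, after which the same race applies.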

@tegefaulkes
Contributor

For the infrastructure, the go-to structure is that we set up an EC2 instance that runs a single node. To make this work with ECS we need to apply a few things.

  1. The EC2 needs a role with the permission to allow it to work with the ECS.
  2. The EC2 instances need to have some configuration applied with the user data field. This is a small script that sets a variable for the ECS network in a config.
  3. The EC2 instance needs to use an AWS AMI (Amazon Machine Image).
  4. This all needs to share a security group to allow connections within the network. Not important for a single instance.
  5. This needs to share a VPC to allow connections between each other.

@CMCDragonkai
Member

CMCDragonkai commented Oct 31, 2022

MatrixAI/Polykey#488 leads us to refactor the TTLs and expiry for NodeConnectionManager and NodeGraph.

See: MatrixAI/Polykey#488 (comment)

But basically until MatrixAI/Polykey#365 is possible, it is necessary to special case the seed nodes:

  1. They must always maintain their connection (specifically proxy connection), so since closing node connection GRPC closes the proxy connection, the node connection TTL must be disabled for these connections. This is necessary to ensure that their NAT external port is maintained (ensuring some address stability), but also ensure that the seed nodes can contact them back if they are behind a NAT for the purposes of signalling hole punch messages.
  2. Their NodeId is not allowed to be removed from the NodeGraph. They must always be preferred to exist even if they aren't responding. Network entry may need to repeatedly try to maintain connection with these nodes. That means a loop that regularly attempts pings on the seed node list is necessary.

MatrixAI/Polykey#365 generalises this process to a random set of X number of nodes in the entire network.

@CMCDragonkai
Member

I've added task 17. as well. @tegefaulkes tomorrow Tuesday, you want to finish MatrixAI/Polykey#483 and MatrixAI/Polykey#487 and address task 17. too.

@CMCDragonkai
Member

The final manual test should involve the 2 nodes on the testnet, ensure that they are discovering each other on network entry, and do the NAT to CGNAT.

@CMCDragonkai
Member

The current architecture of multi-seed nodes doesn't actually shard any work between the nodes. This is due to a couple of reasons:

  1. Without decentralised NAT hole punching, every node connects to every seed node. Thus there's no sharding of proxy connections between seed nodes. This is also necessary because one has to maintain a connection to the seed node, for their public node address to be accurate due to NAT mapping timeouts.
  2. Furthermore during any connection between nodes, their ICE procedure would send a relay signalling message to all seed nodes, and all seed nodes will relay that message to the target node, the target node will then receive multiple relay messages, and coalesce them all via connection locking. Thus there's no sharding of relaying operations between seed nodes. This can be changed through client-side round-robin/random load balanced relay messages. This can work because it is expected that all seed nodes have a connection to all other nodes, and thus any seed node is capable of doing the signal relay operation.

Sharding connections has to wait until MatrixAI/Polykey#365. Sharding signal relaying is something we can do now.
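
A small sketch of the client-side load balancing mentioned in point 2, showing both a random and a round-robin choice of relay. The function names pickRelay and roundRobin are invented for this example, and it assumes (as stated above) that every seed node can relay to every target.

```typescript
// Pick one seed node to relay a signalling message, instead of sending
// the message to all seed nodes. `random` is injectable so the choice
// can be made deterministic in tests.
function pickRelay<T>(
  seedNodes: ReadonlyArray<T>,
  random: () => number = Math.random,
): T {
  if (seedNodes.length === 0) throw new Error('No seed nodes available');
  const index = Math.min(
    Math.floor(random() * seedNodes.length),
    seedNodes.length - 1,
  );
  return seedNodes[index] as T;
}

// Round-robin alternative: each call returns the next seed node in order.
function roundRobin<T>(seedNodes: ReadonlyArray<T>): () => T {
  if (seedNodes.length === 0) throw new Error('No seed nodes available');
  let next = 0;
  return () => {
    const node = seedNodes[next] as T;
    next = (next + 1) % seedNodes.length;
    return node;
  };
}
```

Either strategy means the target node receives one relay message per connection attempt rather than one per seed node.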

@CMCDragonkai
Member

There is special casing now for the seed nodes:

  1. When connecting to seed nodes, no relay messages are sent.
  2. Seed nodes never get removed from the node graph. (This means if seed nodes change, they must get a version update in software).
  3. Seed node connections never get TTLed (and thus their proxy connections are maintained).

During network entry, it's important to retry seed node connections too, but this can be done over time. This means that during sync node graph there has to be a background operation that repeats connection attempts to all the seed nodes. This can be done with a timer that retries the seed node connections starting at 1 second, with an exponential backoff that doubles the delay up to 20 seconds.
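
The retry schedule described above (start at 1 second, double each time, cap at 20 seconds) can be sketched as follows. retryDelayMs and retryWithBackoff are hypothetical helpers, and the bounded maxAttempts is an assumption made so the example terminates; the background operation described here would keep retrying.

```typescript
// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, 8s, 16s,
// then capped at 20s.
function retryDelayMs(attempt: number, baseMs = 1000, capMs = 20_000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Retry `connect` until it succeeds or `maxAttempts` (at least 1) is
// reached. `sleep` is injectable so tests do not have to wait.
async function retryWithBackoff<T>(
  connect: () => Promise<T>,
  maxAttempts: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (e) {
      lastError = e;
      // No need to wait after the final failed attempt.
      if (attempt < maxAttempts - 1) await sleep(retryDelayMs(attempt));
    }
  }
  throw lastError;
}
```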

@tegefaulkes

@CMCDragonkai
Member

Those special cases will get removed when MatrixAI/Polykey#365 is done.

@CMCDragonkai
Member

I've closed MatrixAI/Polykey#487. Copying over the conclusion from here.

This can be closed now. We know a couple of things:

  1. It is not possible to connect to a node on the same network without Local Network Traversal - Multicast Discovery js-mdns#1. Thus any connection tests from the same network are bound to fail.
  2. Local NAT simulation tests are working now again according to @tegefaulkes in ci: merge staging to master Polykey#474.
  3. There are still problems with the testnet nodes failing when automated testnet connection tests terminate/finish the agent process.
  4. Network is still flaky and causes timeout errors as per ci: merge staging to master Polykey#474.
  5. We know that NAT to CGNAT works. And seed nodes can contact each other.

A final test is required involving NAT to CGNAT and the 2 seed nodes together. In total 4 nodes should be tested. However, with the number of failures, we're going to be blocked on this until we really simplify our networking and RPC code.

So the priorities are now:

  1. Testnet nodes should not crash when we terminate connections from connecting nodes.
  2. Testnet nodes should not randomly crash.
  3. Do Local Network Traversal - Multicast Discovery js-mdns#1 so that we can connect to nodes within the same network.
  4. Complete a 4-node manual test with 2 seed nodes and 2 connecting nodes.

@CMCDragonkai CMCDragonkai transferred this issue from MatrixAI/Polykey Oct 19, 2023
@CMCDragonkai CMCDragonkai changed the title Integration Tests for testnet.polykey.io Integration Tests for testnet.polykey.com Oct 19, 2023
@CMCDragonkai CMCDragonkai transferred this issue from MatrixAI/Polykey-CLI Oct 19, 2023
@CMCDragonkai
Member

As per MatrixAI/Polykey#551, we now have a successful connection to a stable running testnet in testnet.polykey.com.

It also makes sense to have a test suite for integrating to testnet.polykey.com here in Polykey repo, but that runs separately to the CI's main pipeline, so it doesn't block anything because the testnet can be a bit flaky. However as we go ahead it should become more stable.

We should start writing some simple tests that can be run separately from the npm run test script. One way is grouping, another is just by directory. If we go by directory, it would be important to move all of the unit tests into a subdirectory.

@amydevs
Contributor

amydevs commented Dec 8, 2023

I've changed the docker integration tests temporarily to simply run the image with docker run. This is so that we can get a github release with the working binary executables, otherwise integration:docker will fail on some tests.

What needs to be done:

The problem at hand is that the tests bind the agent socket to the localhost interface. This needs to change to a wildcard interface that supports IPv4 (::, 0.0.0.0, etc.). The tests time out because the node ChildProcess is unable to kill the agent once the program that docker is running has already crashed. The process crashes in these tests because they attempt to connect to the testnet while bound to a localhost interface, so the agent cannot send packets to any globally routable IP address. Hence, when given the globally routable IP address of a testnet node, the send throws EINVAL, indicating that a globally routable address is an invalid argument for a send on a socket that is bound to a localhost interface.
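
One way the tests could catch this earlier is a guard on the bind host before the agent starts. The helpers isLoopbackHost and bindHostProblem below are hypothetical and only cover the common loopback spellings; this is a sketch, not a complete address classifier.

```typescript
// True for IPv4 127.0.0.0/8 addresses, IPv6 ::1 and the name "localhost".
function isLoopbackHost(host: string): boolean {
  return (
    host === 'localhost' ||
    host === '::1' ||
    /^127\.\d{1,3}\.\d{1,3}\.\d{1,3}$/.test(host)
  );
}

// Returns a reason when an agent bound to `bindHost` would be unable to
// reach the testnet, or undefined when the bind host looks usable.
function bindHostProblem(bindHost: string): string | undefined {
  if (isLoopbackHost(bindHost)) {
    return (
      `Agent bound to ${bindHost} cannot send to globally routable ` +
      'addresses; bind to 0.0.0.0 or :: instead'
    );
  }
  return undefined;
}
```

A test setup step could fail fast with this message instead of waiting for the EINVAL and the subsequent timeout.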

Notes about container network behaviour:

[Diagram: container network behaviour (Excalidraw image, not reproduced here)]

@CMCDragonkai
Member

Moving this to Polykey-CLI since integration tests of this sort can only be done as a "process".

Although lighter integration tests should still be in PK the library.

@CMCDragonkai CMCDragonkai transferred this issue from MatrixAI/Polykey Dec 8, 2023
@CMCDragonkai
Member

@tegefaulkes this issue can be closed once:

  1. In the CI job for PK-CLI we introduce polling calls to Polykey-Network-Status to ask if the currently distributed/released image version has been deployed for testnet.polykey.com to then run the integration tests.
  2. This means @amydevs you want to expose the currently deployed version on the API of PK-N-S.
  3. If the version isn't deployed in sufficient time... then that would block the rest of the CI, that's ok. The pipeline will timeout and we would restart the pipeline afterwards.

This can then re-enable integration tests in our integration jobs after deployment. And those nodes should be connecting to the testnet and doing all the tests.
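
The polling step in the CI job could be sketched like this. The API has not been specced yet, so fetchDeployedVersion is a hypothetical callback standing in for whatever call Polykey-Network-Status ends up exposing, and the interval and timeout values are placeholders.

```typescript
// Poll until the deployed version equals `expected`, or throw once the
// timeout would be exceeded. `sleep` and `now` are injectable for tests.
async function waitForDeployedVersion(
  expected: string,
  fetchDeployedVersion: () => Promise<string | undefined>,
  options: { intervalMs: number; timeoutMs: number },
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
  now: () => number = Date.now,
): Promise<void> {
  const deadline = now() + options.timeoutMs;
  while (true) {
    let deployed: string | undefined;
    try {
      deployed = await fetchDeployedVersion();
    } catch {
      // Treat a failed status request as "not deployed yet".
      deployed = undefined;
    }
    if (deployed === expected) return;
    if (now() + options.intervalMs > deadline) {
      throw new Error(`Timed out waiting for version ${expected}`);
    }
    await sleep(options.intervalMs);
  }
}
```

If the function throws, the CI job fails and the pipeline can be restarted later, which matches point 3 above.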

To do this, the docker integration tests need to bind to a wildcard address to avoid problems with connecting to the internet, and therefore to the testnet.

That means for now, this issue is blocked on @amydevs completing #599.

Also I've removed the 2 subissues relating to NAT testing, because those would need to be done in the PKI - not here.

@CMCDragonkai
Member

This epic is almost ready to be closed. @tegefaulkes focus on getting the integration tests cleaned up and working. And work with @amydevs to get the API call to testnet.polykey.com/api to be able to know what the current version is. Work out a spec for what the API should return, and how you would know. As well as timeout - sufficient time for deployment.

@CMCDragonkai
Member

To clarify all tests prior to integration tests should be configured to not connect to any network at all, unless it's simulating a local network within the tests.

@tegefaulkes
Contributor

tegefaulkes commented Dec 18, 2023

We're going to streamline how the integration tests work. This is going to be done with the following changes.

  1. All the standard tests will not attempt connections to any network. They should explicitly be started with no seed nodes.
  2. The integration tests will be separated from the standard tests.
    a. remove usage of the testif utility from the standard tests.
    b. integration tests will be in a separate folder structure from the standard tests.
    c. integration tests will focus on connecting to the testnet.
  3. integration tests need to wait for the testnet to be updated. to this end the following changes are to be made.
    a. The CLI CI job for triggering the testnet seed nodes using the new image will be removed. This will be handled by the testnet infrastructure.
    b. A job will be created to wait for the seed nodes to switch to the new version. This will be done by polling a polykey dashboard endpoint that will either return when all seed nodes have been updated, or list the versions of all the seed nodes. This will need to be specced out.

I'm going to make a new issue to track this work and add it to this epic.

@CMCDragonkai
Member

There are some ideas for tests coming from the OP spec:

  1. Functionally disconnecting from the testnet seed nodes and reconnecting to it.
  2. Using tc or firewall rules to break the connection to a particular node, and then seeing how PK reacts to that, and also re-enabling a few seconds later.

@CMCDragonkai
Member

CMCDragonkai commented Dec 18, 2023

Also our current simulated NAT tests have been disabled for some time:

~/Projects/Polykey-CLI/tests/nat (staging)
$ tree .
.
├── DMZ.test.ts
├── endpointDependentNAT.test.ts
├── endpointIndependentNAT.test.ts
└── utils.ts

1 directory, 4 files

These tests can be adapted to a Polykey Infrastructure to test it at scale. It might be more "maintainable" if we do it via AWS rather than simulating it locally, which has a lot of constraints on the platform.

@tegefaulkes
Contributor

This is done now except for 1 minor change that still needs to be done. I'll be creating an issue for that as I can't deal with it now.


4 participants