
Setting up static sharding fleet for Status #1914

Closed
jm-clius opened this issue Aug 15, 2023 · 17 comments
Assignees
Labels
E:Targeted Status Communities dogfooding See https://github.com/waku-org/pm/issues/97 for details

Comments

@jm-clius
Contributor

jm-clius commented Aug 15, 2023

Background

This issue tracks the work necessary to set up a static sharding fleet for Status Communities. This forms part of the 10K users epic.
It involves:

Why is this issue not in an infra repo?

Some requirements need to be agreed upon before opening an issue in the relevant infra repo.

Requirements

I suggest the following configuration:

  • Fleet of (at least) 10 nodes, split into two sub-fleets:
    • status.sharding.store: 5 nodes configured only with relay and store. These will be the store/historical message providers
    • status.sharding.bootstrap: 5 nodes configured with relay, filter, lightpush, peer-exchange. These are the main bootstrap nodes and also provide services to resource-restricted nodes.
  • all nodes should be configured only for the Status Internal CC Community static shards
  • status.sharding.store should be configured with a single, shared PostgreSQL backend
  • status-go nodes should preferably only use the bootstrap nodes to prime their discv5 routing tables, as a minimal mechanism to limit unnecessary interaction with store nodes.
  • no websockets configuration for now
  • 1 node in status.sharding.bootstrap should be set up with trace-level message logs, in order to facilitate future end-to-end message tracing and debugging.

Tracking Issue: status-im/infra-status#2

@jm-clius
Contributor Author

Some questions on the above:

  • @richard-ramos can you take a look if this approach makes sense, especially splitting the fleet between the bootstrap nodes and the store nodes? Would it be possible/easier to e.g. set up two different DNS node trees - one for store and one for bootstrapping? The status-go nodes would then populate their store providers with one query and use another query to bootstrap connection to the network? The alternative is to simply use a single DNS list retrievable via a single query, and then use the already-existing capability differentiation to populate store nodes and bootstrap to the remaining nodes.
  • @richard-ramos do we already have an idea of what specific shards we'd use for the first (internal CCs?) Status Community?
  • @Ivansete-status we probably want to prioritise the final PostgreSQL deployment for wakuv2.sharding and use that as a blueprint to create something similar for status.sharding.store.
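The "capability differentiation" alternative from the first question above could be sketched roughly as follows. Note this is an illustration only: `DiscoveredNode`, its fields, and `partitionByCapability` are hypothetical placeholders, not actual go-waku or status-go types.

```go
package main

import "fmt"

// DiscoveredNode is a hypothetical stand-in for a peer record returned by
// DNS discovery, carrying capability flags as advertised in its ENR.
type DiscoveredNode struct {
	Addr  string
	Store bool // supports the store protocol
	Relay bool
}

// partitionByCapability splits a single discovered-node list into store
// providers and general bootstrap peers, mirroring the "single DNS query
// plus capability differentiation" alternative described above.
func partitionByCapability(nodes []DiscoveredNode) (storeNodes, bootstrapNodes []DiscoveredNode) {
	for _, n := range nodes {
		if n.Store {
			storeNodes = append(storeNodes, n)
		} else {
			bootstrapNodes = append(bootstrapNodes, n)
		}
	}
	return storeNodes, bootstrapNodes
}

func main() {
	nodes := []DiscoveredNode{
		{Addr: "store-01.status.sharding", Store: true, Relay: true},
		{Addr: "boot-01.status.sharding", Store: false, Relay: true},
	}
	store, boot := partitionByCapability(nodes)
	fmt.Println(len(store), len(boot)) // 1 1
}
```

With this approach a single DNS query suffices, and the client decides locally which nodes to use as store providers versus bootstrap peers.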

@richard-ramos
Member

"All nodes should be configured only for the Status Internal CC Community static shards" causes me some confusion: what should the app's behavior be for new users (who have not joined a community), and for users that are already part of the Status community? Should new users use the status.prod fleet, and only switch to this new fleet once they join the Status community? If so, status-go would need to somehow associate communities with fleets, and such behavior is currently not implemented.

Something else to take into account is that status-go currently behaves like this:

  1. It uses DNSDiscovery to obtain the discv5 bootstrap nodes to discover relay peers and filter peers (if using light mode).
  2. The list of store nodes is hardcoded in status-go (although you can manually add nodes by invoking an RPC method).

We should probably discuss this with the Status team to design what makes sense for retrieving message history, since I assume that status.sharding.store will return messages only from the Status shards, while 1:1 messages, group chats, and other communities use the default pubsub topic for the time being. So some heuristic needs to be formulated to decide which store nodes to use to retrieve messages (hardcoded store node lists, associating fleets with communities, or TBD).

The same applies to peer discovery for shards (which is an open item for status-go). Should status-go use status.sharding.bootstrap for discovering all peers regardless of the shards they belong to, or should we use the nodes from both the status.prod and status.sharding.bootstrap fleets?

@richard-ramos
Member

do we already have an idea of what specific shards we'd use for the first (internal CCs?) Status Community?

Nope, but setting a shard cluster/index for a community is easy-ish. For now I imagine that setting up any shard index between 128 and 767 in cluster 16 and using that should be fine?
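For reference, combining a cluster and shard index into the static-shard pubsub topic name used later in this thread (e.g. /waku/2/rs/16/128) can be sketched in Go as follows. The 128-767 range check mirrors the suggestion above rather than any protocol-level rule:

```go
package main

import (
	"errors"
	"fmt"
)

// staticShardTopic builds the pubsub topic name for a static shard,
// following the /waku/2/rs/<cluster>/<index> convention seen in this
// thread. The index range check reflects the 128-767 suggestion above,
// not a protocol requirement.
func staticShardTopic(cluster, index uint16) (string, error) {
	if index < 128 || index > 767 {
		return "", errors.New("shard index outside the suggested 128-767 range")
	}
	return fmt.Sprintf("/waku/2/rs/%d/%d", cluster, index), nil
}

func main() {
	topic, err := staticShardTopic(16, 128)
	if err != nil {
		panic(err)
	}
	fmt.Println(topic) // /waku/2/rs/16/128
}
```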

@alrevuelta
Contributor

@jm-clius Perhaps this should go somewhere else? It's more related to deployments and, of course, there are some things to figure out, but I don't see this as a research problem.

@jm-clius
Contributor Author

what should be the behavior for the app for both new users (not having joined a community), and the behavior for users that are part of the status community

I would say there shouldn't really be any complicated logic done here for specific fleets. The heuristic I suggest (for now):

  • a Status node is subscribed to a couple of static shards via relay (it may have other static shards configured for light protocols only, e.g. certain control message shards)
  • user running the app for the first time may have some default static shard subscriptions (if e.g. it needs 1:1 messages by default, there should be a static shard assigned for that and be part of the initial app subscriptions)
  • user can join Status Community, which would add relay subscription to the appropriate static shard(s)
  • when user opens the app, a DNS query is performed which returns a list of bootstrap nodes. The discovery protocol should filter only those nodes that serve the subscribed static shards the user is interested in. This could also apply to populating the store node table.
  • user continues using the shared discovery layer (which will be shared across all fleets), but keeps filtering for nodes that belong to the static shards the user is interested in (i.e. only connects to these).

From what I understand, status-go does not yet support filtering on subscribed shards during peer discovery? That would need to be implemented, but shouldn't for now affect how we deploy the fleet, namely a fleet specifically configured to serve (only) the Status community. We could consider using the same fleet for the 1:1/group chat message static shards for now, though we'll likely split this off in future as well. The point is that, as far as I can see, there would be no need for any fleet-specific configuration if we have a shared discovery layer and a Status node that can filter discovered peers on shard.
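The shard-based filtering of discovered peers described above could look something like the following sketch. `Peer` and `filterBySubscribedShards` are hypothetical illustrations, not the actual status-go discovery code:

```go
package main

import "fmt"

// Peer is a hypothetical discovered peer together with the static shards
// it advertises (e.g. as decoded from its ENR).
type Peer struct {
	ID     string
	Shards []uint16
}

// filterBySubscribedShards keeps only peers that serve at least one shard
// the local node is subscribed to -- a sketch of the shard-aware discovery
// filtering described above.
func filterBySubscribedShards(peers []Peer, subscribed []uint16) []Peer {
	want := make(map[uint16]bool, len(subscribed))
	for _, s := range subscribed {
		want[s] = true
	}
	var out []Peer
	for _, p := range peers {
		for _, s := range p.Shards {
			if want[s] {
				out = append(out, p)
				break
			}
		}
	}
	return out
}

func main() {
	peers := []Peer{
		{ID: "boot-01", Shards: []uint16{32, 64}},
		{ID: "boot-02", Shards: []uint16{512}},
	}
	kept := filterBySubscribedShards(peers, []uint16{64, 128})
	fmt.Println(len(kept), kept[0].ID) // 1 boot-01
}
```

The same predicate could be applied when populating the store node table, as suggested in the bullet list above.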

The assumption is that we'll launch first for some test communities and thereafter only for the Status Internal CC community. We can choose to expand support on this fleet for more shards to make the process simpler, but at no point should anything other than static shards be used (i.e. Status nodes shouldn't be subscribed to the default pubsub topic). We can also increment towards something more sophisticated here (e.g. use the same store nodes for all shards and keep hardcoding these in the interim).

@jm-clius jm-clius transferred this issue from waku-org/research Aug 16, 2023
@jm-clius
Contributor Author

Perhaps this should go somewhere else? Its more related to deployments

@alrevuelta yeah, you're right. I've moved it to nwaku for now, as this is a nwaku deployment. I would say that some thinking re infrastructure will form part of research roadmaps (e.g. hammering out how a distributed bootstrap network will look for autosharding), which is why I had this issue in research first. But it will hopefully be closed soon, with related tasks opened in an infra repo.

@jm-clius
Contributor Author

@richard-ramos a couple of questions in order to get bootstrapping going in the simplest way possible:

  • will the store nodes for now still be hardcoded? If so, what do you think of setting up DNS discovery only for status.sharding.bootstrap for now, and still hardcoding status.sharding.store for the start of dogfooding? Two other alternatives:
    (1) set up all nodes (store and bootstrap) into the same DNS discovery list and assume that the app will eventually be able to determine which services it can get from which fleet node
    (2) we could also set up two different DNS discovery domains for the bootstrap and store fleets if you think this will be more future proof? That way dogfooding can start with store nodes still hardcoded (or populated by separate DNS query) and bootstrapping only via the bootstrap fleet.
  • should we pre-generate the pubsub topic(s) and key for at least one test community to get started? Afaik new communities will require manually updating the fleet every time. If we have the key/shard for the first one, we could save infra some effort.

@richard-ramos
Member

  1. I like the second alternative. Let's set up status.sharding.bootstrap and status.sharding.store DNS discovery URLs! I'll update status-go to include the hardcoded store nodes, since this is the easiest change that can be done, while attempting to introduce DNS discovery for retrieving the history. In the future we can just use the same DNS query for both bootstrapping and history.
  2. Yes, let's set up at least one shard for doing small-scale dogfooding between go-waku and status-go devs

@jm-clius
Contributor Author

Will do, thanks!
For:

  1. Yes, let's set up at least one shard for doing small-scale dogfooding between go-waku and status-go devs

Will you provide me with a sharded pubsub topic(s) and public key?

@richard-ramos
Member

0x045ced3b90fabf7673c5165f9cc3a038fd2cfeb96946538089c310b5eaa3a611094b27d8216d9ec8110bd0e0e9fa7a7b5a66e86a27954c9d88ebd41d0ab6cfbb91
/waku/2/rs/16/128

0x049022b33f7583f34463f5b7622e5da29f99f993e6858a478465c68ee114ccf142204eff285ed922349c4b71b178a2e1a2154b99bcc2d8e91b3994626ffa9f1a6c
/waku/2/rs/16/256

I can provide the private keys via DM on status, just let me know!
cc: @cammellos @ilmotta

@fryorcraken fryorcraken changed the title chore: setting up static sharding fleet for Status [Epic] setting up static sharding fleet for Status Aug 24, 2023
@fryorcraken fryorcraken added Epic and removed milestone Tracks a subteam milestone labels Aug 24, 2023
@jm-clius
Contributor Author

Weekly Update

  • achieved: final infra definition, including generated keys and shards, specified in infra-status issue
  • next: ensure fleet gets deployed as specified

@jm-clius
Contributor Author

jm-clius commented Sep 1, 2023

Weekly Update

  • achieved: negotiation with infra to improve fleet definition, clarify postgresql deployment
  • next: ensure fleet gets deployed as specified

@fryorcraken fryorcraken added E:Static sharding See https://github.com/waku-org/pm/issues/15 for details and removed E:2023-10k-users labels Sep 22, 2023
@fryorcraken
Collaborator

Weekly Update

  • achieved: fleet has been deployed, PostgreSQL setup has been tested.
  • next: Do some basic dogfooding with Status Desktop.

@richard-ramos
Member

richard-ramos commented Oct 17, 2023

New PRs related to static sharding for Status:

So far, I've been able to get messages going back and forth while using different shards. I defined the following shards:

  • Shard 32 - Used as the default for all messages instead of the default pubsub topic, since we can't mix named and static sharding. It's somewhat problematic because this is a breaking change: once merged, clients using this version won't receive messages from older versions. Happy to brainstorm a possible 'fix' for this problem.
  • Shard 64 - Used for the points of contact for a community, i.e. the CommunityRequestToJoin / CommunityRequestToJoinResponse messages. These need to go on a separate shard because they can't be protected. For now they don't require a signature, but maybe it's something we can add in the future if required, by having the Status clients contain a private key injected during the build process.
  • Shards 128 and 256 - These are shards defined to test community DoS protection.

--

I opened this in status-desktop: status-im/status-desktop#12443. Without it, it's currently not possible to choose the shards.test fleet. In status-im/status-desktop#12344 I 'solve' it by hardcoding the fleet name, but that's not a proper solution, just a hack to be able to test the fleet.

--

Discovery is currently not working. I'm investigating an issue in ENRs on the shards.test fleet. While testing this fleet, I found something weird. The ENRs defined for the bootnodes in https://fleets.status.im for that fleet are the following:

"enr/p2p/waku/boot": {
                "boot-01.do-ams3.shards.test": "enr:-Ny4QIGdHrr3QQCyGzro0mleJaWdhYI4RJZiDx_Tf0TnSON3NpJP0l7Tk3xfeJqGCkIeEQU1UckwC6muubC4tgB8FZYBgmlkgnY0gmlwhKdjEy-KbXVsdGlhZGRyc68ALTYoYm9vdC0wMS5kby1hbXMzLnNoYXJkcy50ZXN0LnN0YXR1c2ltLm5ldAZ2X4Jyc4sAEAQAIABAAIABAIlzZWNwMjU2azGhAt60bRUEoHNuLlnsM12sU2PIQwBwfLIJ8a_ZPEY2-Rnkg3RjcIJ2X4N1ZHCCIyiFd2FrdTIN",
                "boot-01.gc-us-central1-a.shards.test": "enr:-Oa4QLx_yxPWXpA8W9TJkHbbj6hec6RKWgXko7Fx3hIcPd8UUXnhH3SP6e1Jj1mKBgwWmK4d6XbOkQ0eOh93w8xc0MoBgmlkgnY0gmlwhCKHDVeKbXVsdGlhZGRyc7g4ADY2MWJvb3QtMDEuZ2MtdXMtY2VudHJhbDEtYS5zaGFyZHMudGVzdC5zdGF0dXNpbS5uZXQGdl-CcnOLABAEACAAQACAAQCJc2VjcDI1NmsxoQLGOqANDRbJFI6KVhTfYMDmT9c2UOKzebVV1eQr3EzqQ4N0Y3CCdl-DdWRwgiMohXdha3UyDQ",
                "boot-01.ac-cn-hongkong-c.shards.test": "enr:-Oa4QNivsUYDIbwqfZmFFi-82umI5pafhfNiqkjojH104FvNIhkPIOlY9fm8G643ZOqvgwhI5SX5ucekJFkolb8Wk7QBgmlkgnY0gmlwhAjaF0yKbXVsdGlhZGRyc7g4ADY2MWJvb3QtMDEuYWMtY24taG9uZ2tvbmctYy5zaGFyZHMudGVzdC5zdGF0dXNpbS5uZXQGdl-CcnOLABAEACAAQACAAQCJc2VjcDI1NmsxoQM_sJtGT5gonA4UUzhn2d7LQY9ztY8loLAaSk1HKVruYIN0Y3CCdl-DdWRwgiMohXdha3UyDQ"
            },

and looking at https://enr-viewer.com, I can see that the rs field is there (with the value 0x0010040020004000800100). However, when I used discv5 with those bootnodes, the following ENRs were returned:

enr:-Ne4QHOpWLyVVZMzJwXcc00CNp16vB5x2WFy6WQAEKyaOf_UMWKvz2a0HN9QCoSyBYmudBKspqYa_U6tJ64B0TqLzy0BgmlkgnY0gmlwhAjarmyKbXVsdGlhZGRyc7g4ADY2MWJvb3QtMDIuYWMtY24taG9uZ2tvbmctYy5zaGFyZHMudGVzdC5zdGF0dXNpbS5uZXQGdl-Jc2VjcDI1NmsxoQNeQXcyqdYwEjflVdLKYAusuZJ93fpGiFwqK1jU9ISQC4N0Y3CCdl-DdWRwgiMohXdha3UyDQ
16Uiu2HAmJzva9cFZdiLEeaXC4rLTZGH8DmrTetPfpmngrcaaNhUN [/ip4/8.218.174.108/tcp/30303/p2p/16Uiu2HAmJzva9cFZdiLEeaXC4rLTZGH8DmrTetPfpmngrcaaNhUN /dns4/boot-02.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmJzva9cFZdiLEeaXC4rLTZGH8DmrTetPfpmngrcaaNhUN] <nil>

enr:-Ne4QJKpiQqwYpo0p1yDW6opKFYzh801nhSzX65S_x892UXABVYzFBrdFwCPiWwXlKqVz5sXkTzYtUuX1wg2sW5DZnwBgmlkgnY0gmlwhCIfDu-KbXVsdGlhZGRyc7g4ADY2MWJvb3QtMDIuZ2MtdXMtY2VudHJhbDEtYS5zaGFyZHMudGVzdC5zdGF0dXNpbS5uZXQGdl-Jc2VjcDI1NmsxoQJm8YcPIYhI5rvlLJJRlpebApk6w4uOLdFgAeHN2wO9N4N0Y3CCdl-DdWRwgiMohXdha3UyDQ
16Uiu2HAm2MXB1WzsGKnYrcX8GRSvunQ1riJmPzVZuvUphM1YE4pn [/ip4/34.31.14.239/tcp/30303/p2p/16Uiu2HAm2MXB1WzsGKnYrcX8GRSvunQ1riJmPzVZuvUphM1YE4pn /dns4/boot-02.gc-us-central1-a.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAm2MXB1WzsGKnYrcX8GRSvunQ1riJmPzVZuvUphM1YE4pn] <nil>

enr:-M24QJDZfhB_wN_PHOAQuzgnta20xKUsZl5kdhBeQJM16gdldCJNAKQp6dgbwo-MTRJxYVNCr85cHRAJxtNLR4vTbP0BgmlkgnY0gmlwhKdjEy-KbXVsdGlhZGRyc68ALTYoYm9vdC0wMS5kby1hbXMzLnNoYXJkcy50ZXN0LnN0YXR1c2ltLm5ldAZ2X4lzZWNwMjU2azGhAt60bRUEoHNuLlnsM12sU2PIQwBwfLIJ8a_ZPEY2-Rnkg3RjcIJ2X4N1ZHCCIyiFd2FrdTIN
16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31 [/ip4/167.99.19.47/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31 /dns4/boot-01.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31] <nil>

enr:-M24QAsRRxoLDnnXFGnbHGUKjtqgXOVxb2Cian1vegc1rtY0Yk5wXDF7NeBzPl7frvyxo3Vt-xSL0vUa2jazchNIS_oBgmlkgnY0gmlwhLKAj_GKbXVsdGlhZGRyc68ALTYoYm9vdC0wMi5kby1hbXMzLnNoYXJkcy50ZXN0LnN0YXR1c2ltLm5ldAZ2X4lzZWNwMjU2azGhAtsXOrELG9R5LlIbF6bqeLC0tg7bmNzQ0JkSmEO3zxqzg3RjcIJ2X4N1ZHCCIyiFd2FrdTIN
16Uiu2HAmAAuoviraBqSBcR5eC346RK46SruiPKdFQBvWrFjXEkLr [/ip4/178.128.143.241/tcp/30303/p2p/16Uiu2HAmAAuoviraBqSBcR5eC346RK46SruiPKdFQBvWrFjXEkLr /dns4/boot-02.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAAuoviraBqSBcR5eC346RK46SruiPKdFQBvWrFjXEkLr] <nil>

and interestingly enough, none of these ENRs have the rs field. I'm also curious about these boot-02 nodes that don't appear in https://fleets.status.im

@richard-ramos
Member

Weekly Update

  • achieved: set up a separate shard for community points of contact, and another one for 1:1/group messages
  • next: investigate/fix discv5 not working when static sharding is used.

@fryorcraken fryorcraken removed the E:Static sharding See https://github.com/waku-org/pm/issues/15 for details label Oct 20, 2023
@fryorcraken fryorcraken changed the title [Epic] setting up static sharding fleet for Status Setting up static sharding fleet for Status Oct 20, 2023
@fryorcraken fryorcraken removed the Epic label Oct 20, 2023
@fryorcraken fryorcraken added the E:Targeted Status Communities dogfooding See https://github.com/waku-org/pm/issues/97 for details label Oct 20, 2023
@fryorcraken
Collaborator

This looks done, but I will wait for @jm-clius to be back (end of October) before closing, just in case we missed something.

@SionoiS SionoiS moved this to In Progress in Waku Oct 24, 2023
@jm-clius
Contributor Author

Indeed. As far as I can tell, the fleet has been successfully deployed, the PostgreSQL setup has been tested, and bootstrap DNS entries are available. Any further issues and investigations would be better tracked in new, separate issues.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Waku Oct 26, 2023