
Setting up static sharding fleet for Status #1914

Closed
jm-clius opened this issue Aug 15, 2023 · 17 comments
Assignees
Labels
E:Targeted Status Communities dogfooding See https://github.com/waku-org/pm/issues/97 for details

Comments

@jm-clius
Contributor

jm-clius commented Aug 15, 2023

Background

This issue tracks the work necessary to set up a static sharding fleet for Status Communities. This forms part of the 10K users epic.
It involves:

Why is this issue not in an infra repo?

Some requirements need to be agreed upon before opening an issue in the relevant infra repo.

Requirements

I suggest the following configuration:

  • Fleet of (at least) 10 nodes, split into two sub-fleets:
    • status.sharding.store: 5 nodes configured only with relay and store. These will be the store/historical message providers
    • status.sharding.bootstrap: 5 nodes configured with relay, filter, lightpush, peer-exchange. These are the main bootstrap nodes and also provide services to resource-restricted nodes.
  • all nodes should be configured only for the Status Internal CC Community static shards
  • status.sharding.store should be configured with a single, shared PostgreSQL backend
  • status-go nodes should preferably only use the bootstrap nodes to prime their discv5 routing tables, as a minimal mechanism to limit unnecessary interaction with store nodes.
  • no websockets configuration for now
  • 1 node in status.sharding.bootstrap should be set up with trace-level message logs, in order to facilitate future end-to-end message tracing and debugging.

Tracking Issue: status-im/infra-status#2

@jm-clius
Contributor Author

Some questions on the above:

  • @richard-ramos can you take a look if this approach makes sense, especially splitting the fleet between the bootstrap nodes and the store nodes? Would it be possible/easier to e.g. set up two different DNS node trees - one for store and one for bootstrapping? The status-go nodes would then populate their store providers with one query and use another query to bootstrap connection to the network? The alternative is to simply use a single DNS list retrievable via a single query, and then use the already-existing capability differentiation to populate store nodes and bootstrap to the remaining nodes.
  • @richard-ramos do we already have an idea of what specific shards we'd use for the first (internal CCs?) Status Community?
  • @Ivansete-status we probably want to prioritise the final PostgreSQL deployment for wakuv2.sharding and use that as a blueprint to create something similar for status.sharding.store.
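The "capability differentiation" alternative from the first question above could be sketched roughly as follows. Note this is an illustration only: `DiscoveredNode`, its fields, and `partitionByCapability` are hypothetical placeholders, not actual go-waku or status-go types.

```go
package main

import "fmt"

// DiscoveredNode is a hypothetical stand-in for a peer record returned by
// DNS discovery, carrying capability flags as advertised in its ENR.
type DiscoveredNode struct {
	Addr  string
	Store bool // supports the store protocol
	Relay bool
}

// partitionByCapability splits a single discovered-node list into store
// providers and general bootstrap peers, mirroring the "single DNS query
// plus capability differentiation" alternative described above.
func partitionByCapability(nodes []DiscoveredNode) (storeNodes, bootstrapNodes []DiscoveredNode) {
	for _, n := range nodes {
		if n.Store {
			storeNodes = append(storeNodes, n)
		} else {
			bootstrapNodes = append(bootstrapNodes, n)
		}
	}
	return storeNodes, bootstrapNodes
}

func main() {
	nodes := []DiscoveredNode{
		{Addr: "store-01.status.sharding", Store: true, Relay: true},
		{Addr: "boot-01.status.sharding", Store: false, Relay: true},
	}
	store, boot := partitionByCapability(nodes)
	fmt.Println(len(store), len(boot)) // 1 1
}
```

With this approach a single DNS query suffices, and the client decides locally which nodes to use as store providers versus bootstrap peers.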

@richard-ramos
Member

"All nodes should be configured only for the Status Internal CC Community static shards" causes me some confusion: what should the app's behavior be for new users (who have not joined a community), and for users that are already part of the Status community? Should new users use the status.prod fleet, and only switch to this new fleet once they join the Status community? If so, status-go would need to somehow associate communities with fleets, and such behavior is currently not implemented.

Something else to take into account is that status-go currently behaves like this:

  1. It uses DNSDiscovery to obtain the discv5 bootstrap nodes to discover relay peers and filter peers (if using light mode).
  2. The list of store nodes is hardcoded in status-go (although you can manually add nodes by invoking an RPC method).

We should probably discuss this with the Status team to design what makes sense for retrieving message history, since I assume that status.sharding.store will return messages only from the Status shards, while 1:1 messages, group chats, and other communities use the default pubsub topic for the time being. So some heuristic needs to be formulated to decide which store nodes to use to retrieve messages (hardcoded store node lists, associating fleets with communities, or TBD).

The same applies to peer discovery for shards (which is an open item for status-go). Should status-go use status.sharding.bootstrap for discovering all peers regardless of the shards they belong to, or should we use the nodes from both the status.prod and status.sharding.bootstrap fleets?

@richard-ramos
Member

do we already have an idea of what specific shards we'd use for the first (internal CCs?) Status Community?

Nope, but setting a shard cluster/index for a community is easy-ish. For now I imagine that setting up any shard index between 128 and 767 in cluster 16 and using that should be fine?
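For reference, combining a cluster and shard index into the static-shard pubsub topic name used later in this thread (e.g. /waku/2/rs/16/128) can be sketched in Go as follows. The 128-767 range check mirrors the suggestion above rather than any protocol-level rule:

```go
package main

import (
	"errors"
	"fmt"
)

// staticShardTopic builds the pubsub topic name for a static shard,
// following the /waku/2/rs/<cluster>/<index> convention seen in this
// thread. The index range check reflects the 128-767 suggestion above,
// not a protocol requirement.
func staticShardTopic(cluster, index uint16) (string, error) {
	if index < 128 || index > 767 {
		return "", errors.New("shard index outside the suggested 128-767 range")
	}
	return fmt.Sprintf("/waku/2/rs/%d/%d", cluster, index), nil
}

func main() {
	topic, err := staticShardTopic(16, 128)
	if err != nil {
		panic(err)
	}
	fmt.Println(topic) // /waku/2/rs/16/128
}
```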

@alrevuelta
Contributor

@jm-clius Perhaps this should go somewhere else? It's more related to deployments and, of course, there are some things to figure out, but I don't see this as a research problem.

@jm-clius
Contributor Author

what should be the behavior for the app for both new users (not having joined a community), and the behavior for users that are part of the status community

I would say there shouldn't really be any complicated logic done here for specific fleets. The heuristic I suggest (for now):

  • a Status node is subscribed to a couple of static shards via relay (it may have other static shards configured for light protocols only, e.g. certain control message shards)
  • user running the app for the first time may have some default static shard subscriptions (if e.g. it needs 1:1 messages by default, there should be a static shard assigned for that and be part of the initial app subscriptions)
  • user can join Status Community, which would add relay subscription to the appropriate static shard(s)
  • when user opens the app, a DNS query is performed which returns a list of bootstrap nodes. The discovery protocol should filter only those nodes that serve the subscribed static shards the user is interested in. This could also apply to populating the store node table.
  • user continues using the shared discovery layer (which will be shared across all fleets), but keeps filtering for nodes that belong to the static shards the user is interested in (i.e. only connects to these).

From what I understand, status-go does not yet support filtering on subscribed shards during peer discovery? That would need to be implemented, but shouldn't for now affect how we deploy the fleet, namely a fleet specifically configured to serve (only) the Status community. We could consider using the same fleet for the 1:1/group chat message static shards for now, though we'll likely split this off in future as well. The point is that, as far as I can see, there would be no need for any fleet-specific configuration if we have a shared discovery layer and a Status node that can filter discovered peers on shard.
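The shard-based filtering of discovered peers described above could look something like the following sketch. `Peer` and `filterBySubscribedShards` are hypothetical illustrations, not the actual status-go discovery code:

```go
package main

import "fmt"

// Peer is a hypothetical discovered peer together with the static shards
// it advertises (e.g. as decoded from its ENR).
type Peer struct {
	ID     string
	Shards []uint16
}

// filterBySubscribedShards keeps only peers that serve at least one shard
// the local node is subscribed to -- a sketch of the shard-aware discovery
// filtering described above.
func filterBySubscribedShards(peers []Peer, subscribed []uint16) []Peer {
	want := make(map[uint16]bool, len(subscribed))
	for _, s := range subscribed {
		want[s] = true
	}
	var out []Peer
	for _, p := range peers {
		for _, s := range p.Shards {
			if want[s] {
				out = append(out, p)
				break
			}
		}
	}
	return out
}

func main() {
	peers := []Peer{
		{ID: "boot-01", Shards: []uint16{32, 64}},
		{ID: "boot-02", Shards: []uint16{512}},
	}
	kept := filterBySubscribedShards(peers, []uint16{64, 128})
	fmt.Println(len(kept), kept[0].ID) // 1 boot-01
}
```

The same predicate could be applied when populating the store node table, as suggested in the bullet list above.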

The assumption is that we'll launch first for some test communities and thereafter only for the Status Internal CC community. We can choose to expand support on this fleet for more shards to make the process simpler, but at no point should anything other than static shards be used (i.e. Status nodes shouldn't be subscribed to the default pubsub topic). We can also increment towards something more sophisticated here (e.g. use the same store nodes for all shards and keep hardcoding these in the interim).

@jm-clius jm-clius transferred this issue from waku-org/research Aug 16, 2023
@jm-clius
Contributor Author

Perhaps this should go somewhere else? Its more related to deployments

@alrevuelta yeah, you're right. I've moved it to nwaku for now, as this is a nwaku deployment. I would say that some thinking re infrastructure will form part of research roadmaps (e.g. hammering out how a distributed bootstrap network will look for autosharding), which is why I had this issue in research first. But it will hopefully be closed soon, with related tasks opened in an infra repo.

@jm-clius
Contributor Author

@richard-ramos a couple of questions in order to get bootstrapping going in the simplest way possible:

  • will the store nodes for now still be hardcoded? If so, what do you think of setting up DNS discovery only for status.sharding.bootstrap for now, and still hardcoding status.sharding.store for the start of dogfooding? Two other alternatives:
    (1) set up all nodes (store and bootstrap) into the same DNS discovery list and assume that the app will eventually be able to determine which services it can get from which fleet node
    (2) we could also set up two different DNS discovery domains for the bootstrap and store fleets if you think this will be more future proof? That way dogfooding can start with store nodes still hardcoded (or populated by separate DNS query) and bootstrapping only via the bootstrap fleet.
  • should we pre-generate the pubsub topic(s) and key for at least one test community to get started? Afaik new communities will require manually updating the fleet every time. If we have the key/shard for the first one, we could save infra some effort.

@richard-ramos
Member

  1. I like the second alternative. Let's set up status.sharding.bootstrap and status.sharding.store DNS discovery URLs! I'll update status-go to include the hardcoded store nodes, since this is the easiest change that can be done, while attempting to introduce DNS discovery for retrieving the history. In the future we can just use the same DNS query for both bootstrapping and history.
  2. Yes, let's set up at least one shard for doing small-scale dogfooding between go-waku and status-go devs

@jm-clius
Contributor Author

Will do, thanks!
For:

  1. Yes, let's set up at least one shard for doing small-scale dogfooding between go-waku and status-go devs

Will you provide me with a sharded pubsub topic(s) and public key?

@richard-ramos
Member

0x045ced3b90fabf7673c5165f9cc3a038fd2cfeb96946538089c310b5eaa3a611094b27d8216d9ec8110bd0e0e9fa7a7b5a66e86a27954c9d88ebd41d0ab6cfbb91
/waku/2/rs/16/128

0x049022b33f7583f34463f5b7622e5da29f99f993e6858a478465c68ee114ccf142204eff285ed922349c4b71b178a2e1a2154b99bcc2d8e91b3994626ffa9f1a6c
/waku/2/rs/16/256

I can provide the private keys via DM on status, just let me know!
cc: @cammellos @ilmotta

@fryorcraken fryorcraken changed the title chore: setting up static sharding fleet for Status [Epic] setting up static sharding fleet for Status Aug 24, 2023
@fryorcraken fryorcraken added Epic and removed milestone Tracks a subteam milestone labels Aug 24, 2023
@jm-clius
Contributor Author

Weekly Update

  • achieved: final infra definition, including generated keys and shards, specified in infra-status issue
  • next: ensure fleet gets deployed as specified

@jm-clius
Contributor Author

jm-clius commented Sep 1, 2023

Weekly Update

  • achieved: negotiation with infra to improve fleet definition, clarify postgresql deployment
  • next: ensure fleet gets deployed as specified

@fryorcraken fryorcraken added E:Static sharding See https://github.com/waku-org/pm/issues/15 for details and removed E:2023-10k-users labels Sep 22, 2023
@fryorcraken
Collaborator

Weekly Update

  • achieved: fleet has been deployed, PostgreSQL setup has been tested.
  • next: Do some basic dogfooding with Status Desktop.

@richard-ramos
Member

richard-ramos commented Oct 17, 2023

New PRs related to static sharding for Status:

So far, I've been able to get messages going back and forth while using different shards. I defined the following shards:

  • Shard 32 - Used as the default for all messages instead of the default pubsub topic, since we can't mix named and static sharding. It's somewhat problematic because this is a breaking change: once merged, clients using this version won't receive messages from older versions. Happy to brainstorm a possible 'fix' for this problem.
  • Shard 64 - Used for the points of contact for a community, i.e. the CommunityRequestToJoin / CommunityRequestToJoinResponse messages. These need to go on a separate shard because they can't be protected. For now they don't require a signature, but maybe it's something we can add in the future if required, by having the Status clients contain a private key injected during the build process.
  • Shards 128 and 256 - These are shards defined to test community DoS protection.

--

I opened this in status-desktop: status-im/status-desktop#12443. Without it, it's currently not possible to choose the shards.test fleet. In status-im/status-desktop#12344 I 'solve' it by hardcoding the fleet name, but that's not a proper solution, just a hack to be able to test the fleet.

--

Discovery is currently not working. I'm investigating an issue in ENRs on the shards.test fleet. While testing this fleet, I found something weird. The ENRs defined for the bootnodes in https://fleets.status.im for that fleet are the following:

"enr/p2p/waku/boot": {
                "boot-01.do-ams3.shards.test": "enr:-Ny4QIGdHrr3QQCyGzro0mleJaWdhYI4RJZiDx_Tf0TnSON3NpJP0l7Tk3xfeJqGCkIeEQU1UckwC6muubC4tgB8FZYBgmlkgnY0gmlwhKdjEy-KbXVsdGlhZGRyc68ALTYoYm9vdC0wMS5kby1hbXMzLnNoYXJkcy50ZXN0LnN0YXR1c2ltLm5ldAZ2X4Jyc4sAEAQAIABAAIABAIlzZWNwMjU2azGhAt60bRUEoHNuLlnsM12sU2PIQwBwfLIJ8a_ZPEY2-Rnkg3RjcIJ2X4N1ZHCCIyiFd2FrdTIN",
                "boot-01.gc-us-central1-a.shards.test": "enr:-Oa4QLx_yxPWXpA8W9TJkHbbj6hec6RKWgXko7Fx3hIcPd8UUXnhH3SP6e1Jj1mKBgwWmK4d6XbOkQ0eOh93w8xc0MoBgmlkgnY0gmlwhCKHDVeKbXVsdGlhZGRyc7g4ADY2MWJvb3QtMDEuZ2MtdXMtY2VudHJhbDEtYS5zaGFyZHMudGVzdC5zdGF0dXNpbS5uZXQGdl-CcnOLABAEACAAQACAAQCJc2VjcDI1NmsxoQLGOqANDRbJFI6KVhTfYMDmT9c2UOKzebVV1eQr3EzqQ4N0Y3CCdl-DdWRwgiMohXdha3UyDQ",
                "boot-01.ac-cn-hongkong-c.shards.test": "enr:-Oa4QNivsUYDIbwqfZmFFi-82umI5pafhfNiqkjojH104FvNIhkPIOlY9fm8G643ZOqvgwhI5SX5ucekJFkolb8Wk7QBgmlkgnY0gmlwhAjaF0yKbXVsdGlhZGRyc7g4ADY2MWJvb3QtMDEuYWMtY24taG9uZ2tvbmctYy5zaGFyZHMudGVzdC5zdGF0dXNpbS5uZXQGdl-CcnOLABAEACAAQACAAQCJc2VjcDI1NmsxoQM_sJtGT5gonA4UUzhn2d7LQY9ztY8loLAaSk1HKVruYIN0Y3CCdl-DdWRwgiMohXdha3UyDQ"
            },

and looking at https://enr-viewer.com, I can see that the rs field is there (with the value 0x0010040020004000800100). However, when I used discv5 with those bootnodes, the following ENRs were returned:

enr:-Ne4QHOpWLyVVZMzJwXcc00CNp16vB5x2WFy6WQAEKyaOf_UMWKvz2a0HN9QCoSyBYmudBKspqYa_U6tJ64B0TqLzy0BgmlkgnY0gmlwhAjarmyKbXVsdGlhZGRyc7g4ADY2MWJvb3QtMDIuYWMtY24taG9uZ2tvbmctYy5zaGFyZHMudGVzdC5zdGF0dXNpbS5uZXQGdl-Jc2VjcDI1NmsxoQNeQXcyqdYwEjflVdLKYAusuZJ93fpGiFwqK1jU9ISQC4N0Y3CCdl-DdWRwgiMohXdha3UyDQ
16Uiu2HAmJzva9cFZdiLEeaXC4rLTZGH8DmrTetPfpmngrcaaNhUN [/ip4/8.218.174.108/tcp/30303/p2p/16Uiu2HAmJzva9cFZdiLEeaXC4rLTZGH8DmrTetPfpmngrcaaNhUN /dns4/boot-02.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmJzva9cFZdiLEeaXC4rLTZGH8DmrTetPfpmngrcaaNhUN] <nil>

enr:-Ne4QJKpiQqwYpo0p1yDW6opKFYzh801nhSzX65S_x892UXABVYzFBrdFwCPiWwXlKqVz5sXkTzYtUuX1wg2sW5DZnwBgmlkgnY0gmlwhCIfDu-KbXVsdGlhZGRyc7g4ADY2MWJvb3QtMDIuZ2MtdXMtY2VudHJhbDEtYS5zaGFyZHMudGVzdC5zdGF0dXNpbS5uZXQGdl-Jc2VjcDI1NmsxoQJm8YcPIYhI5rvlLJJRlpebApk6w4uOLdFgAeHN2wO9N4N0Y3CCdl-DdWRwgiMohXdha3UyDQ
16Uiu2HAm2MXB1WzsGKnYrcX8GRSvunQ1riJmPzVZuvUphM1YE4pn [/ip4/34.31.14.239/tcp/30303/p2p/16Uiu2HAm2MXB1WzsGKnYrcX8GRSvunQ1riJmPzVZuvUphM1YE4pn /dns4/boot-02.gc-us-central1-a.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAm2MXB1WzsGKnYrcX8GRSvunQ1riJmPzVZuvUphM1YE4pn] <nil>

enr:-M24QJDZfhB_wN_PHOAQuzgnta20xKUsZl5kdhBeQJM16gdldCJNAKQp6dgbwo-MTRJxYVNCr85cHRAJxtNLR4vTbP0BgmlkgnY0gmlwhKdjEy-KbXVsdGlhZGRyc68ALTYoYm9vdC0wMS5kby1hbXMzLnNoYXJkcy50ZXN0LnN0YXR1c2ltLm5ldAZ2X4lzZWNwMjU2azGhAt60bRUEoHNuLlnsM12sU2PIQwBwfLIJ8a_ZPEY2-Rnkg3RjcIJ2X4N1ZHCCIyiFd2FrdTIN
16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31 [/ip4/167.99.19.47/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31 /dns4/boot-01.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31] <nil>

enr:-M24QAsRRxoLDnnXFGnbHGUKjtqgXOVxb2Cian1vegc1rtY0Yk5wXDF7NeBzPl7frvyxo3Vt-xSL0vUa2jazchNIS_oBgmlkgnY0gmlwhLKAj_GKbXVsdGlhZGRyc68ALTYoYm9vdC0wMi5kby1hbXMzLnNoYXJkcy50ZXN0LnN0YXR1c2ltLm5ldAZ2X4lzZWNwMjU2azGhAtsXOrELG9R5LlIbF6bqeLC0tg7bmNzQ0JkSmEO3zxqzg3RjcIJ2X4N1ZHCCIyiFd2FrdTIN
16Uiu2HAmAAuoviraBqSBcR5eC346RK46SruiPKdFQBvWrFjXEkLr [/ip4/178.128.143.241/tcp/30303/p2p/16Uiu2HAmAAuoviraBqSBcR5eC346RK46SruiPKdFQBvWrFjXEkLr /dns4/boot-02.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAAuoviraBqSBcR5eC346RK46SruiPKdFQBvWrFjXEkLr] <nil>

and interestingly enough, none of these ENRs have the rs field. I'm also curious about these boot-02 nodes that don't appear in https://fleets.status.im

@richard-ramos
Member

Weekly Update

  • achieved: set up a separate shard for community points of contact, and another one for 1:1/group messages
  • next: investigate/fix discv5 not working when static sharding is used.

@fryorcraken fryorcraken removed the E:Static sharding See https://github.com/waku-org/pm/issues/15 for details label Oct 20, 2023
@fryorcraken fryorcraken changed the title [Epic] setting up static sharding fleet for Status Setting up static sharding fleet for Status Oct 20, 2023
@fryorcraken fryorcraken removed the Epic label Oct 20, 2023
@fryorcraken fryorcraken added the E:Targeted Status Communities dogfooding See https://github.com/waku-org/pm/issues/97 for details label Oct 20, 2023
@fryorcraken
Collaborator

This looks done, but I will wait for @jm-clius to be back (end of October) before closing, just in case we missed something.

@SionoiS SionoiS moved this to In Progress in Waku Oct 24, 2023
@jm-clius
Contributor Author

Indeed. As far as I can tell, the fleet has been successfully deployed, the PostgreSQL setup has been tested, and bootstrap DNS entries are available. Any further issues and investigations would be better tracked in new, separate issues.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Waku Oct 26, 2023