Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: wait for metrics model to be available locally #3265

Closed
wants to merge 8 commits into from

Conversation

stbrody
Copy link
Contributor

@stbrody stbrody commented Jul 19, 2024

Builds on #3260

When starting up a clean js-ceramic node for the first time with node metrics publishing enabled, the node needs to wait for the NodeMetrics model to be synced via recon and made available locally before it tries to index it. Example out from a clean node as it waits for this to happen below:

[2024-07-19T18:56:30.467Z] IMPORTANT: Connecting to IPFS node with version "ceramic-one/0.28.1" available as /ip4/127.0.0.1/udp/4101/quic-v1/p2p/12D3KooWL9zNBFnNEgEPRCtrzFXX7iyN8vnGWJGX7z6eaFkmdy5e, /ip4/10.0.0.188/udp/4101/quic-v1/p2p/12D3KooWL9zNBFnNEgEPRCtrzFXX7iyN8vnGWJGX7z6eaFkmdy5e, /ip4/127.0.0.1/tcp/4101/p2p/12D3KooWL9zNBFnNEgEPRCtrzFXX7iyN8vnGWJGX7z6eaFkmdy5e, /ip4/10.0.0.188/tcp/4101/p2p/12D3KooWL9zNBFnNEgEPRCtrzFXX7iyN8vnGWJGX7z6eaFkmdy5e
[2024-07-19T18:56:30.469Z] IMPORTANT: Initializing SQLite connection
[2024-07-19T18:56:30.509Z] IMPORTANT: Node DID set to 'did:key:z6MkhXxQCx33gofNSNSRuipu5xr8cNnW2b4aoXebyfG6FXwe'. This DID will be used to authenticate to the anchor service
[2024-07-19T18:56:30.509Z] IMPORTANT: Connecting to ceramic network 'testnet-clay' using ceramic-one with Recon for data synchronization.
[2024-07-19T18:56:30.519Z] IMPORTANT: Creating ComposeDB config table: ceramic_models
[2024-07-19T18:56:30.519Z] IMPORTANT: Creating ComposeDB config table: ceramic_config
[2024-07-19T18:56:30.519Z] IMPORTANT: Creating ComposeDB config table: ceramic_model_implements
[2024-07-19T18:56:30.536Z] IMPORTANT: Connecting to bootstrap peers for network testnet-clay
[2024-07-19T18:56:30.543Z] WARNING: Can not connect to /dns4/bootstrap-tnet-rust-ceramic-1.3box.io/tcp/4101/p2p/12D3KooWMqCFj5bnwuNi6D6KLhYiK4C8Eh9xSUKv2E6Jozs4nWEE
[2024-07-19T18:56:30.543Z] WARNING: HTTPError: An internal error occurred
[2024-07-19T18:56:30.545Z] WARNING: Can not connect to /dns4/bootstrap-tnet-rust-ceramic-2.3box.io/tcp/4101/p2p/12D3KooWPFGbRHWfDaWt5MFFeqAHBBq3v5BqeJ4X7pmn2V1t6uNs
[2024-07-19T18:56:30.545Z] WARNING: HTTPError: An internal error occurred
[2024-07-19T18:56:30.990Z] IMPORTANT: Connected to anchor service 'https://cas-clay.3boxlabs.com' with supported anchor chains ['eip155:100']
[2024-07-19T18:56:30.992Z] WARNING: No pinned streams detected. This is expected if this is the first time this node has been run, but may indicate a problem with the node's persistence setup if it should have pinned streams
[2024-07-19T18:56:30.997Z] IMPORTANT: Waiting for Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q used to publish NodeMetrics to be available locally
[2024-07-19T18:56:34.134Z] ERROR: Error loading Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q used to publish NodeMetrics: HTTPError: block was not found locally (offline)
[2024-07-19T18:56:37.344Z] ERROR: Error loading Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q used to publish NodeMetrics: HTTPError: block was not found locally (offline)
[2024-07-19T18:56:43.757Z] ERROR: Error loading Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q used to publish NodeMetrics: HTTPError: block was not found locally (offline)
[2024-07-19T18:56:56.560Z] ERROR: Error loading Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q used to publish NodeMetrics: HTTPError: block was not found locally (offline)
[2024-07-19T18:57:09.362Z] IMPORTANT: Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q used to publish NodeMetrics loaded successfully
[2024-07-19T18:57:09.362Z] IMPORTANT: Starting indexing for Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q
[2024-07-19T18:57:09.408Z] IMPORTANT: Creating ComposeDB Indexing table for model: kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q
[2024-07-19T18:57:09.419Z] IMPORTANT: Publishing Node Metrics publicly to the Ceramic Network.  To learn more, including how to disable publishing, please see the NODE_METRICS.md file for your branch, e.g. https://github.com/ceramicnetwork/js-ceramic/blob/develop/docs-dev/NODE_METRICS.md
[2024-07-19T18:57:09.422Z] IMPORTANT: Ceramic API running on 0.0.0.0:7007'

@stbrody stbrody self-assigned this Jul 19, 2024
Copy link

linear bot commented Jul 19, 2024

}

// eslint-disable-next-line no-constant-condition
while (true) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note we will wait forever if the node config says it wants to publish node metrics but the model doesn't ever show up (e.g. because the node is isolated from the rest of the network, or connected to a local network where the Model isn't available)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that is bad we should just continue with our work

there are way too many ways for ceramic nodes to fail, i would strongly prefer NodeMetrics and observability NOT be a way that nodes error out and don't start.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's an argument to be made here that if the user asked for metrics in the config and we fail, they should learn that and choose to turn off metrics or to fix the underlying error.

But that said I don't feel strongly and since we plan to turn this on by default, I see the argument to minimize it's impact. I'll update to make this time out without failing startup

@stbrody stbrody requested review from gvelez17 and dav1do July 19, 2024 19:03
Copy link
Contributor

@gvelez17 gvelez17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we be more forgiving please if we don't find the nodemetrics model

i really don't want to make more error paths for new devs, lets be forgiving in this until we are sure its rock solid and always should succeed in the world

also they might be in a local network without access to the metrics model, or on a new network they made

i would like NOT to error out if node metrics are not working - its not a requirement - we can test it more ourselves to make it work more often

@stbrody
Copy link
Contributor Author

stbrody commented Jul 23, 2024

Okay, new version up that waits for up to 10 seconds for the model to become available and then proceeds with startup anyway if the model hasn't been loaded in time. New output if the model is unable to be synced:

[2024-07-23T21:31:30.142Z] IMPORTANT: Connected to anchor service 'https://cas-clay.3boxlabs.com' with supported anchor chains ['eip155:100']
[2024-07-23T21:31:30.143Z] WARNING: No pinned streams detected. This is expected if this is the first time this node has been run, but may indicate a problem with the node's persistence setup if it should have pinned streams
[2024-07-23T21:31:30.151Z] IMPORTANT: Waiting for Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q used to publish Node Metrics to be available locally
[2024-07-23T21:31:30.688Z] ERROR: Error loading Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q used to publish Node Metrics: HTTPError: block was not found locally (offline)
[2024-07-23T21:31:31.229Z] ERROR: Error loading Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q used to publish Node Metrics: HTTPError: block was not found locally (offline)
[2024-07-23T21:31:31.764Z] ERROR: Error loading Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q used to publish Node Metrics: HTTPError: block was not found locally (offline)
[2024-07-23T21:31:32.291Z] ERROR: Error loading Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q used to publish Node Metrics: HTTPError: block was not found locally (offline)
[2024-07-23T21:31:32.821Z] ERROR: Error loading Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q used to publish Node Metrics: HTTPError: block was not found locally (offline)
[2024-07-23T21:31:33.352Z] ERROR: Error loading Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q used to publish Node Metrics: HTTPError: block was not found locally (offline)
[2024-07-23T21:31:33.884Z] ERROR: Error loading Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q used to publish Node Metrics: HTTPError: block was not found locally (offline)
[2024-07-23T21:31:34.413Z] ERROR: Error loading Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q used to publish Node Metrics: HTTPError: block was not found locally (offline)
[2024-07-23T21:31:34.946Z] ERROR: Error loading Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q used to publish Node Metrics: HTTPError: block was not found locally (offline)
[2024-07-23T21:31:35.471Z] WARNING: Could not load Model kjzl6hvfrbw6cb9pd0bl7zmm28h3qszh56ccpn50vsmrl7clroy4fvln00z7q6q used to publish Node Metrics, disabling metrics publishing
[2024-07-23T21:31:35.480Z] IMPORTANT: Ceramic API running on 0.0.0.0:7007'

@stbrody
Copy link
Contributor Author

stbrody commented Jul 23, 2024

Increased the timeout to 15 seconds because in my local testing the node was taking more than 10 seconds to sync the model fairly regularly.

Interestingly, how long it takes the node to sync the model on an brand new clean node seems to vary quite wildly. I saw it sync within milliseconds, the data already being present by the time js-ceramic tried to load it the first time, and I also saw it take as long as almost 20 seconds at least once.

FYI @dav1do @nathanielc

@stbrody stbrody requested a review from gvelez17 July 23, 2024 22:11
*/
async _waitForMetricsModel(model: StreamID): Promise<boolean> {
const maxWaitDuration = 1000 * 15 // 15 seconds
const retryInterval = 100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think its going to take more than .1 sec? a cascading backoff might be nice. but this is fine for now.

Copy link
Contributor

@gvelez17 gvelez17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. the retry interval might be a bit short, an incremental backoff might be better? but this is fine for now, at worst it will just fail to do node metrics and have a 15 sec delay so its fine - thx for the error handling!

Base automatically changed from subcribe-node-metrics-aes-232 to develop July 24, 2024 15:19
@stbrody
Copy link
Contributor Author

stbrody commented Jul 24, 2024

Closing in favor of #3268

@stbrody stbrody closed this Jul 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants