storage: possible race in migration logic #4344

Closed · conorsch opened this issue May 7, 2024 · 1 comment · Fixed by #4347
Labels: A-upgrades (Area: Relates to chain upgrades), needs-refinement (unclear, incomplete, or stub issue that needs work)

Comments

conorsch (Contributor) commented May 7, 2024

Describe the bug
When running pd migrate, which I've been doing frequently lately, I'll occasionally hit an error about failing to acquire the rocksdb write lock:

can open database: Error { message: "IO error: lock hold by current process, acquire time 1715101599 acquiring thread 35470: /home/runner/.penumbra/testnet_data/node0/pd/rocksdb/LOCK: No locks available" }

To Reproduce
Steps to reproduce the behavior:

  1. Build a devnet, run it locally
  2. Stop it completely, so there's no chance pd is still accessing the storage dirs
  3. Run cargo run --release --bin pd -- migrate

Expected behavior
The pd migrate command should have no problem gaining a lock on storage, and shouldn't fight with itself.

Additional context

Full logs from a CI run (on #4337)
2024-05-07T17:06:39.654485Z  INFO pd: running command cmd=Migrate { home: None, comet_home: None } version="0.74.0-alpha.1"
2024-05-07T17:06:39.654735Z  INFO cnidarium::storage: opening rocksdb config column path="/home/runner/.penumbra/testnet_data/node0/pd/rocksdb"
2024-05-07T17:06:39.721715Z  INFO cnidarium::storage: reading prefixes from config column family
2024-05-07T17:06:39.723573Z  INFO cnidarium::storage: initializing global store config
2024-05-07T17:06:39.723584Z  INFO cnidarium::storage: creating substore config for prefix prefix="cometbft-data"
2024-05-07T17:06:39.723590Z  INFO cnidarium::storage: creating substore config for prefix prefix="ibc-data"
2024-05-07T17:06:39.723600Z  INFO cnidarium::storage: opening rocksdb path="/home/runner/.penumbra/testnet_data/node0/pd/rocksdb"
2024-05-07T17:06:39.743842Z  INFO cnidarium::storage: snapshot dispatcher task has started
2024-05-07T17:06:39.743948Z  INFO pd: last block timestamp genesis_start=Time(2024-05-07 17:05:24.489867937)
2024-05-07T17:06:39.743962Z  INFO cnidarium::storage: dispatcher task has terminated
2024-05-07T17:06:39.744001Z  INFO pd_migrate: pd: migrating pd state in /home/runner/.penumbra/testnet_data/node0/pd
2024-05-07T17:06:39.744045Z  INFO pd_migrate: cnidarium::storage: opening rocksdb config column path="/home/runner/.penumbra/testnet_data/node0/pd/rocksdb"
thread 'tokio-runtime-worker' panicked at /home/runner/actions-runner/_work/penumbra/penumbra/crates/cnidarium/src/storage.rs:73:60:
can open database: Error { message: "IO error: lock hold by current process, acquire time 1715101599 acquiring thread 35470: /home/runner/.penumbra/testnet_data/node0/pd/rocksdb/LOCK: No locks available" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Error: failed to upgrade state

Caused by:
    task 21 panicked

I don't yet know that this is an actual bug, but filing it early to reference later if need be.
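For reference, the suspected failure mode is two back-to-back opens of the same rocksdb directory: pd first opens storage to read the last block timestamp, then the migration re-opens the same path before the first handle has let go of rocksdb's LOCK file. Roughly (a hypothetical sketch, not the actual pd code; the penumbra_app::SUBSTORE_PREFIXES import path is an assumption):

use std::path::PathBuf;

use anyhow::Context;
use cnidarium::Storage;
use penumbra_app::SUBSTORE_PREFIXES;

// Hypothetical sketch of the suspected sequence, not the actual pd code.
async fn sketch_of_suspected_race(home: PathBuf) -> anyhow::Result<()> {
    let rocksdb = home.join("rocksdb");

    // First open: used only to read the last block timestamp.
    let storage = Storage::load(rocksdb.clone(), SUBSTORE_PREFIXES.to_vec())
        .await
        .context("error loading store for timestamp")?;
    let _snapshot = storage.latest_snapshot();
    // Dropping the handle does not guarantee the rocksdb LOCK is released yet.
    drop(storage);

    // Second open: the migration itself. If the first handle still holds the
    // LOCK file, this is where the "No locks available" panic fires.
    let _storage = Storage::load(rocksdb, SUBSTORE_PREFIXES.to_vec())
        .await
        .context("can open database")?;
    Ok(())
}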

erwanor (Member) commented May 7, 2024

pub async fn last_block_timestamp(home: PathBuf) -> anyhow::Result<tendermint::Time> {
    let rocksdb = home.join("rocksdb");
    let storage = Storage::load(rocksdb, SUBSTORE_PREFIXES.to_vec())
        .await
        .context("error loading store for timestamp")?;
    let state = storage.latest_snapshot();
    let last_block_time = state
        .get_block_timestamp()
        .await
        .context("error reading latest block timestamp")?;
    Ok(last_block_time)
}

I think adding storage.release().await? at the end of this method should help. Unfortunately, Rust does not have a real "async drop" mechanism, so we have to call this manually; granted, it's not well documented.
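For concreteness, a minimal sketch of that change, following the suggestion above (whether release() returns a Result, and so wants the trailing ?, depends on the cnidarium version in use):

pub async fn last_block_timestamp(home: PathBuf) -> anyhow::Result<tendermint::Time> {
    let rocksdb = home.join("rocksdb");
    let storage = Storage::load(rocksdb, SUBSTORE_PREFIXES.to_vec())
        .await
        .context("error loading store for timestamp")?;
    let state = storage.latest_snapshot();
    let last_block_time = state
        .get_block_timestamp()
        .await
        .context("error reading latest block timestamp")?;
    // Explicitly shut down the storage handle so the rocksdb LOCK file is
    // freed before the migration re-opens the same database; Rust has no
    // async Drop, so this has to be called by hand.
    storage.release().await;
    Ok(last_block_time)
}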

conorsch added a commit that referenced this issue May 7, 2024
Closes #4344. Hat tip to @erwanor for spotting the omission.