Reorg handling may be broken in 0.20 #3940

Open
victorkirov opened this issue Sep 11, 2024 · 11 comments

@victorkirov
Contributor

There was a reorg on Testnet around block 2904360. Ord 0.20 successfully picked it up and started the rollback, but the rollback has been stuck for over 2 hours, whereas it used to be close to instant. This is happening on multiple instances of the indexer, not just one, so it's not an outlier.

[2024-09-11T05:51:03Z INFO  ord::index] 5 block deep reorg detected at height 2904360
[2024-09-11T05:51:03Z INFO  ord::index::reorg] rolling back database after reorg of depth 5 at height 2904360

Could something have broken in a new version of redb?

@victorkirov
Contributor Author

victorkirov commented Sep 11, 2024

It also looks like whatever state the process is in is blocking shutdown. After the first SIGTERM, it logs that it's shutting down gracefully and that you can force shutdown by pressing ctrl-c again. A second SIGTERM usually kills the process at this point, but it's not responding. A SIGKILL also does nothing, though I don't think SIGKILLs are currently handled.

@raphjaph
Collaborator

Hey, thanks for opening this issue! Unfortunately our testnet instance was still running v19 so I can't check this myself. I've just updated the server though. What does it say on the /status page of the server?

SIGKILL should always work. Could you maybe provide a ps or top output? You can also send it to me privately at raphjaph AT protonmail.com.

@raphjaph raphjaph added the bug label Sep 13, 2024
@raphjaph raphjaph self-assigned this Sep 13, 2024
@raphjaph
Collaborator

I just deliberately put testnet ord into recovery mode and SIGKILL worked for me. Weird that it doesn't work for you. I'll try simulating reorgs on regtest next and see what happens.

@victorkirov
Contributor Author

Unfortunately, I already reverted the instance back to 0.19. I'll start a new 0.20 one and try to simulate a reorg, but it may take some time.

As for SIGKILL, I just tried sending one to a normally running instance and it's ignored completely 👀 Maybe this is due to it running inside a container, but that shouldn't affect the signal, as I'm sending it from a bash terminal inside the container (kill -9 1).

@raphjaph
Collaborator

On regtest it seems to recover without a problem.
I did cargo run env on master, then bitcoin-cli -datadir=env generatetoaddress 10 <ADDRESS>, then bitcoin-cli -datadir=env invalidateblock <BLOCK_HASH>, and then bitcoin-cli -datadir=env generatetoaddress 10 <ADDRESS>.

[2024-09-17T06:31:58Z INFO  ord::index::updater] Committing at block height 265, 2 outputs traversed, 3 in map, 0 cached
[2024-09-17T06:32:17Z INFO  ord::index] 6 block deep reorg detected at height 265
[2024-09-17T06:32:17Z INFO  ord::index::reorg] rolling back database after reorg of depth 6 at height 265
[2024-09-17T06:32:17Z INFO  ord::index::reorg] successfully rolled back database to height 250
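
For anyone comparing their own logs against the two excerpts in this thread, here is a minimal Rust sketch that scans log text and reports whether a rollback that started also logged completion. It relies only on the message wording quoted above ("rolling back database after reorg of depth N at height H" and "successfully rolled back database to height H"). The names `RollbackStatus`, `number_after`, and `rollback_status` are made up for this sketch and are not part of ord.

```rust
/// Result of scanning a log excerpt for the rollback messages quoted in this thread.
/// Hypothetical type for illustration; not part of ord.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RollbackStatus {
    /// No "rolling back database" line was seen.
    NoRollback,
    /// A rollback started, but no completion line followed it.
    Pending { depth: u32, height: u32 },
    /// A rollback started and a completion line followed it.
    Completed { depth: u32, height: u32, rolled_back_to: u32 },
}

/// Parses the first whitespace-delimited token after `marker` as a u32.
fn number_after(line: &str, marker: &str) -> Option<u32> {
    let start = line.find(marker)? + marker.len();
    line[start..].split_whitespace().next()?.parse().ok()
}

/// Walks the log line by line and tracks the most recent rollback.
fn rollback_status(log: &str) -> RollbackStatus {
    let mut status = RollbackStatus::NoRollback;
    for line in log.lines() {
        if line.contains("rolling back database after reorg") {
            if let (Some(depth), Some(height)) = (
                number_after(line, "reorg of depth "),
                number_after(line, "at height "),
            ) {
                status = RollbackStatus::Pending { depth, height };
            }
        } else if line.contains("successfully rolled back database") {
            if let (RollbackStatus::Pending { depth, height }, Some(rolled_back_to)) =
                (status, number_after(line, "to height "))
            {
                status = RollbackStatus::Completed { depth, height, rolled_back_to };
            }
        }
    }
    status
}
```

A `Pending` result that persists long after the log timestamp would match the stuck behaviour reported at the top of this issue, while the regtest excerpt above yields `Completed`.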

@victorkirov
Contributor Author

Hmm, I guess there may have been something else wrong. It's just really strange that it happened across multiple instances. I'll try upgrading again and see if it happens. I'll close this ticket for now and reopen it if I can reproduce the problem and get more info.

Thanks for looking into it 🙌

@dcorral

dcorral commented Oct 3, 2024

A mainnet reorg is crashing on my side as well. Could we reopen this to see if we can get to the root of the issue? 🙏

@victorkirov victorkirov reopened this Oct 3, 2024
@raphjaph
Collaborator

raphjaph commented Oct 3, 2024

What block was the reorg at and do you have any log outputs?

@dcorral

dcorral commented Oct 4, 2024

I have no logs, sorry, but the block was 863888. Will report back if anything comes up.

@dcorral

dcorral commented Oct 6, 2024

Got the panic below while testing on regtest, with these steps:

  • generate 101 blocks
  • invalidate block 99
  • generate 3 blocks

thread '<unnamed>' panicked at src/index/reorg.rs:66:76:
called `Option::unwrap()` on a `None` value
stack backtrace:
0: 0x5969b7ebadb6 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h410d4c66be4e37f9
1: 0x5969b7eeb140 - core::fmt::write::he40921d4802ce2ac
2: 0x5969b7eb697f - std::io::Write::write_fmt::h5de5a4e7037c9b20
3: 0x5969b7ebab94 - std::sys_common::backtrace::print::h11c067a88e3bdb22
4: 0x5969b7ebc417 - std::panicking::default_hook::{{closure}}::h8c832ecb03fde8ea
5: 0x5969b7ebc179 - std::panicking::default_hook::h1633e272b4150cf3
6: 0x5969b7ebc8a8 - std::panicking::rust_panic_with_hook::hb164d19c0c1e71d4
7: 0x5969b7ebc749 - std::panicking::begin_panic_handler::{{closure}}::h0369088c533c20e9
8: 0x5969b7ebb2b6 - std::sys_common::backtrace::__rust_end_short_backtrace::hc11d910daf35ac2e
9: 0x5969b7ebc4d4 - rust_begin_unwind
10: 0x5969b720ebe5 - core::panicking::panic_fmt::ha6effc2775a0749c
11: 0x5969b720eca3 - core::panicking::panic::h44790a89027c670f
12: 0x5969b720eb36 - core::option::unwrap_failed::hcb3a256a9f1ca882
13: 0x5969b73e2dbe - ord::index::reorg::Reorg::handle_reorg::h5ff6b80ae2a95596
14: 0x5969b726160b - ord::index::Index::update::h7a56ebe30e595534
15: 0x5969b7817375 - std::sys_common::backtrace::__rust_begin_short_backtrace::hbf31dcab6cd6a497
16: 0x5969b77686cf - core::ops::function::FnOnce::call_once{{vtable.shim}}::h6189ab0ff9d9cd46
17: 0x5969b7ec1bc5 - std::sys::pal::unix::thread::Thread::new::thread_start::h3631815ad38387d6
18: 0x7bdadcfb439d - <unknown>
19: 0x7bdadd03949c - <unknown>
20: 0x0 - <unknown>

I don't see this every time I run the steps above, though.

ord version: 0.20.0
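
The backtrace shows an `Option::unwrap()` on `None` inside `Reorg::handle_reorg`, but nothing in this thread says which lookup returned `None`. As a rough Rust sketch of the general remedy, assume purely for illustration that the value comes from picking the oldest entry of a possibly empty list of savepoint ids. The function `oldest_savepoint`, the `RollbackError` type, and the idea that smaller ids are older are all assumptions of this sketch, not ord's actual code.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; not part of ord.
#[derive(Debug, PartialEq)]
enum RollbackError {
    /// There was no savepoint to roll back to.
    NoSavepoint,
}

impl fmt::Display for RollbackError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RollbackError::NoSavepoint => write!(f, "no savepoint available to roll back to"),
        }
    }
}

impl std::error::Error for RollbackError {}

/// Returns the smallest savepoint id, treating smaller ids as older (an assumption).
/// An empty list yields an error instead of panicking, unlike `.min().unwrap()`.
fn oldest_savepoint(savepoint_ids: &[u64]) -> Result<u64, RollbackError> {
    savepoint_ids
        .iter()
        .copied()
        .min()
        .ok_or(RollbackError::NoSavepoint)
}
```

With a `Result` in place of the `unwrap()`, a caller could log the error and decide how to proceed rather than having the indexing thread panic. Whether that is the right behaviour for ord is for the maintainers to judge.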

@gus4rs

gus4rs commented Oct 9, 2024

Version 0.20.1 dies after a reorg, doesn't respond to the graceful shutdown signal, and must be killed abruptly, screwing up the database. This happened to me on testnet3.
