
Enable a sequencer to re-sequence batches #894

Merged: 21 commits into 0xPolygonHermez:zkevm on Sep 20, 2024

Conversation

@cffls commented Aug 1, 2024

Feature PR for #667

Overall workflow:

  • Stop sequencer

  • Unwind to the target batch, using the command in https://github.com/0xPolygonHermez/cdk-erigon/blob/zkevm/cmd/integration/commands/stage_stages_zkevm.go, e.g.:

    CDK_ERIGON_SEQUENCER=1 go run ./cmd/integration state_stages_zkevm --config=/path/to/config.yaml --unwind-batch-no=8 --chain dynamic-kurtosis --datadir /path/to/chain/data

  • Backup data stream files in case the files are corrupted during re-sequencing

  • Restart sequencer

  • The sequencer will detect rollback and start re-sequencing automatically

  • On re-sequencing, L2 blocks will either be the same as before or be split into multiple blocks, but they can never be merged. Some properties of the rollback behavior (a minimal sketch of these checks follows this list):

    • If a block is split into multiple blocks after re-sequencing, the split blocks will have the same timestamp as the original block.
    • A new L2 block won’t include transactions that belonged to different blocks that got rolled back.
    • A new sequenced batch won’t include blocks that belonged to different batches that got rolled back.
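
To make the block-level rollback properties concrete, here is a minimal Go sketch of the kind of invariant check they imply. The `Block` type and `checkResequencedBlock` function are hypothetical and purely illustrative; they are not the actual types or code used in cdk-erigon.

```go
package main

import (
	"errors"
	"fmt"
)

// Block is a simplified, hypothetical stand-in for an L2 block; the real
// structures in cdk-erigon differ.
type Block struct {
	BatchNo   uint64
	Timestamp uint64
	TxHashes  []string
}

// checkResequencedBlock verifies the two block-level properties listed
// above: a block produced by splitting keeps the original block's
// timestamp, and it only contains transactions from that single
// original block.
func checkResequencedBlock(newBlock, original Block) error {
	if newBlock.Timestamp != original.Timestamp {
		return errors.New("split block must keep the original block's timestamp")
	}
	originalTxs := make(map[string]struct{}, len(original.TxHashes))
	for _, h := range original.TxHashes {
		originalTxs[h] = struct{}{}
	}
	for _, h := range newBlock.TxHashes {
		if _, ok := originalTxs[h]; !ok {
			return fmt.Errorf("tx %s did not belong to the original block", h)
		}
	}
	return nil
}

func main() {
	original := Block{BatchNo: 8, Timestamp: 1700000000, TxHashes: []string{"0xaa", "0xbb"}}
	split := Block{BatchNo: 8, Timestamp: 1700000000, TxHashes: []string{"0xaa"}}
	fmt.Println(checkResequencedBlock(split, original)) // prints <nil>
}
```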

Tested the following rollback scenarios in kurtosis (a sketch of the strict-mode behavior follows the list).
The original chain (before re-sequencing) contains both empty blocks and non-empty blocks with uniswap transactions.

  1. Condition: ResequenceStrict=true. No L1InfoUpdate on L1. No counter limit change.
    Result: Re-sequenced L2 blocks have exactly the same block hashes as previously created.
  2. Condition: ResequenceStrict=true. No L1InfoUpdate on L1. Lowered counter limit.
    Result: An error was raised because a previous batch had to be split into two batches.
  3. Condition: ResequenceStrict=false. No L1InfoUpdate on L1. Lowered counter limit.
    Result: Re-sequencing completed with more batches than before.
  4. Condition: ResequenceStrict=true. Multiple L1InfoUpdates on L1. No counter limit change.
    Result: The same number of blocks and batches were re-sequenced. L2 block hashes were the same.
  5. Condition: ResequenceStrict=false. Multiple L1InfoUpdates on L1. No counter limit change.
    Result: The same number of blocks and batches were re-sequenced. L2 block hashes were different because different L1InfoUpdate indexes were used.

@Stefan-Ethernal (Collaborator) left a comment

Leaving some early feedback.

Review threads on cmd/utils/flags.go, zk/datastream/client/stream_client.go, zk/tx/tx.go, and zk/stages/stage_sequence_execute.go (all resolved).
@cffls marked this pull request as ready for review on August 2, 2024 at 15:29
@hexoscott (Collaborator) commented Aug 5, 2024

Hey @cffls a few thoughts on the changes here:

  • the sequencer is the master source of the datastream on the network, so connecting to itself using a stream client to get the batches to re-sequence might not be a good approach.
  • the sequencer code already has quite a few branches in logic around limbo and L1 recovery, so a third re-sequencing branch might get confusing.

To my knowledge the process for marking a batch as bad would be along the lines of:

  • restart sequencer with a flag to mark an explicit stop batch height
  • allow it to run to this height where it will pause
  • let the sequence sender etc. catch up and get the network in a verified state
  • stop the sequencer
  • make the TX to the L1 to mark a batch as bad
  • wait for the block including this TX to be finalised on the L1
  • start the sequencer in L1 recovery mode...

This last step, in my head, would start the code we already have for L1 recovery, but some new L1 tracking code would be scanning the L1 for this bad-batch event and storing it in the database. Once we know the batch is bad, we can immediately roll back the datastream to the last good batch, along with the database, using the normal unwind feature. From there the normal L1 recovery code would start, but as part of our "bad batch" checks during recovery we'd have one more check for this scenario; when we encounter it we simply mark the batch as bad, as we do already, and move on with normal recovery. This way blocks and batches would be an exact replica of the L1 data, apart from skipping the bad batch. Once L1 recovery has completed to the previous tip, you can restart the sequencer as normal and the network continues.

I think this approach works out simpler and we don't need to do anything fancy around splitting blocks or handling timestamps, we just recreate the L1 exactly as it was before. This has the advantage of also handling things like parent hashes automatically without fuss.
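
A rough, purely illustrative Go outline of the flow described above (not code from this PR; the names are invented): detect the bad batch from the L1 event, unwind to the last good batch, then replay from L1 while skipping the bad one.

```go
package main

import "fmt"

// recoveryCtx and recoverFromBadBatch are hypothetical; they only trace the
// steps described in the comment above.
type recoveryCtx struct {
	badBatch      uint64 // batch marked as bad via the L1 transaction
	lastGoodBatch uint64 // batch to unwind the datastream and db back to
}

func recoverFromBadBatch(ctx recoveryCtx, previousTip uint64) {
	// Roll back the datastream and database using the normal unwind feature.
	fmt.Printf("unwinding datastream and db to batch %d\n", ctx.lastGoodBatch)

	// Normal L1 recovery then replays batches exactly as they were sequenced,
	// with one extra check that skips the batch marked bad.
	for batch := ctx.lastGoodBatch + 1; batch <= previousTip; batch++ {
		if batch == ctx.badBatch {
			fmt.Printf("batch %d marked bad, skipping\n", batch)
			continue
		}
		fmt.Printf("replaying batch %d from L1 data\n", batch)
	}
}

func main() {
	recoverFromBadBatch(recoveryCtx{badBatch: 9, lastGoodBatch: 8}, 11)
}
```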

What do you think?

@hexoscott (Collaborator) commented

I think maybe something has been missed here with regard to the banana fork requirement that removing the batch is something that happens on the L1 to affect the chain's history.

@hexoscott (Collaborator) left a comment

Hi @cffls - after the discussion yesterday and looking over the changes in a new light, it's looking good and making a lot more sense approach-wise to me now.

There are a couple of things that would help here I think:

  1. Making this process really explicit by using a flag (sketched after this comment); if it isn't set, the sequencer won't perform the checks to see if it is behind and will just behave as normal.
  2. Rather than using a datastream client pointing back to itself, we have the data_stream_server.go file that should let you work with the stream without needing the client, and it will work directly with the files on disk. That approach seems cleaner to me.

If we can make those changes I think we'll be in a safe place to test this and get it merged.
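
To illustrate the two suggestions, here is a minimal hypothetical sketch of the flag gating: the re-sequencing check only runs when the flag is set, and the "am I behind?" decision compares the local database against the datastream files on disk (the role of the datastream server) rather than dialling the sequencer with a stream client. All names and fields below are invented for illustration; they are not the flag or functions added in the PR.

```go
package main

import "fmt"

// sequencerConfig is a hypothetical stand-in for the sequencer configuration.
type sequencerConfig struct {
	Resequence bool // would be wired up from a CLI flag, e.g. in cmd/utils/flags.go
}

// shouldResequence returns true only when re-sequencing is explicitly
// enabled and the datastream files on disk are ahead of the unwound
// database, i.e. there are previously sequenced batches to replay.
func shouldResequence(cfg sequencerConfig, highestDBBatch, highestStreamBatch uint64) bool {
	if !cfg.Resequence {
		return false // flag unset: behave exactly like a normal sequencer
	}
	return highestStreamBatch > highestDBBatch
}

func main() {
	cfg := sequencerConfig{Resequence: true}
	fmt.Println(shouldResequence(cfg, 8, 12)) // true: the stream is ahead after an unwind to batch 8
}
```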

@cffls (Author) commented Aug 6, 2024

> There are a couple of things that would help here I think:
> 1. Making this process really explicit by using a flag; if it isn't set, the sequencer won't perform the checks to see if it is behind and will just behave as normal.
> 2. Rather than using a datastream client pointing back to itself, we have the data_stream_server.go file that should let you work with the stream without needing the client, and it will work directly with the files on disk. That approach seems cleaner to me.

Thanks @hexoscott. I've addressed your feedback accordingly.

@hexoscott (Collaborator) commented

Great 😄 could you give me a nudge once the latest zkevm branch stuff is merged in so I don't miss it 🙏

@cffls (Author) commented Aug 7, 2024

Thanks @hexoscott. I've merged from the zkevm branch and resolved the conflicts.

@hexoscott (Collaborator) left a comment

Nice updates 😄 just one little thing around the use of the flag in the sequencing stage

Review thread on zk/stages/stage_sequence_execute.go (resolved).

sonarcloud bot commented Sep 3, 2024

@hexoscott (Collaborator) left a comment

Code looks good, just a few comments from the last time I had a peek at it.

Review threads on cmd/integration/commands/stage_stages_zkevm.go, zk/stages/stage_sequence_execute.go, and zk/stages/stage_sequence_execute_utils.go (all resolved).
@cffls merged commit 292a2ec into 0xPolygonHermez:zkevm on Sep 20, 2024
13 of 16 checks passed