Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Give the option to terminate the engine without firing Events.COMPLET… #3309

Merged
merged 20 commits into from
Dec 3, 2024

Conversation

bonassifabio
Copy link
Contributor

@bonassifabio bonassifabio commented Dec 2, 2024

As per this Pytorch discussion post, This pull request aims to allow the user to terminate an engine (via engine.terminate()) without consequently firing Events.COMPLETED. This is done by passing the optional argument skip_event_completed=True to the method terminate().

While the default behavior is still to fire Events.COMPLETED after termination, it is useful to allow the user to do otherwise. This might be useful, for example, when one of handlers of Events.COMPLETED needs to be executed only upon graceful termination of the Engine.

Alternative solutions to the one proposed in this PR could be:

  • include a flag has_been_terminated that is set to True when Events.TERMINATE is triggered
  • include a new filter to Events.COMPLETED, like Events.COMPLETED(terminated=True), which would be perhaps the most "ignitic" way to add this functionality.

Other changes:

Fixed the table in the Events docs.

Related issues:

#3308

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

…ED. The default behaviour is not changed.

Note that even though Events.COMPLETED is not fired, its timer is updated.
Copy link
Collaborator

@vfdev-5 vfdev-5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @bonassifabio !
I made the first pass on the code and left some comments, let me know what you think.

ignite/engine/engine.py Outdated Show resolved Hide resolved
ignite/engine/engine.py Outdated Show resolved Hide resolved
ignite/engine/engine.py Outdated Show resolved Hide resolved
ignite/engine/engine.py Outdated Show resolved Hide resolved
ignite/engine/engine.py Outdated Show resolved Hide resolved
ignite/engine/engine.py Outdated Show resolved Hide resolved
ignite/engine/events.py Outdated Show resolved Hide resolved
@bonassifabio
Copy link
Contributor Author

Thanks for your comments! At this point skip_event_completed should be renamed to skip_completed also in the docs, as per commit f1d194a.

ignite/engine/engine.py Outdated Show resolved Hide resolved
- Do not update self.state.times[Events.COMPLETED.name]  if terminated
- Fixed unit test
@bonassifabio
Copy link
Contributor Author

I just pushed 5f2cd88 where I fixed the unit test and hopefully the docs. I also followed your suggestion concerning the update of self.state.times[Events.COMPLETED.name]

ignite/engine/engine.py Outdated Show resolved Hide resolved
@vfdev-5
Copy link
Collaborator

vfdev-5 commented Dec 2, 2024

Can you also please run the following to fix code formatting problems:

bash ./tests/run_code_style.sh install
bash ./tests/run_code_style.sh fmt

ignite/engine/engine.py Outdated Show resolved Hide resolved
bonassifabio and others added 3 commits December 3, 2024 00:38
- Engine time logging moved out of the if clause. In the log message "completed" has been replaced with "finished" to avoid confusion.
- Same changes applied to the method `_internal_run_legacy()`
Sorry for accidentally including it into the previous commit!
ignite/engine/events.py Outdated Show resolved Hide resolved
- COMPLETED : triggered when engine's run is completed
- COMPLETED : triggered when engine's run is completed or terminated with
:meth:`~ignite.engine.engine.Engine.terminate()`, unless the flag
`skip_completed` is set to True.

The table below illustrates which events are triggered when various termination methods are called.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bonassifabio actually, I find myself as well this table very misleading and hard to understand.
I think we can improve it in the following way:

  • EVENT_COMPLETED -> EPOCH_COMPLETED
  • Let's add new column on the right after "TERMINATE" and call it "COMPLETED"
  • Check symbol for terminate() line on TERMINATE_SINGLE_EPOCH column is wrong actually and should be replaced with x.

By the way, here is how it is generated now: https://deploy-preview-3309--pytorch-ignite-preview.netlify.app/generated/ignite.engine.events.events#ignite.engine.events.Events

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I was interpreting the first column as Events.COMPLETED, but from your comment it would seem that's not the case.

What if we change the table to something like this?

Method TERMINATE_SINGLE_EPOCH EPOCH_COMPLETED TERMINATE COMPLETED
No termination x x
terminate_epoch() x x x
terminate() x x
terminate(skip_completed=True) x x x

Few comments:

  1. To my understanding (please correct me if I'm wrong) if terminate() is called, the epoch is not necessarily completed. If this statement is true, it would perhaps make more sense to move EPOCH_COMPLETED before TERMINATE in the list of Events, as I did in the table, since EPOCH_COMPLETED would almost never be fired after EPOCH_COMPLETED.
  2. The columns would thus be ordered so that the first two are "epoch wise", while the last two happen at the end of the engine run. Not sure I
  3. I included a new row for terminate(skip_completed=True). Not sure if that's really necessary.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the new table, looks good!
Few corrections however about its content:

The line terminate_epoch(), column "EPOCH_COMPLETED" and "COMPLETED" should be both checked as v.

About your comments, I'm OK with all your suggestions, let's keep this order and the new line.

Checking code:

from ignite.engine import Engine, Events
from ignite.utils import setup_logger, logging


train_data = range(10)
max_epochs = 5


def train_step(engine, batch):
    pass

trainer = Engine(train_step)

# Enable trainer logger for a debug mode
trainer.logger = setup_logger("trainer", level=logging.DEBUG)

@trainer.on(Events.ITERATION_COMPLETED(once=12))
def call_terminate_epoch():
    trainer.terminate_epoch()

trainer.run(train_data, max_epochs=max_epochs)

Output:

...
2024-12-03 09:06:33,056 trainer INFO: Terminate current epoch is signaled. Current epoch iteration will stop after current iteration is finished.
2024-12-03 09:06:33,058 trainer DEBUG: 2 | 12, Firing handlers for event Events.TERMINATE_SINGLE_EPOCH
2024-12-03 09:06:33,060 trainer DEBUG: 2 | 12, Firing handlers for event Events.EPOCH_COMPLETED
2024-12-03 09:06:33,061 trainer INFO: Epoch[2] Complete. Time taken: 00:00:00.023
2024-12-03 09:06:33,064 trainer DEBUG: 3 | 12, Firing handlers for event Events.EPOCH_STARTED
2024-12-03 09:06:33,066 trainer DEBUG: 3 | 12, Firing handlers for event Events.GET_BATCH_STARTED
...

You can run it in https://pytorch-ignite.ai/playground

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing this out!

  1. Do you think we should add an argument skip_completed to terminate_epoch() with the same behavior of suppressing the firing of Events.EPOCH_COMPLETED? That might be useful, for example, to avoid running the evaluator, the checkpointer, and the LR scheduler if an epoch has been terminated.
  2. I agree on checking EPOCH_COMPLETED for terminate_epoch(). However, I think that checking COMPLETED might seem to imply that Events.COMPLETED is also fired by terminate_epoch() right after Events.EPOCH_COMPLETED. What if we leave that cell empty?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should add an argument skip_completed to terminate_epoch() with the same behavior of suppressing the firing of Events.EPOCH_COMPLETED?

Good idea! Let's make this in a separate PR such that this one does not become too large

I agree on checking EPOCH_COMPLETED for terminate_epoch(). However, I think that checking COMPLETED might seem to imply that Events.COMPLETED is also fired by terminate_epoch() right after Events.EPOCH_COMPLETED. What if we leave that cell empty?

well, I was seeing this table as showing which events are triggered when we call terminate*() functions. I did not understand v mark on COMPLETED for terminate_epoch as the sequence of triggered events, but just whether an event is triggered or not. To avoid misunderstanding we can add a column between EPOCH_COMPLETED and TERMINATE named as for example "Other events" and check it where it is appropriate. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! Let's make this in a separate PR such that this one does not become too large

Cool, I'll create a new PR when we're done with this one

I did not understand v mark on COMPLETED for terminate_epoch as the sequence of triggered events, but just whether an event is triggered or not.

I see the point, but doesn't the firing of TERMINATE and COMPLETED independent of terminate_epoch()?
See for example this code.

from ignite.engine import Engine, Events
from ignite.utils import setup_logger, logging

train_data = range(10)
max_epochs = 5


def train_step(engine, batch):
    pass

trainer = Engine(train_step)

# Enable trainer logger for a debug mode
trainer.logger = setup_logger("trainer", level=logging.DEBUG)

@trainer.on(Events.ITERATION_COMPLETED(once=12))
def call_terminate_epoch():
    trainer.terminate_epoch()

@trainer.on(Events.ITERATION_COMPLETED(once=15))
def call_terminate():
    trainer.terminate(skip_completed=True)

trainer.run(train_data, max_epochs=max_epochs)

This is probably an edge case. As a newbie of ignite, it would honestly make more sense to see no entry for TERMINATE and COMPLETED for terminate_epoch(). However, you are for sure more experienced than me, so we can go ahead with your solution if you prefer 🙂

Solution 1

Method TERMINATE_SINGLE_EPOCH EPOCH_COMPLETED TERMINATE COMPLETED
No termination x x
terminate_epoch() x
terminate() x x
terminate(skip_completed=True) x x x

Solution 2

Method TERMINATE_SINGLE_EPOCH EPOCH_COMPLETED TERMINATE COMPLETED
No termination x x
terminate_epoch() x x
terminate() x x
terminate(skip_completed=True) x x x

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the point, but doesn't the firing of TERMINATE and COMPLETED independent of terminate_epoch()?

yes, they should be independent, i would say. If this is not the case, it should be a bug...

As a newbie of ignite, it would honestly make more sense to see no entry for TERMINATE and COMPLETED for terminate_epoch(). However, you are for sure more experienced than me, so we can go ahead with your solution if you prefer 🙂

Actually, your feedback is more important as other users may think the same (and those who know how it works do not read the docs :) ) !

Initially, the goal of the table is to provide visual representation of the info on which events are triggered in each case (when call terminate*()) and also compared with regular run.
Now, seeing its content I find this tabular representation more misleading than useful as triggered events depend on where the terminate function was called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would propose to keep the table (either Table 1 or Table 2), and then to open a new issue to seek some input also from other developers/users and discuss the problem more in detail there.
We could for example think to a flow chart diagram or something like that...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I would remove it and maybe added some code output to show exactly how it works...
OK, up to you, either let's revert the changes or keep the most understandable version and update it later.

Copy link
Collaborator

@vfdev-5 vfdev-5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for working on this PR @bonassifabio !
I have few comments how to improve a bit the tests

tests/ignite/engine/test_engine.py Outdated Show resolved Hide resolved
tests/ignite/engine/test_engine.py Outdated Show resolved Hide resolved
tests/ignite/engine/test_engine.py Outdated Show resolved Hide resolved
@vfdev-5
Copy link
Collaborator

vfdev-5 commented Dec 3, 2024

@bonassifabio we have now the CI completely broken due to yesterday change of pynvml package what we should replace (temporary patch is here #3310, once it is merged, please update your PR). By the way, if you have suggestions on a similar package to be used, we are open to suggestions :)

Update requirements-dev.txt (#3310)
@bonassifabio
Copy link
Contributor Author

@vfdev-5 I am not sure why some test is failing, seems unrelated to the changes in this PR. Any input?

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Dec 3, 2024

@vfdev-5 I am not sure why some test is failing, seems unrelated to the changes in this PR. Any input?

We can ignore failing hvd tests

Copy link
Collaborator

@vfdev-5 vfdev-5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bonassifabio, LGTM!
We can either fix the table here or in a follow-up PR

@bonassifabio
Copy link
Contributor Author

Thanks @bonassifabio, LGTM! We can either fix the table here or in a follow-up PR

Cool, applied minimal changes to the table and will open an issue to discuss how to make it more clear.

@vfdev-5 vfdev-5 merged commit 6f8ad2a into pytorch:master Dec 3, 2024
17 of 18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: engine Engine module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants