Properly handle difference between restoring only model weights and full training state #131

Open
joeloskarsson opened this issue Feb 17, 2025 · 0 comments
Labels: enhancement (New feature or request)

Comments

@joeloskarsson (Collaborator)

The --restore_opt argument is supposed to differentiate between restoring only the model weights and restoring the full training state, including the optimizer state (the crucial part), the epoch number, etc. This is currently implemented as a bit of a hack:

if not self.restore_opt:
    # Overwrite the checkpoint's optimizer state with a freshly initialized one
    opt = self.configure_optimizers()
    checkpoint["optimizer_states"] = [opt.state_dict()]
This hack can easily start causing problems as we build on it.

A proper way to handle this in Lightning is to differentiate between instantiating the model with load_from_checkpoint (restoring weights only) and calling Trainer.fit with ckpt_path (restoring the full training state). An example implementation that could be used is given here: joeloskarsson@e7d11c9
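
For reference, a minimal sketch (not the linked implementation) of how the two restore modes could be separated around Trainer.fit. The function name, model_class, data_module, and argument names (args.load, args.restore_opt, args.epochs) are assumptions for illustration:

```python
import pytorch_lightning as pl


def run_training(model_class, args, data_module):
    """Start or resume training, separating weights-only from full-state restore."""
    trainer = pl.Trainer(max_epochs=args.epochs)

    if args.load and args.restore_opt:
        # Full resume: ckpt_path makes Lightning restore model weights,
        # optimizer state, LR schedulers and the epoch/step counters.
        model = model_class(args)
        trainer.fit(model, datamodule=data_module, ckpt_path=args.load)
    else:
        if args.load:
            # Weights only: load_from_checkpoint restores the parameters,
            # while the optimizer and training counters start fresh.
            model = model_class.load_from_checkpoint(args.load, args=args)
        else:
            model = model_class(args)
        trainer.fit(model, datamodule=data_module)
```

With this split the checkpoint dict is never mutated by hand, so the --restore_opt flag no longer needs the optimizer_states workaround above.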

@joeloskarsson added the enhancement label Feb 17, 2025