The main part used a relatively simple training function to keep the code readable and to fit Part 4 within the page limits. Optionally, we can add a linear learning rate warm-up, a cosine decay schedule, and gradient clipping to improve training stability and convergence.
You can find the code for this more sophisticated training function in Appendix B: Adding Bells and Whistles to the Training Loop.
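To make the three additions concrete, here is a minimal, framework-free sketch of the underlying arithmetic. It is not the training function from Appendix B. The function names (`lr_at_step`, `clip_by_global_norm`) and all parameters (`warmup_steps`, `peak_lr`, `initial_lr`, `min_lr`, `max_norm`) are hypothetical names chosen for illustration, and gradients are represented as a flat list of floats rather than tensors.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, peak_lr,
               initial_lr=0.0, min_lr=0.0):
    """Return the learning rate for a 0-based training step.

    Assumed schedule: linear warm-up from initial_lr to peak_lr over
    warmup_steps, then cosine decay from peak_lr down to min_lr.
    """
    if step < warmup_steps:
        # Linear warm-up: interpolate between initial_lr and peak_lr.
        return initial_lr + (peak_lr - initial_lr) * step / warmup_steps

    # Cosine decay: progress runs from 0.0 (end of warm-up) to 1.0 (last step).
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    # Gradients already within the limit are returned unchanged.
    return list(grads)


if __name__ == "__main__":
    # Print the schedule at a few steps of a hypothetical 100-step run.
    for step in (0, 5, 10, 55, 100):
        print(step, lr_at_step(step, total_steps=100, warmup_steps=10, peak_lr=1e-3))
    print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
```

In a PyTorch training loop, the same idea would typically mean writing the scheduled value into each optimizer parameter group's `"lr"` entry before the update and calling `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` after the backward pass. The `max_norm=1.0` value here is only an example setting.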