Loss becomes NaN after training for a while #230

Open
zhanglv0209 opened this issue Nov 26, 2024 · 4 comments

Comments

@zhanglv0209

zhanglv0209 commented Nov 26, 2024

After training for a while, the loss becomes NaN. What causes this? Is something wrong?
Steps: 2%|▍ | 92679/5900000 [19:15:43<1185:20:02, 1.36it/s, backward=0.177, data=0.00789, img_process=0.00704, loss=nan, lr=4.99e-5, unet=0.0986, vae=0.0144]
Meanwhile, the images in val look like this:
[image]
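
When a run like the one above turns to `loss=nan`, it can help to know the first step at which the loss stopped being finite, so that an earlier checkpoint can be inspected. The sketch below is framework-agnostic and works on plain Python floats. The function name `first_non_finite_step` and the `(step, loss)` pair format are hypothetical and not part of this project; in a PyTorch loop the equivalent per-step check would be `torch.isfinite(loss)`.

```python
import math
from typing import Iterable, Optional, Tuple


def first_non_finite_step(losses: Iterable[Tuple[int, float]]) -> Optional[int]:
    """Return the first step whose loss is NaN or infinite, or None if all are finite.

    `losses` is an iterable of (step, loss) pairs. This format is an
    assumption for illustration, not the project's own log format.
    """
    for step, loss in losses:
        if not math.isfinite(loss):
            return step
    return None


# Hypothetical logged values, not taken from the run in this issue.
history = [(1, 0.52), (2, 0.48), (3, float("nan")), (4, float("nan"))]
print(first_non_finite_step(history))
```

Stopping at the first non-finite loss, instead of letting the run continue, avoids spending hours on steps that can no longer update the weights meaningfully.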

@Echo-jyt


Hi, have you managed to solve this?

@foreverhell

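I ran into the same problem: the loss became NaN after roughly 5,000 steps. Did you train from scratch, or did you continue from pretrained weights?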

@zhanglv0209
Author

> Hi, have you managed to solve this?

Resuming from a checkpoint does not work. When I started training from scratch, the problem went away.

@zhanglv0209
Author

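> I ran into the same problem: the loss became NaN after roughly 5,000 steps. Did you train from scratch, or did you continue from pretrained weights?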

Resuming from a checkpoint does not work. When I started training from scratch, the problem went away.
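
The thread does not say why resuming fails. One thing that could be checked before resuming, purely as an assumption, is whether the saved state already contains NaN or infinite values. The sketch below walks a nested structure of dicts, lists, and floats and reports the path of every non-finite entry. The name `non_finite_entries` and the example state layout are hypothetical; a real checkpoint would hold tensors, where a check such as `torch.isfinite(t).all()` would play the same role.

```python
import math
from typing import Any, Iterator


def non_finite_entries(state: Any, prefix: str = "") -> Iterator[str]:
    """Yield the path of every NaN or infinite float in a nested state.

    Handles dicts, lists, tuples, and floats; other values are skipped.
    """
    if isinstance(state, dict):
        for key, value in state.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from non_finite_entries(value, path)
    elif isinstance(state, (list, tuple)):
        for index, value in enumerate(state):
            yield from non_finite_entries(value, f"{prefix}[{index}]")
    elif isinstance(state, float) and not math.isfinite(state):
        yield prefix


# Hypothetical checkpoint layout, for illustration only.
checkpoint = {
    "unet": {"w": [0.1, float("nan")]},
    "optimizer": {"exp_avg": [float("inf")]},
    "step": 5000,
}
print(list(non_finite_entries(checkpoint)))
```

If such a scan reports entries, the checkpoint was likely written after the run had already diverged, and an earlier one would be a safer starting point.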
