
Running training on Windows fails with an error and training cannot start. The error output is below, please help take a look. Thanks. #88

Open
120805481 opened this issue Jan 15, 2025 · 1 comment

Comments

@120805481

120805481 commented Jan 15, 2025

[2025-01-15 20:35:09,394] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:697] [c10d] The client socket has failed to connect to [PC-222]:25678 (system error: 10049 - The requested address is not valid in its context.).
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.2.2+cu121 with CUDA 1201 (you have 2.2.2+cu121)
    Python  3.10.11 (you have 3.10.13)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
Traceback (most recent call last):
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\workspace\LatentSync\LatentSync\scripts\train_unet.py", line 32, in <module>
    import diffusers
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\site-packages\diffusers\__init__.py", line 28, in <module>
    from .models import (
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\site-packages\diffusers\models\__init__.py", line 19, in <module>
    from .attention import Transformer2DModel
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\site-packages\diffusers\models\attention.py", line 43, in <module>
    import xformers.ops
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\site-packages\xformers\ops\__init__.py", line 8, in <module>
    from .fmha import (
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\site-packages\xformers\ops\fmha\__init__.py", line 10, in <module>
    from . import (
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\site-packages\xformers\ops\fmha\triton_splitk.py", line 548, in <module>
    _get_splitk_kernel(num_groups)
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\site-packages\xformers\ops\fmha\triton_splitk.py", line 503, in _get_splitk_kernel
    _fwd_kernel_splitK_unrolled = unroll_varargs(_fwd_kernel_splitK, N=num_groups)
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\site-packages\xformers\triton\vararg_kernel.py", line 166, in unroll_varargs
    jitted_fn = triton.jit(fn)
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\site-packages\triton\runtime\jit.py", line 882, in jit
    return decorator(fn)
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\site-packages\triton\runtime\jit.py", line 871, in decorator
    return JITFunction(
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\site-packages\triton\runtime\jit.py", line 717, in __init__
    self.src = self.src[re.search(r"^def\s+\w+\s*\(", self.src, re.MULTILINE).start():]
AttributeError: 'NoneType' object has no attribute 'start'
[2025-01-15 20:35:14,444] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 39356) of binary: D:\ProgramData\anaconda3\envs\latentsync\python.exe
Traceback (most recent call last):
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\ProgramData\anaconda3\envs\latentsync\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\site-packages\torch\distributed\run.py", line 812, in main
    run(args)
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\site-packages\torch\distributed\run.py", line 803, in run
    elastic_launch(
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\site-packages\torch\distributed\launcher\api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\ProgramData\anaconda3\envs\latentsync\lib\site-packages\torch\distributed\launcher\api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts.train_unet FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-15_20:35:14
  host      : PC-222
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 39356)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@wangaocheng

Your xformers install doesn't match the expected versions. Check your Python version and your torch version.
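
For reference, here is a minimal sketch (in Python) for checking what is actually installed in the active latentsync environment. It reads package metadata instead of importing xformers, since the broken install crashes on import; torch, xformers, and triton are the standard PyPI package names.

```python
# Minimal sketch: print the interpreter and key package versions so they can be
# compared against what the xformers wheel was built for
# (PyTorch 2.2.2+cu121, Python 3.10.11, per the warning in the log above).
import sys
from importlib.metadata import PackageNotFoundError, version


def show(pkg: str) -> None:
    """Print an installed package's version, or note that it is missing."""
    try:
        print(f"{pkg:<8}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg:<8}: not installed")


print(f"{'python':<8}: {sys.version.split()[0]}")  # the xformers wheel expects 3.10.11
for pkg in ("torch", "xformers", "triton"):
    show(pkg)  # Triton does not publish official Windows wheels
```

If the versions disagree, reinstalling an xformers release built against the installed torch (see the compatibility notes in the xformers README linked in the warning) should clear the import error. Note also that the Triton failure in the traceback is expected on Windows, where Triton has no official support.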
