
Segmentation Fault (Core Dumped) When Running experiment.sh #10

Open · X-YuL opened this issue Jan 3, 2025 · 1 comment

X-YuL commented Jan 3, 2025

After installation, I encountered a Segmentation Fault (core dumped) error when running the training script via bash experiment.sh. Has anyone experienced a similar issue, or could anyone provide guidance on debugging it? Any help would be greatly appreciated. Thank you!

Observations:

  • Running experiment.py directly with python experiment.py works without issues.
  • Running bash test.sh also completes successfully.
  • The problem seems specific to experiment.sh (see the debugging sketch below).
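
One way to narrow this down is to surface a traceback at the moment of the crash and to compare what the two invocations actually see. This is a minimal sketch; that experiment.sh ultimately runs python on experiment.py is my assumption:

    # Get a Python-level traceback when the interpreter crashes:
    python -X faulthandler experiment.py

    # Check whether the bash invocation inherits a different environment
    # (e.g. LD_LIBRARY_PATH, XLA_FLAGS, CUDA paths) than the interactive
    # shell in which "python experiment.py" works:
    diff <(env | sort) <(bash -lc 'env | sort')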

System Information:

  • OS: Ubuntu 20.04
  • Python version: 3.11.4
  • CUDA version: 11.8

Error Messages:
The error appears in one of two ways:

  1. Detailed Error Log:
    2025-01-03 17:45:45.501487: F external/xla/xla/service/gpu/triton_autotuner.cc:624] Non-OK-status: has_executable.status() status: INTERNAL: ptxas exited with non-zero error code 139, output: : If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided. Failure occurred when compiling fusion triton_gemm_dot.1 with config '{block_m:64,block_n:32,block_k:32,split_k:16,num_stages:1,num_warps:4,num_ctas:1}'
    Fused HLO computation:
    %triton_gemm_dot.1_computation (parameter_0: f32[45,512], parameter_1: f32[512], parameter_2: f32[512,256]) -> f32[45,256] {
      %parameter_0 = f32[45,512]{1,0} parameter(0)
      %parameter_1 = f32[512]{0} parameter(1)
      %broadcast.190 = f32[45,512]{1,0} broadcast(f32[512]{0} %parameter_1), dimensions={1}, metadata={op_name="jit(action_selection)/jit(main)/jit(apply)/NeuralNetwork/Dense_0/add" source_file="/home/yulong/Github/RL-X/one_policy_to_run_them_all/one_policy_to_run_them_all/environments/multi_robot/cpu_gpu_testing.py" source_line=21}
      %add.48 = f32[45,512]{1,0} add(f32[45,512]{1,0} %parameter_0, f32[45,512]{1,0} %broadcast.190), metadata={op_name="jit(action_selection)/jit(main)/jit(apply)/NeuralNetwork/Dense_0/add" source_file="/home/yulong/Github/RL-X/one_policy_to_run_them_all/one_policy_to_run_them_all/environments/multi_robot/cpu_gpu_testing.py" source_line=21}
      %tanh.6 = f32[45,512]{1,0} tanh(f32[45,512]{1,0} %add.48), metadata={op_name="jit(action_selection)/jit(main)/jit(apply)/NeuralNetwork/tanh" source_file="/home/yulong/Github/RL-X/one_policy_to_run_them_all/one_policy_to_run_them_all/environments/multi_robot/cpu_gpu_testing.py" source_line=22}
      %parameter_2 = f32[512,256]{1,0} parameter(2)
      ROOT %dot.4 = f32[45,256]{1,0} dot(f32[45,512]{1,0} %tanh.6, f32[512,256]{1,0} %parameter_2), lhs_contracting_dims={1}, rhs_contracting_dims={0}, metadata={op_name="jit(action_selection)/jit(main)/jit(apply)/NeuralNetwork/Dense_1/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=None]" source_file="/home/yulong/Github/RL-X/one_policy_to_run_them_all/one_policy_to_run_them_all/environments/multi_robot/cpu_gpu_testing.py" source_line=23}
    }
    Fatal Python error: Aborted
    
  2. Simpler Error:
    Fatal Python error: Segmentation fault
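
Since the fatal error in scenario 1 originates in XLA's Triton GEMM autotuner (triton_autotuner.cc), and ptxas exit code 139 is itself a segmentation fault, one workaround worth trying is to bypass that code path. This is an untested sketch using standard XLA flags:

    # Untested workaround sketch: skip the Triton GEMM autotuning path in
    # which ptxas crashes.
    export XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
    bash experiment.sh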
    
    

Steps Taken:

  1. Followed the installation instructions step-by-step.
  2. Verified the Python and CUDA versions are consistent with the requirements.
  3. Confirmed the installed packages match those listed in requirements.txt (toolchain cross-checks below).
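
A quick way to cross-check the toolchain is shown below; these are standard CUDA and JAX commands, and the jaxlib build indicates which CUDA release the installed wheel targets:

    # Toolchain cross-checks (standard CUDA / JAX tooling):
    ptxas --version
    nvcc --version
    python -c "import jax, jaxlib; print(jax.__version__, jaxlib.__version__, jax.devices())"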
@nico-bohlinger (Owner) commented:

I think the issue could be your CUDA version. Could you try running it with CUDA 12.x?
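
For reference, a rough sketch of that switch with pip-installed wheels (the cuda12 extra follows JAX's published install instructions; exact version pins should match the project's requirements.txt):

    # Reinstall JAX with CUDA 12 wheels (pin versions to match the project):
    pip install -U "jax[cuda12]"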
