After installation, I encountered a Segmentation Fault (core dumped) error when running the training script via bash experiment.sh. Has anyone experienced a similar issue, or could anyone provide guidance on debugging it? Any help would be greatly appreciated. Thank you!
Observations:
Running experiment.py directly with python experiment.py works without issues.
Running bash test.sh also completes successfully.
The problem seems specific to experiment.sh.
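Since the crash only happens through experiment.sh and not when invoking python directly, one thing worth checking is whether the script runs with a different environment (e.g. a different CUDA/PATH setup). A minimal sketch of that comparison, assuming nothing about the script's contents:

```shell
# Capture the environment of the working direct invocation...
env | sort > direct_env.txt
# ...and the environment as seen inside a fresh bash, the way the script runs.
bash -c 'env | sort' > script_env.txt
# Any differing variables (PATH, LD_LIBRARY_PATH, CUDA_*, XLA_*) are
# candidate culprits; "|| true" because diff exits nonzero on differences.
diff direct_env.txt script_env.txt || true
```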
System Information:
OS: Ubuntu 20.04
Python version: 3.11.4
CUDA version: 11.8
Error Messages:
There are two scenarios for the error messages:
Detailed Error Log:
2025-01-03 17:45:45.501487: F external/xla/xla/service/gpu/triton_autotuner.cc:624] Non-OK-status: has_executable.status() status: INTERNAL: ptxas exited with non-zero error code 139, output: : If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided. Failure occurred when compiling fusion triton_gemm_dot.1 with config '{block_m:64,block_n:32,block_k:32,split_k:16,num_stages:1,num_warps:4,num_ctas:1}'
Fused HLO computation:
%triton_gemm_dot.1_computation (parameter_0: f32[45,512], parameter_1: f32[512], parameter_2: f32[512,256]) -> f32[45,256] {
%parameter_0 = f32[45,512]{1,0} parameter(0)
%parameter_1 = f32[512]{0} parameter(1)
%broadcast.190 = f32[45,512]{1,0} broadcast(f32[512]{0} %parameter_1), dimensions={1}, metadata={op_name="jit(action_selection)/jit(main)/jit(apply)/NeuralNetwork/Dense_0/add" source_file="/home/yulong/Github/RL-X/one_policy_to_run_them_all/one_policy_to_run_them_all/environments/multi_robot/cpu_gpu_testing.py" source_line=21}
%add.48 = f32[45,512]{1,0} add(f32[45,512]{1,0} %parameter_0, f32[45,512]{1,0} %broadcast.190), metadata={op_name="jit(action_selection)/jit(main)/jit(apply)/NeuralNetwork/Dense_0/add" source_file="/home/yulong/Github/RL-X/one_policy_to_run_them_all/one_policy_to_run_them_all/environments/multi_robot/cpu_gpu_testing.py" source_line=21}
%tanh.6 = f32[45,512]{1,0} tanh(f32[45,512]{1,0} %add.48), metadata={op_name="jit(action_selection)/jit(main)/jit(apply)/NeuralNetwork/tanh" source_file="/home/yulong/Github/RL-X/one_policy_to_run_them_all/one_policy_to_run_them_all/environments/multi_robot/cpu_gpu_testing.py" source_line=22}
%parameter_2 = f32[512,256]{1,0} parameter(2)
ROOT %dot.4 = f32[45,256]{1,0} dot(f32[45,512]{1,0} %tanh.6, f32[512,256]{1,0} %parameter_2), lhs_contracting_dims={1}, rhs_contracting_dims={0}, metadata={op_name="jit(action_selection)/jit(main)/jit(apply)/NeuralNetwork/Dense_1/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=None]" source_file="/home/yulong/Github/RL-X/one_policy_to_run_them_all/one_policy_to_run_them_all/environments/multi_robot/cpu_gpu_testing.py" source_line=23}
}
Fatal Python error: Aborted
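The detailed log shows ptxas itself exiting with code 139 (i.e. ptxas segfaulted) while XLA was autotuning the triton_gemm_dot.1 fusion. A common workaround sketch for this class of failure (an assumption on my part, not verified against this repo) is to disable XLA's Triton GEMM path so that fusion is never compiled:

```shell
# Disable the Triton GEMM rewriter in XLA so ptxas never sees the
# crashing fusion; this trades some GEMM performance for stability.
export XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
echo "XLA_FLAGS=$XLA_FLAGS"   # then re-run: bash experiment.sh
```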
Simpler Error:
Fatal Python error: Segmentation fault
Steps Taken:
Followed the installation instructions step-by-step.
Verified the Python and CUDA versions are consistent with the requirements.
Confirmed the installed packages match those listed in requirements.txt.
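On top of the steps above, it may help to make the crash itself diagnosable before the next run. A sketch (hypothetical, since I don't know this machine's ulimit policy) that enables core dumps and Python's built-in faulthandler so the segfault leaves a usable stack trace:

```shell
# Allow core files to be written; ignore failure on restricted systems.
ulimit -c unlimited 2>/dev/null || true
# Make Python dump a traceback of all threads on a fatal signal.
export PYTHONFAULTHANDLER=1
echo "PYTHONFAULTHANDLER=$PYTHONFAULTHANDLER"
# Re-run the failing script, then inspect the core file, e.g.:
#   gdb "$(command -v python)" core -ex bt
```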