After installation, I encountered a Segmentation Fault (core dumped) error when running the training script via bash experiment.sh. Has anyone experienced a similar issue, or could anyone provide guidance on debugging it? Any help would be greatly appreciated. Thank you!
Observations:
Running experiment.py directly with python experiment.py works without issues.
Running bash test.sh also completes successfully.
The problem seems specific to experiment.sh.
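Since the crash only happens through experiment.sh and not when invoking python directly, one thing worth checking is whether the script runs with a different environment (e.g. a different CUDA/PATH setup). A minimal sketch of that comparison, assuming nothing about the script's contents:

```shell
# Capture the environment of the working direct invocation...
env | sort > direct_env.txt
# ...and the environment as seen inside a fresh bash, the way the script runs.
bash -c 'env | sort' > script_env.txt
# Any differing variables (PATH, LD_LIBRARY_PATH, CUDA_*, XLA_*) are
# candidate culprits; "|| true" because diff exits nonzero on differences.
diff direct_env.txt script_env.txt || true
```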
System Information:
OS: Ubuntu 20.04
Python version: 3.11.4
CUDA version: 11.8
Error Messages:
There are two scenarios for the error messages:
Detailed Error Log:
2025-01-03 17:45:45.501487: F external/xla/xla/service/gpu/triton_autotuner.cc:624] Non-OK-status: has_executable.status() status: INTERNAL: ptxas exited with non-zero error code 139, output: : If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided. Failure occurred when compiling fusion triton_gemm_dot.1 with config '{block_m:64,block_n:32,block_k:32,split_k:16,num_stages:1,num_warps:4,num_ctas:1}'
Fused HLO computation:
%triton_gemm_dot.1_computation (parameter_0: f32[45,512], parameter_1: f32[512], parameter_2: f32[512,256]) -> f32[45,256] {
%parameter_0 = f32[45,512]{1,0} parameter(0)
%parameter_1 = f32[512]{0} parameter(1)
%broadcast.190 = f32[45,512]{1,0} broadcast(f32[512]{0} %parameter_1), dimensions={1}, metadata={op_name="jit(action_selection)/jit(main)/jit(apply)/NeuralNetwork/Dense_0/add" source_file="/home/yulong/Github/RL-X/one_policy_to_run_them_all/one_policy_to_run_them_all/environments/multi_robot/cpu_gpu_testing.py" source_line=21}
%add.48 = f32[45,512]{1,0} add(f32[45,512]{1,0} %parameter_0, f32[45,512]{1,0} %broadcast.190), metadata={op_name="jit(action_selection)/jit(main)/jit(apply)/NeuralNetwork/Dense_0/add" source_file="/home/yulong/Github/RL-X/one_policy_to_run_them_all/one_policy_to_run_them_all/environments/multi_robot/cpu_gpu_testing.py" source_line=21}
%tanh.6 = f32[45,512]{1,0} tanh(f32[45,512]{1,0} %add.48), metadata={op_name="jit(action_selection)/jit(main)/jit(apply)/NeuralNetwork/tanh" source_file="/home/yulong/Github/RL-X/one_policy_to_run_them_all/one_policy_to_run_them_all/environments/multi_robot/cpu_gpu_testing.py" source_line=22}
%parameter_2 = f32[512,256]{1,0} parameter(2)
ROOT %dot.4 = f32[45,256]{1,0} dot(f32[45,512]{1,0} %tanh.6, f32[512,256]{1,0} %parameter_2), lhs_contracting_dims={1}, rhs_contracting_dims={0}, metadata={op_name="jit(action_selection)/jit(main)/jit(apply)/NeuralNetwork/Dense_1/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=None]" source_file="/home/yulong/Github/RL-X/one_policy_to_run_them_all/one_policy_to_run_them_all/environments/multi_robot/cpu_gpu_testing.py" source_line=23}
}
Fatal Python error: Aborted
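The detailed log shows ptxas itself exiting with code 139 (i.e. ptxas segfaulted) while XLA was autotuning the triton_gemm_dot.1 fusion. A common workaround sketch for this class of failure (an assumption on my part, not verified against this repo) is to disable XLA's Triton GEMM path so that fusion is never compiled:

```shell
# Disable the Triton GEMM rewriter in XLA so ptxas never sees the
# crashing fusion; this trades some GEMM performance for stability.
export XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
echo "XLA_FLAGS=$XLA_FLAGS"   # then re-run: bash experiment.sh
```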
Simpler Error:
Fatal Python error: Segmentation fault
Steps Taken:
Followed the installation instructions step-by-step.
Verified the Python and CUDA versions are consistent with the requirements.
Confirmed the installed packages match those listed in requirements.txt.
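On top of the steps above, it may help to make the crash itself diagnosable before the next run. A sketch (hypothetical, since I don't know this machine's ulimit policy) that enables core dumps and Python's built-in faulthandler so the segfault leaves a usable stack trace:

```shell
# Allow core files to be written; ignore failure on restricted systems.
ulimit -c unlimited 2>/dev/null || true
# Make Python dump a traceback of all threads on a fatal signal.
export PYTHONFAULTHANDLER=1
echo "PYTHONFAULTHANDLER=$PYTHONFAULTHANDLER"
# Re-run the failing script, then inspect the core file, e.g.:
#   gdb "$(command -v python)" core -ex bt
```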