Training #19

Open
shqmffl486 opened this issue Jul 7, 2023 · 3 comments


shqmffl486 commented Jul 7, 2023

How do I train with the sdf_hand_mini and sdf_obj_mini data that you uploaded?
I think training fails because it looks for a .npz file that does not exist in the mini version I am using.

(alignsdf) MS-7B23:~/mount4t/AlignSDF$ CUDA_VISIBLE_DEVICES=0 bash dist_train.sh 4 6666 -e experiments/obman/30k_1e2d_mlp5.json
do not support renderer in this machine
DeepSdf - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
DeepSdf - INFO - Training in distributed mode, 1 GPU per process. Process 0, total 1.
DeepSdf - INFO - Experiment description:
3D hand reconstruction on the mini obman dataset.
Hand branch: True
Object branch: True
Mano branch: False
Depth branch: False
Classifier Weight: 0
Penetration Loss: False
Penetration Loss Weight: 0
Additional Loss start at epoch: 1201
Contact Loss: False
Contact Loss Weight: 0
Contact Loss Sigma (m): 0.005
Independent Obj Scale: False
Ignore other: False
nb_label_class: 6
Image encoder, the branch has latent size 256
DeepSdf - INFO - Finish constructing the dataset
DeepSdf - INFO - start_epoch:1, current_rank:0
DeepSdf - INFO - epoch:1, current_rank:0
Traceback (most recent call last):
File "train.py", line 715, in <module>
main_function(exp_cfg, args.continue_from, args.local_rank, args.opt_level, args.slurm)
File "train.py", line 465, in main_function
for i, (input_iter, label_iter, meta_iter) in enumerate(sdf_loader):
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
data = self._next_data()
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
return self._process_data(data)
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
data.reraise()
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
raise self.exc_type(msg)
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
data = fetcher.fetch(index)
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/gaeun/mount4t/AlignSDF/utils/data.py", line 162, in __getitem__
hand_samples, hand_labels = unpack_sdf_samples(self.data_source, data_key, num_sample, hand=True, clamp=self.clamp, filter_dist=self.filter_dist)
File "/home/gaeun/mount4t/AlignSDF/utils/sdf_utils.py", line 172, in unpack_sdf_samples
npz = np.load(npz_path)
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/numpy/lib/npyio.py", line 405, in load
fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'data/obman/train/sdf_hand/00018168.npz'

Killing subprocess 12576
Traceback (most recent call last):
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/gaeun/anaconda3/envs/alignsdf/bin/python', '-u', 'train.py', '--local_rank=0', '-e', 'experiments/obman/30k_1e2d_mlp5.json']' returned non-zero exit status 1.
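
The traceback above ends in a FileNotFoundError for 'data/obman/train/sdf_hand/00018168.npz', which suggests the loader was asked for a sample that is not on disk. A minimal sketch for checking this ahead of training follows. The helper name find_missing_samples is hypothetical and not part of AlignSDF, and how the sample IDs are obtained from the split file is left open because the thread does not show that file's format.

```python
from pathlib import Path


def find_missing_samples(sample_ids, sdf_dir):
    """Return the IDs from sample_ids that have no <id>.npz file in sdf_dir.

    Hypothetical helper for illustration; it is not part of AlignSDF.
    """
    sdf_dir = Path(sdf_dir)
    return [sid for sid in sample_ids if not (sdf_dir / f"{sid}.npz").is_file()]


if __name__ == "__main__":
    # The ID and directory are taken from the error message in the log.
    missing = find_missing_samples(["00018168"], "data/obman/train/sdf_hand")
    print(f"{len(missing)} sample(s) missing:", missing)
```

If the list is not empty, the split in use refers to samples that the downloaded data does not contain.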

zerchen (Owner) commented Jul 8, 2023

Hi,

I created this split only so that students can run experiments with limited computing resources. I did not run any experiments with sdf_hand_mini and sdf_obj_mini myself.
To use this split, you need to generate a new json file like this and reference it in your config file (https://github.com/zerchen/AlignSDF/blob/master/experiments/obman/30k_1e2d_mlp5.json).
Hope it helps.
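
One way to generate such a json file is to list the samples that are actually present in the mini folders. The sketch below is only an illustration under stated assumptions: the function build_mini_split is hypothetical, the output is assumed to be a flat JSON list of sample IDs, and the folder locations in the usage lines are guesses. The real layout should be copied from the split file the existing config points to.

```python
import json
from pathlib import Path


def build_mini_split(hand_dir, obj_dir, out_path):
    """Write a JSON list of sample IDs that have both a hand and an object .npz.

    Hypothetical helper; the flat-list output format is an assumption and may
    differ from the split format AlignSDF expects.
    """
    hand_ids = {p.stem for p in Path(hand_dir).glob("*.npz")}
    obj_ids = {p.stem for p in Path(obj_dir).glob("*.npz")}
    ids = sorted(hand_ids & obj_ids)  # keep only samples present in both folders
    with open(out_path, "w") as f:
        json.dump(ids, f, indent=2)
    return ids


if __name__ == "__main__":
    # Assumed locations of the mini folders; adjust to where they were unpacked.
    ids = build_mini_split(
        "data/obman/train/sdf_hand_mini",
        "data/obman/train/sdf_obj_mini",
        "obman_mini_split.json",
    )
    print(f"wrote {len(ids)} sample IDs")
```

Taking the intersection of the two folders avoids listing a sample whose hand file exists but whose object file does not.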

shqmffl486 (Author) commented Jul 12, 2023

Thank you for your reply.
I set this up and training ran, but an error occurs in the final step, where the meshes are generated.
What do you think the problem is?
The Eval_obman output and several files inside it were created, but they have no content.

DeepSdf - INFO - time used: 85.93944382667542
DeepSdf - INFO - save at 100
DeepSdf - INFO - Distributing BatchNorm running means and vars
Traceback (most recent call last):
File "train.py", line 715, in <module>
main_function(exp_cfg, args.continue_from, args.local_rank, args.opt_level, args.slurm)
File "train.py", line 669, in main_function
reconstruct(encoderDecoder, specs, split_filename, output_path, start_point=start_points[local_rank], end_point=end_points[local_rank], task=task, device=device, cube_dim=128, label_out=use_optim_mano, eval_mode=use_eval_mode)
File "/home/gaeun/mount4t/AlignSDF/reconstruct.py", line 95, in reconstruct
utils.mesh.create_mesh_combined_decoder(hand_branch, obj_branch, cls_branch, loaded_model.module.decoder, latent, mano_results, obj_results, cam_intr, specs, mesh_filename, N=cube_dim, max_batch=int(2 ** 18), scale=scale, device=device, label_out=label_out, viz=viz, eval_mode=eval_mode, task=task)
File "/home/gaeun/mount4t/AlignSDF/utils/mesh.py", line 157, in create_mesh_combined_decoder
out_labels[head: min(head + max_batch, num_out_vertices)] = predicted_class.argmax(dim=1).detach().cpu()
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
Killing subprocess 19631
Traceback (most recent call last):
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/gaeun/anaconda3/envs/alignsdf/bin/python', '-u', 'train.py', '--local_rank=0', '-e', 'experiments/obman/30k_1e2d_mlp5.json']' returned non-zero exit status 1.

shqmffl486 (Author) commented

I think this line is the problem (mesh.py, line 157):
out_labels[head: min(head + max_batch, num_out_vertices)] = predicted_class.argmax(dim=1).detach().cpu()
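
The IndexError message "expected to be in range of [-1, 0], but got 1" is what PyTorch raises when argmax(dim=1) is called on a tensor that has fewer than two dimensions, so predicted_class apparently was not an (N, C) score tensor at that point. The log does not show why. The sketch below reproduces the error and adds a shape check; labels_from_logits is a hypothetical helper for illustration, not part of AlignSDF and not a confirmed fix.

```python
import torch


def labels_from_logits(predicted_class: torch.Tensor) -> torch.Tensor:
    """Return per-point class indices from an (N, C) score tensor.

    Hypothetical helper; it fails early with the offending shape instead of
    the less informative IndexError.
    """
    if predicted_class.dim() != 2:
        raise ValueError(
            f"expected class scores of shape (N, C), got {tuple(predicted_class.shape)}"
        )
    return predicted_class.argmax(dim=1).detach().cpu()


if __name__ == "__main__":
    scores = torch.zeros(4, 6)  # 4 points, 6 classes (nb_label_class: 6 in the log)
    print(labels_from_logits(scores).shape)

    flat = torch.zeros(6)  # a 1-D tensor has no dim=1
    try:
        flat.argmax(dim=1)
    except IndexError as err:
        print("IndexError:", err)
```

Printing predicted_class.shape just before line 157 of mesh.py would show which case applies in the actual run.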
