Difference between results from inference and the paper #10

Open · unoShin opened this issue Aug 25, 2021 · 25 comments
Labels: bug (Something isn't working), enhancement (New feature or request)

@unoShin commented Aug 25, 2021

First, thanks for your great work.

I trained the model using the script 'python train.py --gpu 0-1 --backbone LPSKI' with the Human3.6M and MPII datasets.
The protocol is 1 and the number of training epochs was 25.

I tested the model with test.py, and the result is below:
Protocol 1 error (PA MPJPE) >> tot: 42.72
Directions: 37.63 Discussion: 39.01 Eating: 45.51 Greeting: 43.06 Phoning: 41.33 Posing: 41.10 Purchases: 35.78 Sitting: 43.50 SittingDown: 57.36 Smoking: 47.08 Photo: 51.04 Waiting: 38.32 Walking: 30.94 WalkDog: 46.21 WalkTogether: 38.39

The average MPJPE for Protocol 1 in the paper is 35.2, which differs from my result.
Did I miss something needed to reproduce it, such as other settings in config.py?

Also, my training time was 16 hours on an RTX 2080, while the training time in the paper is 3 days on 2 RTX Titans.
So I also wonder what causes the time difference between my run and the paper's.
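
(For reference, PA MPJPE is the Procrustes-aligned MPJPE: the prediction is aligned to the ground truth with an optimal similarity transform before the per-joint error is averaged. Below is a minimal numpy sketch with hypothetical (J, 3) joint arrays in millimeters; this is not the repo's evaluation code.)

import numpy as np

def pa_mpjpe(pred, gt):
    # Center both point sets at the origin.
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g

    # Optimal rotation + scale via SVD (similarity Procrustes).
    U, S, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()

    # Align prediction to ground truth, then average per-joint error.
    aligned = scale * p @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()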

@SangbumChoi (Owner)

Did you check that config.py is initially set to the extra-small model? (I reported three different model types: small, large, and extra small.) The 16-hour training time also suggests you used the extra-small model. Please let me know if you have further questions.

@unoShin (Author) commented Aug 25, 2021

Did you mean changing embedding_size in config.py to switch the model type?
It is 2048, and I checked that the large model uses 2048 embedding channels in the paper.

The other settings in config.py are as below:

import os
import os.path as osp

class Config:

    ## dataset
    # training set
    # 3D: Human36M, MuCo
    # 2D: MSCOCO, MPII
    trainset_3d = ['Human36M']
    trainset_2d = ['MPII']

    # testing set
    # Human36M, MuPoTS, MSCOCO
    testset = 'Human36M'

    ## directory
    cur_dir = osp.dirname(os.path.abspath(__file__))
    root_dir = osp.join(cur_dir, '..')
    data_dir = osp.join(root_dir, 'data')
    output_dir = osp.join(root_dir, 'output')
    model_dir = osp.join(output_dir, 'model_dump')
    pretrain_dir = osp.join(output_dir, 'pre_train')
    vis_dir = osp.join(output_dir, 'vis')
    log_dir = osp.join(output_dir, 'log')
    result_dir = osp.join(output_dir, 'result')

    ## input, output
    input_shape = (256, 256)
    output_shape = (input_shape[0]//8, input_shape[1]//8)
    width_multiplier = 1.0
    depth_dim = 32
    bbox_3d_shape = (2000, 2000, 2000) # depth, height, width
    pixel_mean = (0.485, 0.456, 0.406)
    pixel_std = (0.229, 0.224, 0.225)

    ## training config
    embedding_size = 2048
    lr_dec_epoch = [17, 21]
    end_epoch = 25
    lr = 1e-3
    lr_dec_factor = 10
    batch_size = 64

    ## testing config
    test_batch_size = 1
    flip_test = True
    use_gt_info = True

    ## others
    num_thread = 20
    gpu_ids = '0'
    num_gpus = 1
    continue_train = False

@SangbumChoi (Owner)

No, you should change depth_dim from 32 to 64, and if an error shows up, adjust output_shape as well.
The correct settings should be:

input_shape = (256, 256) 
output_shape = (input_shape[0]//4, input_shape[1]//4)
width_multiplier = 1.0
depth_dim = 64
bbox_3d_shape = (2000, 2000, 2000) # depth, height, width
pixel_mean = (0.485, 0.456, 0.406)
pixel_std = (0.229, 0.224, 0.225)
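
(For context on why these two values move together: the soft-argmax head reshapes the network output into (batch, joint_num, depth_dim * H * W), as in model.py's reshape quoted further down, so the backbone's final channel count and output_shape must stay consistent with depth_dim. A rough sanity check, with an illustrative joint_num:)

# Illustrative consistency check; joint_num here is hypothetical.
joint_num = 18
depth_dim = 64
input_shape = (256, 256)
output_shape = (input_shape[0] // 4, input_shape[1] // 4)

# The final layer must emit joint_num * depth_dim channels, and
# soft_argmax flattens each joint's heatmap to depth_dim * H * W values;
# if config and model disagree, you get a size-mismatch RuntimeError
# like the one in the next comment.
head_channels = joint_num * depth_dim
flat_heatmap = depth_dim * output_shape[0] * output_shape[1]
print(head_channels, flat_heatmap)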

@unoShin (Author) commented Aug 25, 2021

That actually produced an error like this:

File "/home/unolab/Yoonho/Pose/MobileHumanPose/main/model.py", line 67, in forward
loss_coord = torch.abs(coord - target_coord) * target_vis
RuntimeError: The size of tensor a (8) must match the size of tensor b (64) at non-singleton dimension 0

I changed the output shape to try to solve the error:
output_shape = (input_shape[0]//(8*math.sqrt(2)), input_shape[1]//(8*math.sqrt(2)))

And it produced another error:
File "/home/unolab/Yoonho/Pose/MobileHumanPose/main/model.py", line 29, in soft_argmax
heatmaps = heatmaps.reshape((-1, joint_num, cfg.depth_dim*cfg.output_shape[0]*cfg.output_shape[1]))
TypeError: reshape(): argument 'shape' must be tuple of ints, but found element of type float at pos 3
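
(The TypeError above is expected: floor division by a float operand, such as 8*math.sqrt(2), returns a float, and reshape() only accepts integers. A minimal illustration:)

import math

# Floor division with a float divisor yields a float, not an int.
print(256 // (8 * math.sqrt(2)))       # 22.0 -> rejected by reshape()
print(int(256 // (8 * math.sqrt(2))))  # 22   -> int() would silence the
                                       # TypeError, but not fix the shapes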

@SangbumChoi added the enhancement (New feature or request) label on Aug 25, 2021
@SangbumChoi (Owner)

@unoShin I found that there was a slight mismatch in the large model, so I uploaded a fix to the large branch for the skip-concat case. Please let me know if this doesn't work.

@unoShin (Author) commented Aug 25, 2021

@SangbumChoi Thank you :)

@unoShin (Author) commented Aug 26, 2021

I confirmed it works for training; it will take about 50 hours on an RTX 2080.
Thank you for your support!

@SangbumChoi (Owner)

@unoShin Sounds great. Please close this issue once the score is similar to the paper's (otherwise I will shortly) :)

@unoShin (Author) commented Aug 30, 2021

Protocol 1 error (PA MPJPE) >> tot: 40.13
Directions: 34.42 Discussion: 35.80 Eating: 44.37 Greeting: 41.69 Phoning: 38.69 Posing: 37.26 Purchases: 34.61 Sitting: 42.46 SittingDown: 54.30 Smoking: 42.58 Photo: 48.37 Waiting: 35.72 Walking: 29.49 WalkDog: 43.04 WalkTogether: 35.15

I trained the model with the new branch (large, 25 epochs) and got the result above.
There is still a gap between 40.13 and the paper's 35.2 MPJPE.

config.py:
trainset_3d = ['Human36M']
trainset_2d = ['MPII']

# testing set
# Human36M, MuPoTS, MSCOCO
testset = 'Human36M'

## directory
cur_dir = osp.dirname(os.path.abspath(__file__))
root_dir = osp.join(cur_dir, '..')
data_dir = osp.join(root_dir, 'data')
output_dir = osp.join(root_dir, 'output')
model_dir = osp.join(output_dir, 'model_dump')
pretrain_dir = osp.join(output_dir, 'pre_train')
vis_dir = osp.join(output_dir, 'vis')
log_dir = osp.join(output_dir, 'log')
result_dir = osp.join(output_dir, 'result')

## input, output
input_shape = (256, 256) 
output_shape = (input_shape[0]//4, input_shape[1]//4)
width_multiplier = 1.0
depth_dim = 64
bbox_3d_shape = (2000, 2000, 2000) # depth, height, width
pixel_mean = (0.485, 0.456, 0.406)
pixel_std = (0.229, 0.224, 0.225)

## training config
embedding_size = 2048
lr_dec_epoch = [17, 21]
end_epoch = 25
lr = 1e-3
lr_dec_factor = 10
batch_size = 16

## testing config
test_batch_size = 16
flip_test = True
use_gt_info = True

## others
num_thread = 20
gpu_ids = '0'
num_gpus = 1
continue_train = False

The protocol is 1, and the bbox root file is from Subject 11 (trained on Subjects 1, 5, 6, 7, 8, 9).
Did I do something wrong to get this result?

@SangbumChoi (Owner)

> I trained the model with the new branch (large, 25 epochs) and got the result above. There is still a gap between 40.13 and the paper's 35.2 MPJPE. [...] Did I do something wrong to get this result?

Sorry for the inconvenience. Foolishly, I committed every intermediate step of progress to GitHub, so I just need to find the right past commit. I will find the appropriate large-model code for everyone.
One small thing that might matter is the batch_size, which depends on individual GPU circumstances.

One thing you can check right now is whether the extra-small model scores the same as in the paper.

I will let you know when I find it.

Thanks

@unoShin (Author) commented Aug 30, 2021

@SangbumChoi I will try it and let you know. Thank you.

@unoShin (Author) commented Aug 30, 2021

@SangbumChoi Is the only difference in line 130 of lpnet_ski_concat.py?

@unoShin (Author) commented Sep 2, 2021

09-02 09:52:46 Protocol 1 error (PA MPJPE) >> tot: 40.21
Directions: 36.56 Discussion: 37.00 Eating: 42.51 Greeting: 41.39 Phoning: 38.17 Posing: 36.55 Purchases: 36.60 Sitting: 42.26 SittingDown: 55.09 Smoking: 41.85 Photo: 48.03 Waiting: 36.51 Walking: 29.55 WalkDog: 43.62 WalkTogether: 35.50

Using that commit, the result still differs somewhat from the paper's.

@SangbumChoi (Owner)

> 09-02 09:52:46 Protocol 1 error (PA MPJPE) >> tot: 40.21 [...] Using that commit, the result still differs somewhat from the paper's.

@unoShin Hi, I have two questions for you:

  1. Did you use exactly the same commit/branch that I pointed you to?
  2. What was your batch size, and which 2D dataset did you use alongside Human3.6M?

If both answers seem reasonable, I will re-train my code and announce the result. It might take more than one week.

@unoShin (Author) commented Sep 2, 2021

@SangbumChoi
Hi,

  1. Yes, I used this version: 70baeaf
    That is why I asked whether the only difference between the large branch and 70baeaf is line 130 of lpnet_ski_concat.py.

  2. The batch size is 8 and the 2D dataset is MPII.
    My training time is 2.21 hours/epoch on an RTX 2080, and the total number of epochs is 25.

Thanks!

@SangbumChoi (Owner)

@unoShin I'm a little concerned that your batch size differs from the original paper and code, but let me re-check and share the result with you. Again, this might take some time.
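
(One common heuristic when GPU memory forces a smaller batch size, though nothing in the paper or repo prescribes it, is to scale the learning rate linearly with the batch size. A sketch using the config values above:)

# Hypothetical linear LR scaling; not something the paper specifies.
base_lr, base_batch = 1e-3, 64  # values from config.py above
my_batch = 8                    # the batch size actually used here
scaled_lr = base_lr * my_batch / base_batch
print(scaled_lr)                # 1.25e-04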

@unoShin (Author) commented Sep 2, 2021

@SangbumChoi Thanks for your support :)

@ggfresh commented Sep 9, 2021

Hi, I trained with Human3.6M and MPII but got an error of 400+, even though the visualized 2D output doesn't look bad. I did not build a bbox root file and used the GT bbox. I want to figure out why I get such a large error. How do you generate the bbox root file?

@SangbumChoi (Owner)

@ggfresh An error of more than 400 is likely caused by the old branch (see this issue). Also, if the image files are already cropped, you actually don't have to build GT and root bboxes. However, you can generate a bbox root file with an object detector or RootNet (https://github.com/mks0601/3DMPPE_ROOTNET_RELEASE).
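
(If GT 2D joints are available, one simple way to build a bbox for such a file is to take the joint extent plus a margin. The sketch below is hypothetical; the exact file format this repo's loaders expect may differ.)

import numpy as np

def bbox_from_joints(joints_2d, margin=1.25):
    # joints_2d: hypothetical (J, 2) array of pixel coordinates.
    x_min, y_min = joints_2d.min(axis=0)
    x_max, y_max = joints_2d.max(axis=0)
    w, h = x_max - x_min, y_max - y_min
    # Expand around the center so limbs are not cropped at the edges.
    cx, cy = x_min + w / 2.0, y_min + h / 2.0
    w, h = w * margin, h * margin
    return [cx - w / 2.0, cy - h / 2.0, w, h]  # (x, y, w, h)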

@ggfresh commented Sep 9, 2021

> @ggfresh An error of more than 400 is likely caused by the old branch (see this issue). [...]

Thanks for the reply. Which issue?

@ggfresh commented Sep 9, 2021

> @SangbumChoi When I test epoch 24, the error is big. [screenshot]

I have the same problem, though I think my training loss is normal.
[screenshot: train_log_tmp]

I tested with epochs 24-24.

[screenshot: train_result]

@SangbumChoi (Owner)

@ggfresh That is very odd, since even this open issue reports errors of only around 40 MPJPE. Your description lacks the information needed to debug and find the error. Since you say training seems normal, you might want to actually visualize the output jpg files.

@ggfresh commented Sep 9, 2021

Sorry, after checking, I found that my own data was inconsistent.

@junhee98 commented Jan 24, 2023

> @unoShin I found that there was a slight mismatch in the large model, so I uploaded a fix to the large branch for the skip-concat case. Please let me know if this doesn't work.

Which large branch was uploaded for the skip-concat case? I couldn't find it in the code.
