Could you help me figure out why my experiment result shows rootba is slower than ceres and Schur complement #2

Open
varyshare opened this issue Aug 6, 2021 · 8 comments

Comments

@varyshare

I am reading your paper <Square Root Bundle Adjustment for Large-Scale Reconstruction, CVPR2021>. Your idea of using QR decomposition instead of the traditional Schur complement is awesome. I have run your source code, rootba. The result image is shown at the end of this issue. From the picture, we can see that QR-32 (single-precision QR in rootba) is slower than ceres and the Schur complement. I am puzzled by this. Could you help me figure it out?

#!/usr/bin/env bash

MY_EXAM_DATA_FOLDER="./rootba_testing_data_thread16"
declare -a my_exames=("qr32" "qr64" "sc64" "sc32" "ceres")
for i in "${my_exames[@]}"
do
    mkdir -p $MY_EXAM_DATA_FOLDER/$i
done

DATA_ROOT_PATH=/home/shaoping/readcode/rootba/data
./bin/bal -C $MY_EXAM_DATA_FOLDER/qr32/ --num-threads 0 --no-debug --no-use-double --use-householder-marginalization --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"
./bin/bal -C $MY_EXAM_DATA_FOLDER/qr64/ --num-threads 0 --no-debug --use-double --use-householder-marginalization --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"
./bin/bal -C $MY_EXAM_DATA_FOLDER/sc64/ --num-threads 0 --no-debug --solver-type SCHUR_COMPLEMENT  --use-double  --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"
./bin/bal -C $MY_EXAM_DATA_FOLDER/sc32/ --num-threads 0 --no-debug --solver-type SCHUR_COMPLEMENT --no-use-double  --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"
./bin/bal -C $MY_EXAM_DATA_FOLDER/ceres/ --num-threads 0 --no-debug --solver-type CERES --use-double  --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"

./scripts/plot-logs.py $MY_EXAM_DATA_FOLDER

[Image: result plot]

@NikolausDemmel
Owner

Thanks for your interest. I'll have a look shortly.

@NikolausDemmel
Owner

In our experience, the relative performance of the different methods can depend a lot on the hardware and the number of CPU cores. One aspect is that, in our experiments, rootba seems to take better advantage of parallelization. Also, which method is faster depends a lot on the actual problem. We are not claiming that rootba has better runtime in all situations.

That being said, I've tried your script with the current master and compiled with default settings on two machines, and this is what I get.

2013 Macbook (i7 with 8 virtual cores):

[Image: macbook]

Ubuntu 18.04 Desktop (Xeon W-2133 with 12 virtual cores):

[Image: linux]

I'm not sure why you see something qualitatively very different. What hardware are you running on?

Two thoughts:

  • Are you actually running multithreaded with multiple cores?
  • Are you using OpenBLAS? Maybe try exporting OPENBLAS_NUM_THREADS=1 in your shell before running rootba to make sure the multithreading in OpenBLAS is not interfering with the use of TBB in rootba. (See also the note about OpenBLAS in the readme, which has pointers explaining this in more detail.)
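
The OpenBLAS suggestion could be tried along the following lines. This is only a sketch, assuming a bash shell; the commented-out benchmark call simply reuses one invocation from the script in the first post.

```bash
#!/usr/bin/env bash
# Limit OpenBLAS to a single thread so that its own thread pool does not
# compete with the TBB worker threads used by rootba.
export OPENBLAS_NUM_THREADS=1

# `env` is a child process, so this shows that the exported value is
# inherited by programs started from this shell.
env | grep '^OPENBLAS_NUM_THREADS='

# Then rerun the benchmark from the same shell, for example:
# ./bin/bal -C "$MY_EXAM_DATA_FOLDER/qr32/" --num-threads 0 --no-debug \
#     --no-use-double --use-householder-marginalization \
#     --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"
```

The variable has to be exported in the same shell session (or at the top of the benchmark script) that launches the binaries; setting it in another terminal has no effect.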

@varyshare
Author

Thank you for helping me! I will try your suggestions.
If we run in thread=1, will the experiment result be similar to multithreading?

@NikolausDemmel
Owner

If we run in thread=1, will the experiment result be similar to multithreading?

No, I expect a different outcome with a different number of threads. Note that OPENBLAS_NUM_THREADS=1 is unrelated to the number of threads you configure for ceres and rootba. That is controlled with the --num-threads command line argument (or the corresponding config entry). But in your script you are already setting it to 0, meaning it should use the number of hardware threads. Maybe the detection of the number of hardware threads is faulty. You can try passing an explicit value. For example, try --num-threads 8 if you have a CPU with 8 (virtual) cores.
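
One way to rule out faulty thread detection is to query the number of hardware threads yourself and pass it explicitly. A minimal sketch, assuming a system where `getconf _NPROCESSORS_ONLN` is available (Linux and macOS both provide it); the variable name `NUM_THREADS` is made up for illustration.

```bash
#!/usr/bin/env bash
# Ask the OS how many logical processors are currently online.
NUM_THREADS="$(getconf _NPROCESSORS_ONLN)"
echo "logical processors online: ${NUM_THREADS}"

# Pass the value explicitly instead of relying on --num-threads 0,
# keeping the other options as in the original script, for example:
# ./bin/bal -C "$MY_EXAM_DATA_FOLDER/qr32/" --num-threads "${NUM_THREADS}" --no-debug \
#     --no-use-double --use-householder-marginalization \
#     --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"
```

Watching CPU usage in a tool such as `top` while the benchmark runs is a quick additional check that more than one core is actually busy.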

@varyshare
Author

varyshare commented Aug 7, 2021

Hello,
I checked my running environment. My processor is an Intel® Core™ i7-10700 CPU @ 2.90GHz × 16, so it has 16 logical cores.
I have not installed OpenBLAS. After setting --num-threads 8, ceres is still faster than rootba (both QR and SC). I will try another machine tomorrow. My guess is that TBB may not be able to use multiple threads on my machine.
Thank you again.

@NikolausDemmel
Owner

That's a bit strange. Yeah, maybe it is an issue with TBB. Your ceres runtime is similar to my Linux box, but the others are much slower, which is very surprising if it does indeed use multi-threading. Ceres does not use TBB in our configuration AFAIK, so it could make sense.

Maybe you can have a look yourself, but otherwise, you could post here your OS and maybe the full output of a fresh ./scripts/build-external.sh and (after deleting the build folder) ./scripts/build-rootba.sh plus the full command line output of your script. Maybe there is something in the logs that looks odd.

If you are using Ubuntu, you can double check which BLAS is configured (just in case openblas got installed as a dependency of something):

update-alternatives --get-selections | grep "blas\|lapack"

@DengueTim

Hi, I've been playing with this on a Macbook Air M1 with 8 and 4 threads. Using the ./bin/bal executable produces the expected results. However, if I run the individual ./bin/bal_sc and ./bin/bal_qr executables, the accumulated total_time doesn't show differences as pronounced: 0.312s, 0.522s and 0.344s for qr32, qr64 and sc64 respectively. Also, the total_time values are about 50 times smaller than the times from ./bin/bal. The error looks the same. Why the big difference in runtime?

@NikolausDemmel
Owner

That's very curious. Are you sure you have built all the binaries with the same configuration? Beware that by default ROOTBA_DEVELOPER_MODE is ON, which means even if you have different build folders (e.g. for debug or release), all binaries end up in the same bin folder.

Can you try wiping the bin and build folders and recompiling all binaries? If you still see a difference, another thing to confirm is that you are using the same config in all cases. Could you please paste the full command line call and output for all 3 runs?
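
A clean rebuild as suggested here might look like the sketch below. It assumes that the `bin` and `build` folders sit directly in the root of the rootba checkout and that `./scripts/build-rootba.sh` (mentioned earlier in this thread) is run from that root; the helper name `clean_rebuild` is hypothetical.

```bash
#!/usr/bin/env bash
# clean_rebuild ROOT: delete ROOT/bin and ROOT/build, then rebuild if the
# build script is present. Folder names are assumptions, see the note above.
clean_rebuild() {
    local root="${1:?usage: clean_rebuild <path to rootba checkout>}"

    # Remove stale binaries and build artifacts so that no executable from
    # an older configuration can be picked up by accident.
    rm -rf -- "$root/bin" "$root/build"

    if [ -x "$root/scripts/build-rootba.sh" ]; then
        # Run the build script from the checkout root in a subshell.
        (cd "$root" && ./scripts/build-rootba.sh)
    else
        echo "no scripts/build-rootba.sh under $root; nothing rebuilt" >&2
    fi
}
```

Calling `clean_rebuild .` from the repository root and then rerunning all three executables with identical options should make the timings comparable.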
