Could you help me figure out why my experiment results show rootba is slower than Ceres and Schur complement? #2

varyshare · 2021-08-06T03:24:09Z

I am reading your paper "Square Root Bundle Adjustment for Large-Scale Reconstruction" (CVPR 2021). Your idea of using QR decomposition instead of the traditional Schur complement is awesome. I have run your source code `rootba`. The result image is shown at the end of the issue. From the plot, we can see that QR-32 (single-precision QR in rootba) is slower than Ceres and the Schur complement solver. I am puzzled by this. Could you help me figure it out?

#!/usr/bin/env bash

MY_EXAM_DATA_FOLDER="./rootba_testing_data_thread16"
declare -a my_exames=("qr32" "qr64" "sc64" "sc32" "ceres")
for i in "${my_exames[@]}"
do
    mkdir -p "$MY_EXAM_DATA_FOLDER/$i"
done

DATA_ROOT_PATH=/home/shaoping/readcode/rootba/data
./bin/bal -C $MY_EXAM_DATA_FOLDER/qr32/ --num-threads 0 --no-debug --no-use-double --use-householder-marginalization --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"
./bin/bal -C $MY_EXAM_DATA_FOLDER/qr64/ --num-threads 0 --no-debug --use-double --use-householder-marginalization --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"
./bin/bal -C $MY_EXAM_DATA_FOLDER/sc64/ --num-threads 0 --no-debug --solver-type SCHUR_COMPLEMENT  --use-double  --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"
./bin/bal -C $MY_EXAM_DATA_FOLDER/sc32/ --num-threads 0 --no-debug --solver-type SCHUR_COMPLEMENT --no-use-double  --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"
./bin/bal -C $MY_EXAM_DATA_FOLDER/ceres/ --num-threads 0 --no-debug --solver-type CERES --use-double  --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"

./scripts/plot-logs.py $MY_EXAM_DATA_FOLDER

NikolausDemmel · 2021-08-06T10:19:58Z

Thanks for your interest. I'll have a look shortly.

NikolausDemmel · 2021-08-07T10:49:29Z

In general, the relative performance of the different methods can, in our experience, depend a lot on the hardware and the number of CPU cores. One aspect is that, in our experiments, rootba seems to take better advantage of parallelization. Also, which method is faster depends a lot on the actual problem. We are not claiming that rootba has better runtime in all situations.

That being said, I've tried your script with the current master and compiled with default settings on two machines, and this is what I get.

2013 Macbook (i7 with 8 virtual cores):

Ubuntu 18.04 Desktop (Xeon W-2133 with 12 virtual cores):

I'm not sure why you see something qualitatively very different. What hardware are you running on?

Two thoughts:

Are you actually running multithreaded with multiple cores?
Are you using OpenBLAS? Maybe try exporting OPENBLAS_NUM_THREADS=1 in your shell before running rootba to make sure the multithreading in OpenBLAS is not interfering with the use of TBB in rootba. (See also the note about OpenBLAS in the readme, which has pointers explaining this in more detail.)
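
The second suggestion could look roughly like the sketch below. It reuses the qr32 command line from the script in the first post. The `BAL_BIN` variable and the existence check around the call are illustrative additions (assumptions, not part of rootba) so that the snippet also runs harmlessly outside a rootba checkout.

```shell
#!/usr/bin/env bash
# Limit OpenBLAS to a single thread so that it does not compete with the
# TBB worker threads used by rootba. Only has an effect if OpenBLAS is the
# BLAS implementation in use.
export OPENBLAS_NUM_THREADS=1

# Paths taken from the script in the first post; adjust to your setup.
# BAL_BIN is a hypothetical helper variable, not something rootba defines.
BAL_BIN="${BAL_BIN:-./bin/bal}"
DATA_ROOT_PATH="${DATA_ROOT_PATH:-/home/shaoping/readcode/rootba/data}"
OUT_DIR="./rootba_testing_data_thread16/qr32"

if [ -x "$BAL_BIN" ]; then
    mkdir -p "$OUT_DIR"
    "$BAL_BIN" -C "$OUT_DIR/" --num-threads 0 --no-debug --no-use-double \
        --use-householder-marginalization \
        --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"
else
    echo "skipping run: $BAL_BIN not found or not executable" >&2
fi
```

Because the variable is exported, every command started later from the same shell inherits it, so placing the `export` line once at the top of the benchmark script should cover all five runs.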

varyshare · 2021-08-07T11:19:00Z

Thank you for helping me! I will try your suggestions.
If we run in thread=1, will the experiment result be similar to multithreading?

NikolausDemmel · 2021-08-07T11:25:16Z

> If we run in thread=1, will the experiment result be similar to multithreading?

No, I expect a different outcome with a different number of threads. Note that OPENBLAS_NUM_THREADS=1 is unrelated to the number of threads you configure for ceres and rootba. That is controlled with the --num-threads command line argument (or the corresponding config entry). But in your script you are already setting it to 0, meaning it should use the number of hardware threads. Maybe the detection of the number of hardware threads is faulty. You can try passing an explicit value. For example, try --num-threads 8 if you have a CPU with 8 (virtual) cores.
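
One way to pass an explicit value without hard-coding it is sketched below. The `detect_threads` helper is hypothetical (it is not part of rootba) and simply asks the operating system for the number of logical CPUs. The `./bin/bal` call reuses the qr32 flags from the script in the first post and is skipped if the binary is not present.

```shell
#!/usr/bin/env bash
# Print the number of logical CPUs, trying common tools in order.
# Falls back to 1 if none of them is available.
detect_threads() {
    if command -v nproc >/dev/null 2>&1; then
        nproc                          # GNU coreutils (Linux)
    elif getconf _NPROCESSORS_ONLN >/dev/null 2>&1; then
        getconf _NPROCESSORS_ONLN      # Linux, macOS
    elif command -v sysctl >/dev/null 2>&1; then
        sysctl -n hw.ncpu              # macOS / BSD
    else
        echo 1
    fi
}

NUM_THREADS="$(detect_threads)"
echo "using --num-threads $NUM_THREADS"

# Paths taken from the script in the first post; adjust to your setup.
DATA_ROOT_PATH="${DATA_ROOT_PATH:-/home/shaoping/readcode/rootba/data}"
OUT_DIR="./rootba_testing_data_thread16/qr32"

if [ -x ./bin/bal ]; then
    mkdir -p "$OUT_DIR"
    ./bin/bal -C "$OUT_DIR/" --num-threads "$NUM_THREADS" --no-debug \
        --no-use-double --use-householder-marginalization \
        --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"
fi
```

If the printed number does not match what you expect for your CPU, that in itself is a useful hint that something in the environment (for example CPU affinity or a container limit) restricts the available threads.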

varyshare · 2021-08-07T12:26:07Z

Hello,
I checked my running environment. My processor is an Intel® Core™ i7-10700 CPU @ 2.90GHz × 16, so it has 16 logical cores.
I have not installed OpenBLAS. After setting --num-threads 8, ceres is still faster than rootba (both QR and SC). I will try another machine tomorrow. My guess is that TBB cannot use multiple threads on my machine.
Thank you again.

NikolausDemmel · 2021-08-07T18:53:34Z

That's a bit strange. Yeah, maybe it is an issue with TBB. Your ceres runtime is similar to my Linux box, but the others are much slower, which is very surprising if it does indeed use multi-threading. Ceres does not use TBB in our configuration AFAIK, so it could make sense.

Maybe you can have a look yourself, but otherwise, you could post here your OS and maybe the full output of a fresh ./scripts/build-external.sh and (after deleting the build folder) ./scripts/build-rootba.sh plus the full command line output of your script. Maybe there is something in the logs that looks odd.

If you are using Ubuntu, you can double-check which BLAS is configured (just in case OpenBLAS got installed as a dependency of something):

update-alternatives --get-selections | grep "blas\|lapack"
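
As a complementary check on Linux, one could also look at what the binary itself is dynamically linked against. This is only a sketch: `linked_math_libs` is a hypothetical helper, it relies on `ldd` (so it will not work on macOS), and libraries that were linked statically will not show up in its output.

```shell
#!/usr/bin/env bash
# Print the BLAS/LAPACK/TBB shared libraries a binary is linked against.
# Prints nothing if the binary is missing, ldd is unavailable, or no
# matching library is linked dynamically.
linked_math_libs() {
    ldd "$1" 2>/dev/null | grep -i -E 'blas|lapack|tbb' || true
}

# BAL_BIN is an illustrative variable; ./bin/bal is the path used in this thread.
BAL_BIN="${BAL_BIN:-./bin/bal}"

if [ -x "$BAL_BIN" ]; then
    linked_math_libs "$BAL_BIN"
else
    echo "skipping check: $BAL_BIN not found or not executable" >&2
fi
```

A line mentioning `openblas` in the output would suggest that the OPENBLAS_NUM_THREADS setting discussed above is relevant for that build.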

DengueTim · 2022-11-09T18:51:05Z

Hi, I've been playing with this on a Macbook Air M1 with 8 and 4 threads. Using the ./bin/bal executable produces the expected results. However, if I run the individual ./bin/bal_sc and ./bin/bal_qr executables, the accumulated total_time doesn't show as pronounced a difference: 0.312s, 0.522s and 0.344s for qr32, qr64 and sc64 respectively. Also, the total_time values are about 50 times smaller than the times from ./bin/bal. The error looks the same. Why the big difference in runtime?

NikolausDemmel · 2022-11-16T23:35:08Z

That's very curious. Are you sure you have built all the binaries with the same configuration? Beware that by default ROOTBA_DEVELOPER_MODE is ON, which means even if you have different build folders (e.g. for debug or release), all binaries end up in the same bin folder.

Can you try wiping the bin and build folders and recompiling all binaries? If you still see a difference, another thing to confirm is that you are using the same config in all cases. Could you please paste the full command line call and output for all 3 runs?
