Optimize opencv builds #790

Open
fnoop opened this issue Oct 19, 2018 · 13 comments

fnoop commented Oct 19, 2018

Bump to OpenCV 4 and optimise for x86, Raspberry Pi and Tegra platforms.

fnoop commented Oct 23, 2018

Aruco compiler flags for optimised builds:
fnoop/aruco@db2feab#diff-d7ed762694f84a7b37a09f7dde5fa6aaR118

fnoop commented Oct 23, 2018

Useful info from vision_landing
goodrobots/vision_landing#95

fnoop commented Oct 23, 2018

Raspberry:

Look into the VFPv4 and VFPv5 compiler flags, e.g. -mfpu=neon-vfpv4
CMake options to set: -D BUILD_TESTS=OFF -D BUILD_PERF_TESTS=OFF -D OPENCV_ENABLE_NONFREE=ON

Can we use -ffast-math?
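
A minimal sketch of how a build script might assemble these options into a cmake command line. The three BUILD/NONFREE options are the ones listed above; the function name, the `../opencv` source path and the use of `CMAKE_C_FLAGS`/`CMAKE_CXX_FLAGS` to carry `-mfpu=neon-vfpv4` are assumptions for illustration. OpenCV adds its own `-mfpu` flag during configuration, so whether an extra flag passed this way takes effect would need checking in the cmake summary.

```python
import shlex


def opencv_cmake_args(source_dir, extra_flags=("-mfpu=neon-vfpv4",)):
    """Return a cmake argument list for an OpenCV Release build.

    The BUILD_TESTS, BUILD_PERF_TESTS and OPENCV_ENABLE_NONFREE options
    come from the comment above; everything else is an assumption.
    """
    args = [
        "cmake",
        "-D", "CMAKE_BUILD_TYPE=Release",
        "-D", "BUILD_TESTS=OFF",
        "-D", "BUILD_PERF_TESTS=OFF",
        "-D", "OPENCV_ENABLE_NONFREE=ON",
    ]
    if extra_flags:
        joined = " ".join(extra_flags)
        # Standard CMake variables for extra C and C++ compiler flags.
        args += ["-D", f"CMAKE_C_FLAGS={joined}"]
        args += ["-D", f"CMAKE_CXX_FLAGS={joined}"]
    args.append(source_dir)
    return args


if __name__ == "__main__":
    # Print the shell-quoted command instead of running it.
    print(" ".join(shlex.quote(a) for a in opencv_cmake_args("../opencv")))
```

Passing `extra_flags=()` leaves the compiler flags to OpenCV's own detection, which makes it easy to compare the two configurations.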

Note from this thread (it uses an old version of gcc):
https://www.raspberrypi.org/forums/viewtopic.php?t=107203

And I compiled your code with both gcc-5.0 and gcc-4.8, using the flags -mcpu=cortex-a7 -mfloat-abi=hard -mfpu=neon-vfpv4. I find the NEON version of the code runs slower on both gcc-5.0 and gcc-4.8, and the NEON version on gcc-4.8 runs faster than on gcc-5.0.
If I only use the flag -mfpu=neon-vfpv4, the code compiled on gcc-4.8 is about 300% slower.

Good thread:
https://www.raspberrypi.org/forums/viewtopic.php?t=144115

fnoop commented Oct 23, 2018

Look into OpenCL:
https://github.com/doe300/VC4CL


fnoop added this to the 2.0 milestone Aug 10, 2019

fnoop commented May 9, 2020

Linked to #963, which deals with optimizing for the NVIDIA Tegra platform.

fnoop modified the milestones: 2.0, 1.2 May 9, 2020

fnoop commented May 14, 2020

Raspberry cmake:

-- General configuration for OpenCV 4.3.0 =====================================
--   Version control:               4.3.0
--
--   Extra modules:
--     Location (extra):            /srv/maverick/var/build/opencv_contrib/modules
--     Version control (extra):     4.3.0
--
--   Platform:
--     Timestamp:                   2020-05-13T23:04:46Z
--     Host:                        Linux 4.19.97-v7l+ armv7l
--     CMake:                       3.13.4
--     CMake generator:             Unix Makefiles
--     CMake build tool:            /usr/bin/make
--     Configuration:               Release
--
--   CPU/HW features:
--     Baseline:                    VFPV3 NEON
--       requested:                 DETECT
--       required:                  VFPV3 NEON
--   C/C++:
--     Built as dynamic libs?:      YES
--     C++ standard:                11
--     C++ Compiler:                /usr/bin/c++  (ver 8.3.0)
--     C++ flags (Release):         -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Winit-self -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -mfpu=neon -fvisibility=hidden -fvisibility-inlines-hidden -O3 -DNDEBUG  -DNDEBUG
--     C++ flags (Debug):           -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Winit-self -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -mfpu=neon -fvisibility=hidden -fvisibility-inlines-hidden -g  -O0 -DDEBUG -D_DEBUG
--     C Compiler:                  /usr/bin/cc
--     C flags (Release):           -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Winit-self -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -mfpu=neon -fvisibility=hidden -O3 -DNDEBUG  -DNDEBUG
--     C flags (Debug):             -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Winit-self -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -mfpu=neon -fvisibility=hidden -g  -O0 -DDEBUG -D_DEBUG
--     Linker flags (Release):      -Wl,--gc-sections -Wl,--as-needed
--     Linker flags (Debug):        -Wl,--gc-sections -Wl,--as-needed
--     ccache:                      NO
--     Precompiled headers:         NO
--     Extra dependencies:          dl m pthread rt

--   OpenCV modules:
--     To be built:                 alphamat aruco bgsegm bioinspired calib3d ccalib core datasets dnn dnn_objdetect dnn_superres dpm face features2d flann freetype fuzzy gapi hdf hfs highgui img_hash imgcodecs imgproc intensity_transform line_descriptor ml objdetect optflow phase_unwrapping photo plot python2 python3 quality rapid reg rgbd saliency shape stereo stitching structured_light superres surface_matching text tracking ts video videoio videostab xfeatures2d ximgproc xobjdetect xphoto
--     Disabled:                    world
--     Disabled by dependency:      -
--     Unavailable:                 cnn_3dobj cudaarithm cudabgsegm cudacodec cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cudev cvv java js matlab ovis sfm viz
--     Applications:                tests perf_tests examples apps
--     Documentation:               NO
--     Non-free algorithms:         YES
--   Video I/O:
--     DC1394:                      YES (2.2.5)
--     FFMPEG:                      YES
--       avcodec:                   YES (58.35.100)
--       avformat:                  YES (58.20.100)
--       avutil:                    YES (56.22.100)
--       swscale:                   YES (5.3.100)
--       avresample:                YES (4.0.0)
--     GStreamer:                   YES (1.14.4)
--     OpenNI2:                     NO
--     v4l/v4l2:                    YES (linux/videodev2.h)
--
--   Parallel framework:            TBB (ver 2020.0 interface 11100)
--
--   Trace:                         YES (with Intel ITT)
--
--   Other third-party libraries:
--     Lapack:                      YES (/srv/maverick/software/openblas/lib/libopenblas.so)
--     Eigen:                       YES (ver 3.3.7)
--     Custom HAL:                  YES (carotene (ver 0.0.1))
--     Protobuf:                    build (3.5.1)
--
--   OpenCL:                        YES (no extra features)
--     Include path:                /srv/maverick/var/build/opencv/3rdparty/include/opencl/1.2
--     Link libraries:              Dynamic load

fnoop commented May 14, 2020

On Raspbian Buster, gcc 8.3.0 (Raspbian 8.3.0-6+rpi1) seems to choose good optimisations when asked for the native target:

[flight] [mav@maverick-raspberry ~/software/opencv]$ gcc -mcpu=native -march=native -Q --help=target
The following options are target specific:
  -mabi=                      		aapcs-linux
  -march=                     		armv8-a+crc+simd
  -mcpu=                      		cortex-a72
  -mfloat-abi=                		hard
  -mfpu=                      		vfp

-mfpu could perhaps be changed: gcc reports vfp here, while the OpenCV cmake output above shows -mfpu=neon being passed. Available FPU options:
Known ARM FPUs (for use with the -mfpu= option): auto crypto-neon-fp-armv8 fp-armv8 fpv4-sp-d16 fpv5-d16 fpv5-sp-d16 neon neon-fp-armv8 neon-fp16 neon-vfpv3 neon-vfpv4 vfp vfp3 vfpv2 vfpv3 vfpv3-d16 vfpv3-d16-fp16 vfpv3-fp16 vfpv3xd vfpv3xd-fp16 vfpv4 vfpv4-d16
Note that NEON is a SIMD implementation rather than an FPU replacement, and transferring data back and forth to NEON has a cost. -mfpu=neon-vfpv4 might be a good option.
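
To compare these settings across boards, the `gcc -Q --help=target` output can be parsed into a dictionary. The following is a rough sketch: the function names are hypothetical, and `SAMPLE` is the output quoted above with its whitespace simplified.

```python
import re
import subprocess

# Matches lines such as "  -mfpu=    vfp" in `gcc -Q --help=target` output.
_OPTION_RE = re.compile(r"^\s*(-m[\w-]+)=\s+(\S+)\s*$")


def parse_target_options(text):
    """Map each '-mname=' line that has a value to that value."""
    options = {}
    for line in text.splitlines():
        match = _OPTION_RE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
    return options


def native_target_options(compiler="gcc"):
    """Run the compiler for the native target and parse its report."""
    result = subprocess.run(
        [compiler, "-mcpu=native", "-march=native", "-Q", "--help=target"],
        capture_output=True, text=True, check=True,
    )
    return parse_target_options(result.stdout)


# Output from the Raspberry above, whitespace simplified.
SAMPLE = """\
The following options are target specific:
  -mabi=          aapcs-linux
  -march=         armv8-a+crc+simd
  -mcpu=          cortex-a72
  -mfloat-abi=    hard
  -mfpu=          vfp
"""

if __name__ == "__main__":
    opts = parse_target_options(SAMPLE)
    print(opts["-mcpu"], opts["-mfpu"])  # prints: cortex-a72 vfp
```

On a real board, `native_target_options()` gives the same dictionary from the live compiler. It needs gcc on the PATH and Python 3.7 or later.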

This suggests -mfpu=neon-fp-armv8:
https://www.raspberrypi.org/forums/viewtopic.php?t=244095
and this:
https://forums.gentoo.org/viewtopic-t-1098908-start-0.html
and:
https://github.com/thortex/rpi3-opencv

Excellent reference:
https://gist.github.com/fm4dd/c663217935dc17f0fc73c9c81b0aa845

fnoop commented May 14, 2020

Note: the OpenCV Release build sets -O3, which automatically turns on -ftree-vectorize.
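
One way to check this on a given compiler is to inspect the output of `gcc -O3 -Q --help=optimizers`, which marks each flag as `[enabled]` or `[disabled]`. The sketch below uses hypothetical helper names. Which vectorizer flags appear may vary between gcc versions (for example `-ftree-loop-vectorize` and `-ftree-slp-vectorize` rather than `-ftree-vectorize` itself), so treat the result as a hint, not a definitive answer.

```python
import re
import subprocess

# Matches lines such as "  -ftree-slp-vectorize    [enabled]".
_FLAG_RE = re.compile(r"^\s*(-f[\w-]+)\s+\[(enabled|disabled)\]\s*$")


def optimizer_states(text):
    """Map each on/off -f flag in the report to True or False."""
    states = {}
    for line in text.splitlines():
        match = _FLAG_RE.match(line)
        if match:
            states[match.group(1)] = match.group(2) == "enabled"
    return states


def vectorizer_flags(states):
    """Keep only the flags whose name mentions vectorization."""
    return {flag: on for flag, on in states.items() if "vectorize" in flag}


def gcc_vectorizer_flags(level="-O3", compiler="gcc"):
    """Ask the compiler which vectorizer flags a given -O level enables."""
    result = subprocess.run(
        [compiler, level, "-Q", "--help=optimizers"],
        capture_output=True, text=True, check=True,
    )
    return vectorizer_flags(optimizer_states(result.stdout))
```

Comparing `gcc_vectorizer_flags("-O2")` with `gcc_vectorizer_flags("-O3")` on the target board shows what the Release build gains from `-O3`.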

fnoop added a commit that referenced this issue May 16, 2020
fnoop added a commit that referenced this issue Jul 23, 2020