Enable GPU optimizations in ORB_SLAM2 to achieve real-time SLAM on the Jetson TX1 embedded computer.
Switch from OpenCV 2.4 to OpenCV 3.1
OpenCV 3.1 introduces several features that are helpful to this project: a custom memory allocator, CUDA streams, and rewritten versions of some essential algorithms, such as FAST and ORB. These features let us take advantage of more CUDA APIs, such as Unified Memory.
Feature extraction reimplemented
There are several execution hotspots in the original ORB_SLAM2, including but not limited to FAST corner detection, Gaussian filtering, and ORB feature extraction. For example, in its keypoint extraction procedure, an image is divided into many small tiles, and FAST is invoked on each tile once or twice in order to achieve high accuracy. The algorithm is effective yet inefficient. Hence we implemented a slightly modified version of it in CUDA and parallelized the work across tiles.
ORB feature extraction is also a costly but parallelizable procedure, so it is implemented in CUDA too.
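The tile-by-tile procedure described above can be sketched in plain C++ as follows. This is only an illustration, not the project's CUDA code: `detectInTile` is a hypothetical stand-in for FAST (a simple four-neighbour contrast test, not the real segment test), and the names `detectTiled`, `highThreshold`, and `lowThreshold` are assumptions made for the example. The "once or twice" behaviour is modelled as a retry with a lower threshold when a tile yields no keypoints.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct KeyPoint { int x; int y; };

// Hypothetical stand-in for FAST (NOT the real segment test): a pixel counts
// as a corner when it differs from all four direct neighbours by more than
// `threshold`. The caller must keep the tile inside the 1-pixel image border.
std::vector<KeyPoint> detectInTile(const std::vector<uint8_t>& img, int width,
                                   int x0, int y0, int x1, int y1, int threshold) {
    std::vector<KeyPoint> found;
    for (int y = y0; y < y1; ++y) {
        for (int x = x0; x < x1; ++x) {
            const int c = img[y * width + x];
            const bool corner =
                std::abs(c - img[(y - 1) * width + x]) > threshold &&
                std::abs(c - img[(y + 1) * width + x]) > threshold &&
                std::abs(c - img[y * width + (x - 1)]) > threshold &&
                std::abs(c - img[y * width + (x + 1)]) > threshold;
            if (corner) found.push_back({x, y});
        }
    }
    return found;
}

// Split the image interior into tiles and run the detector on each tile,
// first with the high threshold, then once more with the low threshold if
// the tile produced nothing. No tile reads another tile's result, which is
// the property that makes per-tile parallel execution possible.
std::vector<KeyPoint> detectTiled(const std::vector<uint8_t>& img, int width, int height,
                                  int tileSize, int highThreshold, int lowThreshold) {
    std::vector<KeyPoint> all;
    for (int ty = 1; ty < height - 1; ty += tileSize) {
        for (int tx = 1; tx < width - 1; tx += tileSize) {
            const int x1 = std::min(tx + tileSize, width - 1);
            const int y1 = std::min(ty + tileSize, height - 1);
            std::vector<KeyPoint> kps =
                detectInTile(img, width, tx, ty, x1, y1, highThreshold);
            if (kps.empty()) {
                // Second invocation on the same tile with a relaxed threshold.
                kps = detectInTile(img, width, tx, ty, x1, y1, lowThreshold);
            }
            all.insert(all.end(), kps.begin(), kps.end());
        }
    }
    return all;
}
```

In a GPU version, each iteration of the two outer loops could plausibly be handled by its own thread block, since the tiles are independent of one another.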
Overlap CPU and GPU execution
However, some irregular code segments still cannot be parallelized, so our next goal is to maximize CPU/GPU overlap. Ideally, if a piece of CPU work completes before a GPU kernel ends, that CPU work is effectively "free". Unfortunately, much of the CPU work has data dependencies on GPU results, so CPU/GPU work must be scheduled carefully. With the help of many profiles (thanks to NVVP), we have worked out a fairly good scheduling scheme that pipelines CPU and GPU work, such that the GPU is kept as busy as possible while the CPU overlaps much of its execution with the GPU.
The purple bars in the "Default domain" row indicate CPU work, and the "Compute" row indicates GPU work.
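The idea that overlapped CPU work is "free" can be illustrated with a toy timing model. This is a simplification under stated assumptions, not a measurement of the project: the `FrameCost` struct and both functions are hypothetical names, the model ignores kernel launch and synchronization overhead, and it splits CPU work into a part that is independent of the current GPU result and a part that must wait for it.

```cpp
#include <algorithm>

// Hypothetical per-frame costs in milliseconds.
struct FrameCost {
    double gpuMs;             // time spent in GPU kernels
    double independentCpuMs;  // CPU work that does not need this frame's GPU result
    double dependentCpuMs;    // CPU work that must wait for the GPU result
};

// No overlap: the CPU idles while the GPU runs, so all costs add up.
double serialTimeMs(const FrameCost& c) {
    return c.gpuMs + c.independentCpuMs + c.dependentCpuMs;
}

// With overlap: the independent CPU work runs while the GPU is busy, so only
// the longer of the two counts. The dependent CPU work still runs afterwards.
double overlappedTimeMs(const FrameCost& c) {
    return std::max(c.gpuMs, c.independentCpuMs) + c.dependentCpuMs;
}
```

When `independentCpuMs` is no larger than `gpuMs`, it drops out of the overlapped total entirely, which is the "free" case. The dependent part always remains on the critical path, which is why the scheduling has to account for data dependencies.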
Following are some charts of the speedups we achieved on an ordinary PC and on a Jetson TX1. The PC has a Xeon E3 1231 CPU and a GTX 760 GPU. The statistics were measured using selected sequences of the KITTI dataset and live images captured from the camera module on the TX1.
After enabling GPU optimization, the frame rate of live camera tracking increases from 5.98 to 14.42 fps, and the frame processing time drops from 0.166 s to 0.068 s!