How to Build a Small and Lightweight Depth Perception System Through Stereo Vision

The advantages of using stereo vision for depth perception are numerous: it works well outdoors, it can provide high-resolution depth maps, and it can be built from low-cost off-the-shelf components. Developing a customized embedded stereo perception system is also relatively straightforward if you follow the instructions provided here.


There are various 3D sensor options for implementing depth perception systems, including stereo vision cameras, lidars, and time-of-flight (ToF) cameras. Each option has its pros and cons. Among them, embedded depth-sensing stereo systems are low-cost, rugged, suitable for outdoor use, and capable of providing high-resolution color point clouds.

There are various off-the-shelf stereo perception systems on the market today. Sometimes a system engineer needs to build a custom system to meet specific application needs based on factors such as accuracy, baseline (distance between two cameras), field of view, and resolution.

In this article, we first introduce the main parts of a stereo vision system and then provide instructions for building a custom stereo camera using off-the-shelf hardware components and open source software. Since this setup is focused on embedded systems, it computes the depth map of any scene in real time without the need for a host computer. In a follow-up article, we will discuss how to build a custom stereo system for use with a host computer.

Overview of Stereo Vision

Stereo vision is the extraction of 3D information from digital images by comparing a scene from two different viewpoints. The relative position of an object in the two image planes provides information about the object's depth from the camera.

An overview of the stereo vision system is shown in Figure 1 and includes the following key steps:

1. Calibration: Camera calibration includes intrinsic calibration and extrinsic calibration. The intrinsic calibration determines each camera's image center, focal length, and distortion parameters, while the extrinsic calibration determines the 3D positions of the cameras relative to each other. This is a crucial step in many computer vision applications, especially when metric information about the scene, such as depth, is required. We discuss the calibration steps in detail in the calibration section below.

2. Rectification: Stereo rectification is the process of reprojecting the two image planes onto a common plane parallel to the line between the camera centers. After rectification, corresponding points lie on the same image row, which greatly reduces the cost and ambiguity of matching. This step is handled by the code provided for building your own system.

3. Stereo matching: This is the process of matching pixels between the left and right images, producing a disparity image. The code provided uses the semi-global matching (SGM) algorithm.

4. Triangulation: Triangulation is the process of determining the 3D position of a point given its projections onto the two images. The disparity image is converted into a 3D point cloud.

Figure 1: Overview of the Stereo Vision System
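The matching and triangulation steps above can be illustrated with a minimal pure-Python sketch. It matches a small window along a single rectified scanline by minimizing the sum of absolute differences (a toy stand-in for the SGM algorithm the real system uses) and then converts the resulting disparity to depth via Z = f·B/d. The scanline values, focal length, and baseline below are illustrative only.

```python
# Toy illustration of stereo matching + triangulation on ONE rectified scanline.
# A real system runs SGM over full images; values here are made up for clarity.

def match_disparity(left, right, x, half=1, max_d=8):
    """Find the disparity of pixel x by minimizing the sum of absolute
    differences (SAD) between small windows on a rectified scanline."""
    best_d, best_cost = 0, float("inf")
    for d in range(0, max_d + 1):
        if x - d - half < 0:
            break  # window would fall off the left edge of the right image
        cost = sum(abs(left[x + k] - right[x - d + k])
                   for k in range(-half, half + 1))
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d

def depth_from_disparity(d, f_px, baseline_m):
    """Triangulation: Z = f * B / d (f in pixels, B in metres, d in pixels)."""
    return f_px * baseline_m / d

# The right scanline is the left one shifted by 3 pixels,
# i.e. a fronto-parallel surface with a true disparity of 3.
left = [10, 10, 10, 10, 50, 90, 50, 10, 10, 10, 10, 10]
right = [10] * len(left)
for i, v in enumerate(left):
    if i - 3 >= 0:
        right[i - 3] = v

d = match_disparity(left, right, x=5)
print(d)  # prints 3
print(depth_from_disparity(d, f_px=1739, baseline_m=0.12))  # ≈ 69.6 m (toy numbers)
```

The key point is that rectification lets the search run along one row only; SGM improves on this window-based cost by aggregating costs along multiple image paths.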

Design example

Let’s look at an example stereo system design. The following are the application requirements for a mobile robot operating in a dynamic environment with fast-moving objects. The relevant scene size is 2 m, the distance from the camera to the scene is 3 m, and the required accuracy at 3 m is 1 cm.

See this article for more details on stereo accuracy. The depth error is given by ΔZ = Z² / (B · f) · Δd, which depends on the following factors:

• Z is the range
• B is the baseline
• f is the focal length in pixels, related to the camera’s field of view and image resolution
• Δd is the disparity error in pixels
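The formula can be checked against this article's own numbers. The sketch below assumes a 6 mm lens and 12 cm baseline (both used in the build later in the article), a 3.45 µm pixel pitch for the IMX273 and a subpixel matching accuracy Δd of 0.25 px; the last two values are our assumptions, not stated in the article.

```python
# Depth error: dZ = Z^2 / (B * f) * dd, evaluated for this article's setup.
# Assumed (not from the article): 3.45 µm pixel pitch, dd = 0.25 px.

def depth_error(z_m, baseline_m, f_px, delta_d_px=0.25):
    """Expected depth error (metres) at range z_m."""
    return z_m ** 2 / (baseline_m * f_px) * delta_d_px

f_px = 6e-3 / 3.45e-6            # 6 mm lens / 3.45 µm pixels ≈ 1739 px
dz = depth_error(z_m=3.0, baseline_m=0.12, f_px=f_px)
print(f"{dz * 100:.2f} cm")      # ≈ 1 cm at 3 m, matching the requirement
```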

There are various design options that meet these requirements. Based on the scene size and distance requirements above, we can determine the lens focal length for a particular sensor. Combined with the baseline, the formula above gives the expected depth error at 3 m, which we check against the accuracy requirement.

Figure 2 shows two options: a low-resolution camera with a long baseline, or a high-resolution camera with a short baseline. The first option gives a larger camera with lower computing requirements, while the second gives a more compact camera with higher computing requirements. For this application we chose the second option, because its compact size is better suited to mobile robots, and we can use the Quartet embedded solution for TX2, which has a powerful on-board GPU to handle the processing load.

Figure 2: Stereo system design options for example application
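As a sanity check on the compact option, a pinhole-camera calculation confirms that the chosen sensor and lens cover the required scene. The 1440-pixel image width appears later in this article; the 3.45 µm pixel pitch for the IMX273 is our assumption.

```python
# Pinhole model: horizontal coverage at distance Z is Z * sensor_width / f.
# 1440 px width is from the article; 3.45 µm pixel pitch is assumed.

sensor_width_m = 1440 * 3.45e-6            # ≈ 4.97 mm
coverage_m = 3.0 * sensor_width_m / 6e-3   # 6 mm lens, scene at 3 m
print(f"{coverage_m:.2f} m")               # ≈ 2.48 m, enough for the 2 m scene
```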

Hardware requirements

In this example, we mounted two Blackfly S board-level 1.6-megapixel cameras, each using the Sony Pregius IMX273 global shutter sensor, on a 3D-printed mounting rod at a 12 cm baseline. Both cameras have identical 6 mm S-mount lenses. The cameras are connected to the Quartet Embedded Solution for TX2 custom carrier board using two FPC cables. To synchronize the left and right cameras so they capture images at the same time, a sync cable was made to connect the two cameras. Figure 3 shows the front and rear views of our custom embedded stereo system.

Figure 3: Front and rear views of a custom embedded stereo system

The following list contains all hardware components:

• Quartet carrier board with 8 GB TX2 module
• Blackfly S board-level camera (1.6 MP, 226 FPS, Sony IMX273, color) × 2
• S-mount and IR filter for BFS color board-level cameras × 2
• 6 mm S-mount lens × 2
• 15 cm FPC cable for board-level Blackfly S × 2
• NVIDIA® Jetson™ TX2/TX2 4GB/TX2i active cooler
• NVIDIA® Jetson™ TX2/TX2 4GB/TX2i active heat sink
• Sync cable (homemade)
• Mounting rod (homemade)

Both lenses should be adjusted so that the cameras are in focus at the desired distance for your application. Tighten the screw on each lens (circled in red in Figure 4) to maintain focus.

Figure 4: Stereo system side view showing lens screws

Software requirements

a. Spinnaker

The Teledyne FLIR Spinnaker SDK comes pre-installed on the Quartet Embedded Solution for TX2. Spinnaker is required to communicate with the cameras.

b. OpenCV 4.5.2 with CUDA support

SGM, the stereo matching algorithm we are using, requires OpenCV 4.5.1 or higher. Download the zip file containing the code for this article and extract it to the StereoDepth folder. The script to install OpenCV is included there. Type the following commands in the terminal:

cd ~/StereoDepth
chmod +x

The installer will ask you to enter an administrator password and then start installing OpenCV 4.5.2. Downloading and building OpenCV can take several hours.


Camera calibration

The code for grabbing and calibrating stereo images can be found in the “Calibration” folder. Use the SpinView GUI to identify the serial numbers of the left and right cameras. In our setup, the right camera is the master and the left camera is the slave. Copy the master and slave camera serial numbers into lines 60 and 61 of grabStereoImages.cpp. Build the executable with the following commands in the terminal:

cd ~/StereoDepth/Calibration
mkdir build
mkdir -p images/{left,right}
cd build
cmake ..
make

Print out a checkerboard pattern from this link and stick it on a flat surface to use as a calibration target. For best calibration results, set Exposure Auto to Off in SpinView and adjust the exposure so that the checkerboard pattern is sharp and the white squares are not overexposed, as shown in Figure 5. Gain and exposure can be set back to automatic in SpinView after the calibration images have been collected.

Figure 5: SpinView GUI Settings

To start collecting images, type:

./grabStereoImages
The code should start collecting images at about 1 frame per second. Left images are stored in the images/left folder and right images in the images/right folder. Move the target so that it appears in every corner of the image, rotate it, and capture images from near and far. By default, the program captures 100 image pairs, but this can be changed with a command-line argument:

./grabStereoImages 20

This will only collect 20 pairs of images. Note that this will overwrite any images previously written to the folder. Some example calibration images are shown in Figure 6.

Figure 6: Example Calibration Image

After collecting the images, run the calibration Python code by typing:

cd ~/StereoDepth/Calibration

This generates two files, “intrinsics.yml” and “extrinsics.yml”, which contain the intrinsic and extrinsic parameters of the stereo system. The code assumes 30 mm checkerboard squares by default, but this can be edited as needed. At the end of the calibration, it reports the RMS error, which indicates how good the calibration is. The RMS error for a good calibration should typically be less than 0.5 pixels.
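To make the 0.5-pixel threshold concrete: the RMS error is the root-mean-square distance, in pixels, between the detected checkerboard corners and where the calibrated camera model reprojects them. The residuals below are made-up illustration values, not output from the calibration code.

```python
# RMS reprojection error over a set of (dx, dy) corner residuals in pixels.
# Residuals here are invented purely to illustrate the computation.
import math

residuals_px = [(0.2, -0.1), (-0.3, 0.2), (0.1, 0.4), (-0.2, -0.2)]
rms = math.sqrt(sum(dx * dx + dy * dy for dx, dy in residuals_px)
                / len(residuals_px))
print(f"{rms:.3f} px")  # prints 0.328 px, well under the 0.5 px threshold
```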

Real-time depth map

The code to compute the disparity map in real time is in the “Depth” folder. Copy the camera serial numbers into lines 230 and 231 of live_disparity.cpp. Build the executable with the following commands in the terminal:

cd ~/StereoDepth/Depth
mkdir build
cd build
cmake ..
make

Copy the “intrinsics.yml” and “extrinsics.yml” files obtained during the calibration step into this folder. To run the live depth map demo, type:

./live_disparity
It shows the left camera image (the original unrectified image) and the depth map (our final output). Some example output is shown in Figure 7. The distance from the camera is color-coded according to the legend to the right of the depth map. A black area in the depth map means that no disparity data was found there. Thanks to the NVIDIA Jetson TX2 GPU, the system runs at up to 5 fps at 1440 × 1080 resolution and up to 13 fps at 720 × 540.

To see the depth of a specific point, click on the point in the depth map and the depth will be displayed, as shown in the last example in Figure 7.

Figure 7: Sample left camera images and corresponding depth maps. The bottom depth map also shows the depth at a selected point.
