Object distance estimation using a monovision camera

ABSTRACT


INTRODUCTION
Estimating the depth of an object from its pose in an environment is an essential part of computer vision, but it is difficult to do with a monocular camera. A monocular camera acquires only 2D information about an object in a scene because the perspective transformation discards depth information [1], [2]. Recovering the depth, and with it the complete 3D pose of the object, is therefore useful in many robotic applications such as pose estimation, picking and placing, and mapping. Traditional methods based on Bluetooth, laser, ultrasonic, and IR sensors have been used in the past to estimate an object's distance [3]-[5], but with the advent of vision sensors, stereo vision and monocular vision have become the two predominant methods for estimating the object's distance in image-based visual servoing.

Stereo vision, also known as the computer-based passive approach, uses two cameras arranged in a binocular structure, like human eyes, to estimate the depth of the object [6]-[8]. The two cameras are placed horizontally apart, at equal distances from their center point, so that each captures a 2D image of the object [9]. Because of the distance separating the two cameras, the same point appears at different positions in the two images, and this disparity is used for computing the depth information where the fields of view of the two cameras intersect. The stereo vision method is highly accurate but requires a large number of images to be processed to achieve that precision, which makes it computationally intensive. It is also expensive to implement because it requires two cameras.

In contrast, the monocular vision method uses a single camera to estimate the object's distance based on reference points in the camera's field of view [10]. This method is fairly accurate yet not computationally intensive, because it requires only a few image registrations, which lets the computer process the images faster. It can therefore reduce the system workload and shorten the processing time [11]. The monocular method is also cheap and simple to handle for visual servoing because it uses only one camera.

RELATED WORK
Object distance measurement plays a vital role in acquiring the depth information that complements the classic 2D visual perception used in robotic and autonomous systems applications. A brief review of the literature on distance estimation is presented in this section. Zhou et al. [12] used a monocular vision method to find the position and orientation of an object at a distance of 5 m. The relative translation and rotation values in the X, Y, and Z directions were obtained from an unconstrained linear equation in the rotation and translation matrices R and T, and were computed using the inverse least-square method. Krishnan et al. [13] proposed a complex log mapping method to measure the distance between the camera and the surface of an object with an arbitrary pattern. The method uses two images taken at two known camera positions while the camera moves along its optical axis. The distance from the object to the camera is then estimated by computing the ratio between the sizes of the object projected on the two images.
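The size-ratio idea can be illustrated with a short worked equation. This is only a sketch under an ideal pinhole model, not the complex log mapping formulation of [13]; the symbols are assumptions introduced here: S is the physical size of the object, f the focal length, d the initial distance to the object, and Δ the known distance the camera moves toward the object along its optical axis. The projected sizes in the two images are then s_1 = fS/d and s_2 = fS/(d − Δ), so S and f cancel in the ratio:

```latex
\frac{s_2}{s_1} = \frac{d}{d-\Delta}
\quad\Longrightarrow\quad
d = \frac{\Delta\,(s_2/s_1)}{(s_2/s_1) - 1}
```

Under these assumptions, only the known displacement and the measured size ratio are needed to recover d.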
Chang et al. [14] proposed an efficient neural network method for self-localization of a humanoid robot. Yang and Cao [15] proposed a 6D object pose estimation method that uses the Levenberg-Marquardt algorithm to refine the result of the decomposed homography matrix. Zhang et al. [16] proposed a method of localizing an object based on perspective transformation. Their method was presented in three stages. The first stage dealt with the calibration of the camera to obtain its intrinsic parameters. The second constituted a model for computing the object's distance through perspective transformation by mapping 3D points in the real world to the 2D image of a pinhole camera. The third stage, the measurement of the absolute distance between the camera and the target object, was achieved through the geometry formed by the perspective projections.
Muslikhin et al. [17] used a machine learning algorithm to classify the positions of the object in the image of a mono camera and then used the k-nearest neighbors (k-NN) approach to find the nearest point of the centroids to the closest class. Bui et al. [18] proposed the use of a single camera with a triangulation method to measure the distance of an object indirectly. In this method, the distance to the object is determined from one known angle and two sides of a triangle. Zheng et al. [19] presented a method of measuring an object's distance with a monocular vision camera on a mobile robot, in which the distance between the mobile robot and the target object was determined using sub-pixel image processing, mapping, and path planning. Zhu and Fang [20] initially proposed to address the distance estimation problem with a deep-learning-based method that directly predicts the distance of a given object in red, green, and blue (RGB) images without using the intrinsic parameters of the camera. They further enhanced the model with a key point regressor in which a projection loss was defined to estimate the distance of objects close to the monocular camera, and facilitated the training and evaluation tasks with extended KITTI and nuScenes (mini) datasets of specified objects' distances.
Vajgl et al. [21] presented Dist-YOLO, a method based on the YOLO architecture in which the original loss function is updated to estimate the absolute distance of an object using information from a monocular camera. Most of the methods for estimating an object's distance in the literature are computationally intensive. In this paper, by contrast, a monovision camera was used to obtain a set of image-based data together with the measured distances of the object, and a curve-fitting technique was applied to these data to derive a nonlinear function for estimating the object's distance.

METHOD
To determine the distance of the object from the camera, which is the depth information, a single Pixy2 camera was used in this study. The Pixy2 camera is a vision sensor with an embedded image processor that processes captured RGB images and segments them to recognize objects of different colors using its built-in color-based filtering algorithm, called color-connected components (CCC). It can track up to seven different colors (red, blue, green, yellow, orange, cyan, and violet) and can also track the object's position in the image in two dimensions. The front and back views of the Pixy2 camera are shown in Figure 1.
Although the Pixy2 camera can perform other functions such as line tracking and barcode reading [22], in this study it was trained to recognize a specific object of a single color, positioned at sequential distances from the camera, to acquire a dataset for determining the object's distance.

Camera set-up
To train the Pixy2 camera to acquire the visual information of an object in its field of view, the vision sensor needs to be installed in a position where the target object is visible to the camera, in order to avoid occlusion. The eye-in-hand configuration was therefore used in this paper. In the eye-in-hand configuration, the camera is mounted on the manipulator, either after or before the wrist of the robotic arm [23], [24]. Figure 2 shows the Pixy2 camera mounted on the robot manipulator, which is used for pick-and-place purposes.

Distance measurement using a single Pixy2 camera
To measure the distance of the object using a single Pixy2 camera, a set of training data for estimating the object's distance was first generated experimentally. A ripe tomato, which is completely red, was used as the target object in the experiment and was trained to be recognized by the Pixy2 camera using its PixyMon software. The ripe tomato was positioned at horizontal distances between 430 and 580 mm in front of the robotic arm in the real world and, simultaneously, at vertical positions between 0 and 207 mm of the camera's image height. The horizontal and vertical distance parameters used in training the object were based on the manipulator's length (580 mm) and the entire image height (207 mm) of the camera. The object (ripe tomato) was placed sequentially in the camera's field of view (FOV) as shown in Figure 3.
As the ripe tomato was placed sequentially in the camera's FOV, its distance from the camera's lens at each position was measured using a measuring tape with an accuracy of ±0.5. The corresponding image data were estimated by the Pixy2 camera, based on its image processing and object tracking capabilities, while the ripe tomato was placed sequentially within the specified horizontal and vertical distance parameters. The training data obtained from the experiment by placing the ripe tomato in sequential positions relative to the camera's reference position is given in Table 1.

To determine the object's distance, which is the z-coordinate of the ripe tomato irrespective of its pose in the camera's FOV, the least-square method, which finds the best-fit curve for a given dataset by minimizing the sum of squared deviations [25], was employed to obtain the relationship between the area of the bounding box and the actual distance from the training data in Table 1. The curve fit, plotted in Figure 5, produced a non-linear relationship between the actual distance and the area of the bounding box, given in (1):

$$y = 3285.4\,x^{-0.322} \qquad (1)$$

where y is the actual distance and x is the area of the bounding box. Hence, the distance is as in (2):

$$\text{Distance} = 3285.4\,(\text{Area})^{-0.322} \qquad (2)$$

The relationship between the distance and the bounding-box area in (2) was used to estimate the distance of the object from the camera.
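The fitting step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the power-law form of (2) and fits it by ordinary least squares on the logarithms of area and distance, which is one common way to fit such a curve and may differ from the procedure actually used. The function names and the area values are hypothetical; since Table 1 is not reproduced here, the distances are generated synthetically from (2) purely to show the mechanics.

```python
import numpy as np

# Coefficients reported in the distance-area relationship (2).
A_COEF = 3285.4
B_EXP = -0.322


def bounding_box_area(width, height):
    """Area of the bounding box from its width and height (hypothetical helper)."""
    return width * height


def estimate_distance(area):
    """Estimated distance from bounding-box area: Distance = a * Area**b."""
    return A_COEF * np.asarray(area, dtype=float) ** B_EXP


def fit_power_law(area, distance):
    """Least-squares fit of distance = a * area**b.

    Taking logs gives ln(distance) = ln(a) + b * ln(area), a straight line,
    so a degree-1 polynomial fit in log-log space yields b (slope) and ln(a).
    Note: this minimizes squared error in log space (an assumption here).
    """
    slope, intercept = np.polyfit(np.log(area), np.log(distance), 1)
    return float(np.exp(intercept)), float(slope)


# Hypothetical bounding-box areas; NOT the values from Table 1.
areas = np.array([220.0, 260.0, 310.0, 370.0, 450.0, 550.0])
# Synthetic, noise-free distances generated from (2) for illustration only.
distances = estimate_distance(areas)

a, b = fit_power_law(areas, distances)
print(f"a = {a:.1f}, b = {b:.3f}")
```

With real measurements in place of the synthetic arrays, `fit_power_law` would return the fitted coefficients, and `estimate_distance(bounding_box_area(w, h))` would give the distance estimate for a newly detected block of width `w` and height `h`.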

RESULTS AND DISCUSSION
The distance-area relationship in (2) was used to estimate the distance of the ripe tomato from the Pixy2 camera from the area of the bounding box, and the estimates were compared with the actual distance data in Table 1. The result was validated by determining the average error, that is, the average difference between the actual distance and the estimated distance. It can be seen from Table 2 that the slight deviation in the estimated distance resulted in an average error of 1.33 mm. The estimated and actual distances are also compared graphically in Figure 6.
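The validation metric can be sketched as follows. This is an illustrative example that assumes the average error is the mean of the absolute differences between actual and estimated distances; the text does not state whether absolute or signed differences were averaged. The function name and the sample values are hypothetical and are not taken from Table 2.

```python
import numpy as np


def average_error(actual_mm, estimated_mm):
    """Mean absolute difference between actual and estimated distances (mm)."""
    actual = np.asarray(actual_mm, dtype=float)
    estimated = np.asarray(estimated_mm, dtype=float)
    return float(np.mean(np.abs(actual - estimated)))


# Hypothetical distances in mm; NOT the values from Table 2.
actual = [430.0, 460.0, 490.0, 520.0, 550.0, 580.0]
estimated = [431.0, 458.5, 491.5, 519.0, 551.5, 579.0]

print(f"average error = {average_error(actual, estimated):.2f} mm")
```

Using absolute differences prevents positive and negative deviations from cancelling, so the metric reflects the typical size of the error rather than its bias.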

CONCLUSION
The low-cost monovision camera and the least-square method used in this paper can estimate the distance of the object from the camera irrespective of its pose in the camera's field of view under varying light conditions. The results of the experiment show that the average error of the estimated object distance is 1.33 mm. Since this method complements the 2D information used for determining the object's location in Cartesian space, it can be applied to many robotic and autonomous systems applications.

Figure 2. Pixy2 camera mounted on the elbow joint of the manipulator

To generate training data, the actual distances measured were recorded alongside the image data generated by the Pixy2 camera. The image data consists of the two coordinates (x, y) and the width and height of the ripe tomato, which determine the area of the bounding box as shown in Figure 4. These were estimated by the Pixy2 camera.

ISSN: 2722-2586, IAES Int J Rob & Autom, Vol. 12, No. 4, December 2023: 325-331

Figure 3. A ripe tomato (object) placed sequentially in the camera's field of view

Figure 4. A captured ripe tomato bounded by a box in the image to obtain the trained image data for the computation of the object's distance

Figure 5. The graph of the actual distance against the area of the bounding box

Figure 6. Comparison of the estimated distance and actual distance of the object

Table 1. Data obtained from training the Pixy2 camera to estimate the positions of the ripe tomato when placed sequentially in the camera's field of view

Table 2. Result of the estimated distance and the average error