Detecting African hoofed animals in aerial imagery using convolutional neural networks

Received Aug 31, 2020; Revised Dec 1, 2020; Accepted Feb 12, 2021

Applications of small unmanned aerial vehicles have erupted in many fields, including conservation management. Automatic object detection methods for such aerial imagery are in high demand to facilitate more efficient and economical wildlife management and research. This paper aims to detect hoofed animals in aerial images taken from a quad-rotor in Southern Africa. Objects captured in this way are small, both in absolute pixels and in object-to-image ratio, which makes them a poor fit for general-purpose object detectors. We propose a method based on the iconic Faster region-based convolutional neural network (R-CNN) framework, with atrous convolution layers that retain the spatial resolution of the feature map to detect small objects. A good choice of anchors is of prime importance in detecting small objects. Compared with other object detection architectures, the proposed Faster R-CNN with atrous convolutional filters in the backbone network proved outstanding in our scenario.


INTRODUCTION
Unmanned aerial vehicles (UAVs), as a convenient and readily available data acquisition tool, have been applied in many wildlife conservation and research tasks. Automatic analysis of such aerial imagery is of significant importance as the volume of such data grows dramatically. Object detection forms the basis of many computer vision applications. Work has been done in detecting terrestrial [1], marine [2], and celestial [3, 4] species from aerial imagery in different environments. Africa holds a variety of unique hoofed wildlife species, a number of which are under threat, some at critical risk of extinction. UAVs combined with computer vision techniques can assist conservation workers and researchers to a great extent.
Computer vision applications differ from case to case depending on the scenario and the unique characteristics of the dataset. Challenges of object detection in aerial imagery were summarised in [5]: small object size, large scale variations, crowded instances and varied orientations. Among these, the most challenging is the size and scale problem. The UAV has to operate at a certain altitude to provide a large field of view and to avoid disturbing the target and other local species. This distance means the animals are captured at a small scale. The full image resolution should be large enough to retain the absolute object size in pixels, so that each object carries sufficient information. This makes the ratio of object size to full image size quite small compared to objects in ground-level imagery. To illustrate this, summaries of object size, measured by bounding-box area, are given in Figure 1. Our dataset is compared to the popular generic object detection datasets PASCAL-VOC [6] and MS-COCO [7], which contain everyday objects such as cars, pedestrians and pets. Figure 1 (a) shows the histogram of absolute object size in pixels, where object size is represented by the square root of the bounding-box area. The bins are normalized to compare the datasets on the same scale. The majority of objects in our wildlife dataset have an edge length no larger than 128 pixels. While the object size in our dataset is smaller than in the other datasets, the full image size is much larger. This makes the ratio of object size to full image size even smaller, as shown in Figure 1 (b). Figure 2 shows some example zebras cropped from the aerial imagery we captured in the wild. Figure 2 (a) shows the varied orientations of the zebras from the bird's-eye view. In Figure 2 (b), the zebras are difficult to distinguish from other animals due to the small scale and illumination conditions.
Deep learning frameworks such as the recurrent neural network (RNN) [8] and the convolutional neural network (CNN) [9] have boosted machine learning applications to another level in recent years. Research in fields like natural language processing (NLP) [10], machine translation [11], and computer vision [12, 13] has been dominated by deep learning. The word "deep" refers to the depth of information that the neural network extracts from the raw data. Modern CNN-based object detectors derive feature representations by stacking a sequence of convolutional layers over the input image. During this process, the feature map size is continuously reduced to extract more abstract information and allow translation invariance, normally achieved by strided convolution or max-pooling. Small objects are difficult to handle because their location information can be lost after down-sampling. To detect small objects, the feature map has to retain a reasonable resolution while remaining semantically expressive. For this purpose, we apply a sequence of atrous convolutional layers to keep the feature map at the desired resolution. This fits the single-layer representation that detects objects of all scales from one feature layer, such as the two-stage Faster region-based convolutional neural network (R-CNN). The matching quality between the anchors and the ground truth was found to be highly correlated with detection performance. A fine feature stride and a comprehensive set of anchors help improve detection performance, especially for small objects; however, this is a trade-off between detection accuracy and computation cost. Faster R-CNN [14] is the representative "two-stage" object detector, which trains a region proposal network (RPN) to generate object candidates. The candidates are then passed to another network for multi-class classification and bounding-box fine-tuning.
In the second stage, "ROI alignment" [15] ("ROI pooling" in earlier versions) crops the features for the object proposals and fits them to the same size. "Anchors" are a set of pre-defined bounding-boxes that serve as object proposals for the RPN. The output bounding-boxes are derived by predicting "offsets" to the anchors. During training, anchors generate positive and negative examples according to their intersection with the ground truth.
Skipping the proposal-generating process, the single shot detector (SSD) makes final predictions on class labels and bounding-box coordinate offsets directly from the feature maps. Unlike Faster R-CNN, which handles objects of all scales on the same feature map, SSD works on a hierarchical feature pyramid in which each feature layer is designated to objects of one scale. YOLO (you only look once) [16] divides the image into a grid, and each grid cell is responsible for predicting the objects whose bounding-box centre lies in that cell. The class label, confidence and bounding-box coordinates are integrated into a single regression problem, which gains processing speed. One obvious shortcoming is dealing with occluded objects whose centres lie in the same grid cell. Detecting small objects is also not easy, as the grid division is coarse. In upgraded versions of YOLO [17, 18], anchors were introduced to improve location prediction. Some recent work proposes to represent objects as coordinate points and make predictions by grouping the points [19, 20].
A fine feature map resolution is needed to improve detection performance on small objects. Methods for recovering spatial resolution while keeping semantic information were imported from image segmentation, which by nature requires dense prediction at the pixel level. A common practice is to use linear up-pooling or transposed convolution (also called "de-convolution") [21] after continuous down-sampling. The feature pyramid network (FPN) [22] laterally connects the up-sampled layers to the previous layers to reinforce the information, especially for the shallow layers. This flexible structure can serve as the backbone network for many detection schemes. For example, RetinaNet [23] is approximately a combination of SSD and FPN, with a modified loss to mitigate the influence of overwhelming numbers of easy negative examples.
In contrast to transposed convolution and up-pooling, atrous convolution (also called dilated convolution or the "hole" algorithm) does not down-sample the original image but applies a pyramid of atrous filters with different dilation rates to extract features at different scales. A dilated filter is a normal convolutional filter with zeros inserted between its weights; the dilation rate is the spacing of the inserted zeros, which controls the effective receptive field of the filter. This technique has been adopted in object detection on numerous occasions, such as road lane detection in [24] and bridge crack detection in [25]. In [26], atrous convolution was reported to improve detection performance on small objects, although at a different scale of "small": the authors used SSD, for which we found it difficult to match the anchors, and they constructed extra layers at the end of the CNN, whereas we apply atrous convolution in intermediate layers.
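The effect of the dilation rate on the receptive field can be made concrete with a small calculation (a minimal illustration of ours, not code from any cited work): a k-tap filter with dilation rate r spans (k - 1) * r + 1 input positions while keeping only k weights.

```python
def effective_kernel_size(k: int, rate: int) -> int:
    """Span of a k-tap filter dilated by `rate`: the filter covers
    (k - 1) * rate + 1 input positions, with zeros in the gaps."""
    return (k - 1) * rate + 1

# A 3x3 filter at dilation rate 2 covers a 5x5 area per axis
# while still using only 9 weights.
for rate in (1, 2, 4):
    print(rate, effective_kernel_size(3, rate))
```

This is why stacking dilated filters enlarges the receptive field without any down-sampling.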
In a context very similar to our work, African mammals were detected from aerial images in [27], where two sibling networks were constructed: one predicts the class probability for each feature map cell, the other outputs bounding-box coordinates. The images were cropped into small pieces and detection was made on each piece. This is a common practice for dealing with extremely high-resolution remote sensing images [28]. Inevitably, some objects will be cut into different parts, which is a problem when preparing training data and when stitching the patches back together to form a unified detection.

RESEARCH METHOD
2.1. Backbone network
ResNet [29] addressed the degradation problem of very deep neural networks, where accuracy saturates when the network goes too deep. By adding the input to the output, a very deep network gains detection accuracy. Figure 3 shows the residual unit, which performs a convolution operation with a "shortcut" connection between the input and output. The idea of residual learning is used in many other architectures, such as Darknet-53 of YOLO-v3 and Inception-v3 of GoogleNet [30]. The network can grow very deep by stacking convolutional layers; however, computational cost hampers the use of heavy architectures on the large images in our dataset. Herein, we use ResNet-50, which contains 50 convolutional layers.
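The shortcut connection of Figure 3 can be sketched in a few lines (an illustrative toy of ours, with a generic `transform` standing in for the convolutional branch):

```python
def residual_unit(x, transform):
    """Residual unit: the branch learns a residual F(x) and the
    shortcut adds the input back, so the output is F(x) + x."""
    fx = transform(x)
    return [a + b for a, b in zip(fx, x)]

# If the residual branch learns the zero function, the unit reduces
# to the identity mapping -- which is what lets very deep stacks of
# these units avoid the degradation problem.
zero_branch = lambda v: [0.0] * len(v)
print(residual_unit([1.0, 2.0, 3.0], zero_branch))  # [1.0, 2.0, 3.0]
```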
The main body of ResNet-50 contains 4 blocks composed of 3, 4, 6 and 3 of the residual units shown in Figure 3, respectively. Originally, the spatial resolutions of the blocks are 1/4, 1/8, 1/16 and 1/32 of the input image size. Instead of continuously reducing the spatial resolution after Block 2, the output stride is kept at 1/8 of the input image size. To continue extracting abstract features, all the 3x3 convolutional filters in Blocks 2, 3 and 4 are replaced by atrous convolutional filters. The essence of atrous convolution is to gather information from a larger area while skipping some positions in between by setting "holes" in the filter. For one-dimensional signals, the formulation is

y[i] = sum_{k=1}^{K} x[i + r*k] w[k],

where w is a filter of length K and r is called the dilation rate. A fine feature map spatial resolution helps avoid omitting the very small objects and makes the anchors better match the ground truth, but it is a trade-off, as the number of anchors increases exponentially on a feature map of doubled size. As the atrous convolutional layers are stacked consecutively and the abstraction level accumulates in a very deep network, we propose to use the same dilation rate of 2 throughout rather than progressively enlarging the dilation rate. An ablation on the dilation rate is presented in Section 5. The bracketed part of Figure 4 illustrates the modified ResNet-50 in detail, including layer sizes and depths; the rest depicts the flow of the features derived from the backbone network in Faster R-CNN. The output stride is kept at 8 for several reasons. One is that the smallest object size in our dataset is about 11x11, which nominally projects to at least one pixel on the feature map at an output stride of 8.
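The one-dimensional atrous convolution above can be verified with a short sketch (pure Python, 'valid' output positions only; our own illustration, not the paper's implementation):

```python
def atrous_conv1d(x, w, rate):
    """1-D atrous convolution: y[i] = sum_k x[i + rate*k] * w[k].
    Only 'valid' positions are produced for simplicity."""
    span = (len(w) - 1) * rate  # effective filter extent minus one
    return [sum(x[i + rate * k] * w[k] for k in range(len(w)))
            for i in range(len(x) - span)]

x = [1, 2, 3, 4, 5, 6]
w = [1, 1, 1]
print(atrous_conv1d(x, w, rate=1))  # dense filter: [6, 9, 12, 15]
print(atrous_conv1d(x, w, rate=2))  # dilated, wider span: [9, 12]
```

With rate=1 this reduces to an ordinary convolution; with rate=2 the same three weights cover a span of five input samples.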
Another is that small objects need a small stride step to better match the anchors, and the matching quality at each object scale has a great impact on the detection results. As small objects compose only a small portion of the dataset, misalignment between the anchors and the ground truth for even a few examples is not affordable for the small objects. The total number of anchors should also be considered; this is discussed later together with the detection architecture and anchor settings.

2.2. Detection architecture
The detection architecture follows Faster R-CNN. A 3x3 convolutional filter slides over the last layer of the backbone network and derives a feature map. Each cell of the feature map outputs predictions for the pre-defined set of anchors. With k anchors per cell, the RPN predicts 2k object/non-object scores and 4k coordinate offsets.
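The per-cell output sizes and the total anchor budget follow directly from the image size, output stride and anchor set. The sketch below (our own illustration; the specific numbers are only examples) counts them for a single-layer detector:

```python
def rpn_output_shapes(img_h, img_w, output_stride, scales, ratios):
    """Count RPN outputs on a single feature map: each cell holds
    k = len(scales) * len(ratios) anchors, giving 2k objectness
    scores and 4k box offsets per cell."""
    fh, fw = img_h // output_stride, img_w // output_stride
    k = len(scales) * len(ratios)
    return {"feature_map": (fh, fw),
            "anchors_total": fh * fw * k,
            "scores_per_cell": 2 * k,
            "offsets_per_cell": 4 * k}

# Example: a 1080x1920 frame at output stride 8 with 4 anchor
# scales and 3 aspect ratios.
print(rpn_output_shapes(1080, 1920, 8, [16, 32, 64, 128], [0.5, 1, 2]))
```

Halving the output stride quadruples the number of cells, and hence of anchors, which is the trade-off discussed above.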
Features of the object proposals generated by the RPN are derived by projecting the bounding-boxes onto the feature map. Through ROI alignment, the features are cropped to a size of 14x14 using bilinear interpolation and further down-sampled to 7x7, as shown in Figure 5. Features of different objects are thus aligned to the same size, pass through FC layers, and yield the final predictions. From Figure 1, the most common object size in our dataset lies between 48x48 and 64x64, which projects to only 6x6 to 8x8 on the feature map, far smaller than 14x14. We experimented with smaller ROI alignment sizes, such as 7x7 and 2x2, and did not achieve better results: dividing the features of a small object into fine-grained pieces helps describe the object in detail. Intersections over union (IoUs) between the anchors and the ground truth are used as the metric to select positive and negative examples. For other implementation details of Faster R-CNN, we recommend referring to the original work.

Figure 5. ROI alignment of the second stage; small objects also benefit from dividing features into fine-grained pieces
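The bilinear interpolation that ROI alignment relies on can be sketched for a single sample point (a simplified illustration of ours; real implementations sample several points per output bin and average them):

```python
def bilinear_sample(fmap, y, x):
    """Sample a 2-D feature map at a fractional location (y, x) by
    bilinear interpolation, as ROI alignment does for each output
    cell instead of quantizing to the nearest integer bin."""
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(fmap) - 1)
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bot = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bot * dy

fmap = [[0.0, 1.0],
        [2.0, 3.0]]
print(bilinear_sample(fmap, 0.5, 0.5))  # midpoint of the 4 values: 1.5
```

Avoiding quantization in this way matters most for small proposals, where rounding to the nearest feature cell would shift the box by a large fraction of the object.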

2.3. Anchor settings
Anchors are empirically chosen in most detection architectures. In Faster R-CNN, where predictions are made on a single feature map, the anchors must cover all object scales. In SSD, where predictions are made on multiple feature maps, each feature layer requires a specifically designed object scale and, overall, the anchors should match all object scales. In the original work of SSD, the authors used the following formulation to assign the anchor scale to each feature layer:

s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1),  k in [1, m],

where m is the number of feature layers, and s_min and s_max are the minimum and maximum anchor scales, assigned to the lowest and highest feature layers, respectively. Because the object scale in our dataset is very small and spread over a narrow range, as reflected in Figure 1 (b), manually choosing the anchors is more flexible. For example, the ratio of a 16x16 anchor box to a 1080x1920 image is 0.00012, making s_min a very small number. There are semi-automatic anchor assigning methods, such as using the centres of K-means clustering on the ground-truth boxes as the anchors. However, K-means clustering only reveals the statistical pattern of object sizes; it says nothing about the feature stride step or the matching quality between anchors and ground truth. The feature stride step is of great importance for small objects, and carefully balancing the feature map size against matching quality is needed for the various detection architectures.
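The scale-assignment rule above can be evaluated for SSD's published defaults (a small illustration; s_min = 0.2 and s_max = 0.9 over m = 6 layers are the values from the original SSD paper):

```python
def ssd_anchor_scales(m, s_min, s_max):
    """Original SSD scale assignment:
    s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1), k = 1..m."""
    return [round(s_min + (s_max - s_min) / (m - 1) * (k - 1), 4)
            for k in range(1, m + 1)]

# SSD's default scales are large relative to the tiny relative
# scales in this dataset (e.g. 16/1920, about 0.0083), which is why
# manual anchor choice is preferred here.
print(ssd_anchor_scales(6, 0.2, 0.9))
```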
An exhaustive search over hyper-parameters such as image size, output stride and anchor scales was implemented for both the single- and multi-layer object detectors. The number of anchors and the IoU matching quality at each object scale are the factors of interest. Table 1 shows some results for the single-layer detector: we summarize the portion of objects whose biggest IoU exceeds the thresholds 0.5, 0.6 and 0.7 under combinations of image size, output stride and anchor scales. The aspect ratios are set to [0.5, 1, 2]. Anchor scales are represented by edge length; for example, anchor scale 8 represents an anchor box of size 8x8. For the same image size, the anchor scales seem not to have much influence on the IoU matching quality, probably due to the small output stride step and the fine division of anchor scales. Table 2 shows some results on IoU matching quality for multi-layer detectors. The output stride level is the power of 2 by which the feature map size is reduced: feature layers of levels 2-6 are [2^2, 2^3, 2^4, 2^5, 2^6] times smaller than the input image size. An extra anchor scale between two adjacent scales is added when indicated by "intermediate scale".

RESULTS AND ANALYSIS
3.1. Dataset
The dataset was collected in semi-desert areas of southern Namibia using DJI Phantom 3 and Phantom 4 drones in December (summer in Namibia). We took numerous flights at different times of day over a week, and the duration of the videos amounted to several hours. Three species are covered in this dataset: blue wildebeest (gnu), gemsbok (oryx), and zebra. These species were chosen because they are widespread in southern Africa, providing good representation of the hoofed animals on this land. Additionally, they are gregarious, which makes it easy to record a number of instances in one shot, and which also challenges the detector with crowded and occluded objects.
The resolution of the images is 1080x1920. Frames were taken from the videos at fixed intervals of a few seconds and were divided randomly into training and testing sets: 1693 frames with around 20000 instances were used for training, and 389 frames with 4017 instances were used for testing. The flight height was 10-20 meters, with views of the objects from various angles. The environment and illumination conditions were diversified on purpose to allow generalization.

3.2. Backbone network
ResNet-50 was taken as the basic feature-extracting backbone network for its robustness and moderate size. To retain the output stride at 8, atrous convolutional layers replaced the normal convolutional layers and the stride-2 down-sampling operations. The last layer of Block 4 was taken as the feature map in the Faster R-CNN detection scheme. Figure 6 (a) is a schematic of the backbone network; the highlighted layers are the activation feature maps. The FPN structure recovers the spatial resolution of the feature map by up-sampling and lateral connections, as shown in Figure 6 (b). Both network structures can be used in a Faster R-CNN that makes predictions on one single feature map.
For multi-layer detectors, directly making predictions from the shallow layers is not feasible; it is better to enhance the semantic information by introducing an FPN structure. SSD takes the output layers of the FPN built upon ResNet-50, as shown in Figure 6.

3.3. Anchors
For small objects in large images, better matching of the anchors with the ground truth can only be achieved by an elaborate division of anchor scales and a fine-resolution feature map. To further examine the IoU matching across object scales, Figure 7 plots the biggest IoU of each ground-truth box against object size; the histogram shows the number of objects at each object scale. Examples are taken from Table 2, with the hyper-parameters given on top of each figure (Figure 7 (a) and (b)). Intermediate anchors are inserted in the right column compared to the left. For the single-layer detector in Table 1, the anchor scales are [16, 32, 64, 128] with the original image size. For the multi-layer detectors, Rows 4 and 6 of Table 2 have comparable IoU matching rates, and Row 6 generates fewer anchors by omitting the anchor scale of 16. However, the poorly matched objects are concentrated among the small objects: as shown in Figure 8, the objects with biggest IoU < 0.5 are all below 32x32. This hurts detection for small objects, so the hyper-parameters in Row 4 are chosen for SSD.

3.4. Model evaluation
The models were trained from scratch on the training set of our dataset. MS-COCO metrics were used to evaluate detection performance: only the overall mAP (mean average precision) and the mAP for large, medium, and small objects were considered, neglecting the corresponding mAR (mean average recall). Table 3 lists the evaluation results for the various detection models, including the two-stage Faster R-CNN and the one-stage SSD and YOLO-v3. The aspect ratios were all set to [0.5, 1, 2]. For Faster R-CNN, the backbone networks were based on ResNet-50. Hyper-parameters HP1 and HP2 define the image size and anchor scales; HP1 is the setting with the optimum anchor choice balancing the IoU matching rate and the total number of anchors. Under the same backbone network with atrous filters, HP1 outperforms HP2 because the IoU matching rate between the anchors and ground truth is higher. Rate1 and Rate2 define the different dilation rates. Using the same smallest dilation rate [2, 2, 2] led to better results than enlarging the dilation rate for each block of ResNet-50. This could be because the narrow range of object scales, combined with the elaborate division of the anchors, makes enlarging the "receptive field" unnecessary. HP2 uses an image size half that of the original image, which leads to a much lighter network, but its performance is inferior due to the worse anchor matching rate and the loss of information. SSD can only take the feature layers from an FPN structure; SHP, the hyper-parameter setting for SSD, is the optimum option analysed previously. Its detection performance is inferior to the two-stage Faster R-CNN at all object scales. A fine division of anchor scales is compulsory for data that are both absolutely and relatively small; however, designing a proper anchor set for a sequence of feature layers is not easy, since good IoU matching at both the small and the large ends of the object scales is difficult to meet at the same time.
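For reference, the average precision underlying these mAP numbers can be sketched for a single class at a single IoU threshold (a simplified illustration of ours; the COCO implementation additionally interpolates the precision-recall curve and averages over IoU thresholds):

```python
def average_precision(matches, num_gt):
    """AP from a score-ranked list of detections: matches[i] is True
    if the i-th detection hits a previously unmatched ground truth.
    Computes the area under the raw precision-recall curve."""
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for m in matches:
        tp, fp = tp + m, fp + (not m)
        recall = tp / num_gt
        precision = tp / (tp + fp)
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap

# Toy ranking: 3 detections, 2 ground truths; the second-ranked
# detection is a false positive, which lowers the AP.
print(average_precision([True, False, True], num_gt=2))
```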
YOLO-v3 outputs three feature layers; the anchor scales are set to [16, 64, 256]. Performance improved at all object scales over SSD with the FPN structure. Figure 9 gives visualizations of some good detection results: Figure 9 (a)-(c), (d)-(f), and (g)-(j) show results for zebra, oryx and blue wildebeest, respectively, in scenarios where the animals are clear and separated, cluttered and occluded by one another, and distant and small in size.

CONCLUSION
We addressed the problem of detecting African hoofed mammals in aerial imagery taken from UAVs. In a ResNet-50 backbone network, a sequence of atrous convolutional filters was used to keep the feature map resolution at a certain level while continuing to extract deeper, more abstract semantic features. Detection performance on such data is sensitive to the matching quality between the anchors and the ground truth; the image size, output stride and anchor choices together determine the IoU matching quality. This feature extraction technique proved robust in detecting small objects in comparison to FPN. The two-stage Faster R-CNN surpasses single-stage detectors in detecting small objects, especially for our dataset, where the object size is a small fraction of the full image.